What are the specifications of the cluster?
There are 32 nodes with 24 cores each, for 768 computational cores in total. Each node contains two Intel E5-2650 v4 12-core 2.2 GHz Xeon processors and 128GB of DDR4 ECC RAM (5.33GB of RAM per core). The master head node has two E5-2620 v4 8-core Xeon processors and 32GB of RAM; it should be used only for compiling code and setting up jobs, so plan how you divide your work across the compute nodes accordingly. The system runs CentOS 6.8 with the Intel 2017 Parallel Studio XE compilers as well as MATLAB 2017a.
The nodes are interconnected through a Mellanox SwitchX MSX6036F-1SFS 56Gbps InfiniBand switch for high-speed communication between nodes, as well as an EdgeCore 1 Gbps managed L3 Ethernet switch. The cluster is connected via a 1 Gbps link to a university-managed Cisco switch on a Washington University private subnet.
Each node also has a 240GB local SSD available for writing temporary files. If you want access to it, ask Hugh and he will create a /scratch/username folder for you. Include lines in your batch script that save the files you need and delete the ones you do not; files left on /scratch will be purged periodically.
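The save-then-clean pattern can be sketched as a fragment of a batch script. The directory and file names below are placeholders for demonstration only; on the cluster you would use your actual /scratch/username folder (ask Hugh to create it) and your job's real output files:

```shell
#!/bin/bash
# Sketch of scratch-space cleanup in a batch script. Paths are placeholders:
# on TELLUS, SCRATCH_DIR would be under /scratch/$USER on the node's local SSD.
SCRATCH_DIR="${SCRATCH_DIR:-./scratch_demo}"
RESULTS_DIR="${RESULTS_DIR:-./results_demo}"

mkdir -p "$SCRATCH_DIR" "$RESULTS_DIR"

# ... your program runs here, writing intermediate files to $SCRATCH_DIR ...
echo "checkpoint data" > "$SCRATCH_DIR/checkpoint.tmp"   # stand-in intermediate file
echo "final result"    > "$SCRATCH_DIR/result.dat"       # stand-in output file

# Save the files you need back to shared storage...
cp "$SCRATCH_DIR/result.dat" "$RESULTS_DIR/"

# ...and delete the ones you do not, so /scratch stays clean.
rm -rf "$SCRATCH_DIR"
```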
How can I access and login to the cluster?
The Linux cluster is named TELLUS and accepts SSH connections only from the Washington University campus network. If you are connecting from off campus, you must first log in through a campus VPN service (the Danforth Campus VPN) or via another externally accessible Linux server. You can then connect via SSH to the cluster’s specific IP address. Mac users can use the built-in Terminal; Windows users can download and use MobaXterm. Mac users should be sure to install XQuartz if they will display any graphical applications, and Linux and Mac users should use the “-X” or “-Y” flag of ssh to display graphical applications on their local machine.
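For example, from a Mac or Linux terminal (the address below is a placeholder; substitute the cluster's actual IP address):

```shell
$ ssh -Y username@<cluster-ip>    # -Y enables X11 forwarding for graphical applications
```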
Where is my home directory and how much storage do I have?
Your home directory is /home/<username> on the master node, on a shared 30TB RAID5 striped volume. There are no storage limits on this volume, but please clean up unneeded files periodically. This filesystem is not backed up, so anything you delete is gone forever. There is also a separate attached 3.84TB SSD where each user has space at /master-ssd/<username> for faster read/write performance. The SSD improves performance particularly for multi-process workloads with many simultaneous reads and writes to disk.
How do I transfer my files to the cluster?
Transfer files using the SFTP protocol, which is built into MobaXterm, CyberDuck, FileZilla, and command-line sftp. Users with accounts on the seismology servers will also find /P available via NFS on the cluster for easy transfer of files between terra1 and mantle.
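A minimal command-line sftp session might look like the following; the address and file names are placeholders:

```shell
$ sftp username@<cluster-ip>
sftp> put mycode.f90          # upload a file to your home directory
sftp> get results.tar.gz      # download a file to your local machine
sftp> quit
```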
How do I select among the different compilers on the cluster?
Most people will probably use the Intel compilers so you should include the line:
module load intel/parallel-studio-xe-2017
in your .bashrc or .cshrc and in all your sbatch scripts to set your environment. You can type module avail to see the compilers and environments that are available.
How do I optimize my code for the cluster?
To optimize your code for the cluster's specific processors, use the Intel -xCORE-AVX2 flag when compiling:
$ mpif90 -xCORE-AVX2 -o myexecutable mycode.f90
How do I run other applications on the cluster?
Additional packages are installed under /opt/local/ on the master and the nodes. Add source lines to your .bashrc to set the directories and paths needed to run them; the environment setup scripts are located in /opt/local/sys_logins. For example, to run MATLAB, source the MATLAB setup script from /opt/local/sys_logins in your .bashrc.
How do I submit jobs to the nodes?
All jobs must be submitted using the SLURM command sbatch. DO NOT run computationally intensive jobs directly on the master node! Create a batch script and submit it with sbatch to one of the available partitions. There are currently 3 separate partitions:
- sbatch – the default partition of 4 nodes, for general-use jobs requiring only one node
- seismo – a 6-node partition for slightly larger jobs requiring 1-3 nodes and longer run times
- xlarge22 – a 22-node partition for large jobs requiring many nodes that will run for a long time. Only use this partition if you really need it.
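As a rough sketch, a minimal submission script might look like the following; the job name, resource requests, and executable are placeholders, not a script known to be in use on this cluster:

```shell
#!/bin/bash
#SBATCH --job-name=myjob            # placeholder job name
#SBATCH --partition=seismo          # one of: sbatch, seismo, xlarge22
#SBATCH --nodes=2                   # number of nodes (stay within partition limits)
#SBATCH --ntasks-per-node=24        # one MPI task per core
#SBATCH --time=24:00:00             # wall-clock limit (HH:MM:SS)
#SBATCH --output=myjob_%j.out       # %j expands to the job ID

module load intel/parallel-studio-xe-2017   # set up the Intel environment
mpirun ./myexecutable                       # placeholder executable
```

Submit it with sbatch myscript.sh and monitor it with squeue.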
We have some example scripts available that people have used in the past. Let Hugh know if you would like to see some of these scripts.
What other SLURM commands are there?
- squeue: Show queue contents (what jobs are running, nodes used)
- sinfo: Show queue information
- scancel: Cancel a queued or running job
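For example (the job ID below is a placeholder; use the one reported by squeue):

```shell
$ squeue                    # list all running and pending jobs
$ squeue -u $USER           # list only your own jobs
$ sinfo                     # show partition and node status
$ scancel 12345             # cancel job 12345
```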
Can I check the status of the CLUSTER remotely?
If you are on campus or have VPN access you can enter the IP address of the cluster in a browser to see the current status.
General users and folks off campus without VPN access can see some general information here.
How do I get more help on the cluster?
For general account and usage information contact Hugh Chou. Other graduate students or post-doctoral associates may be more familiar with the actual code and algorithms for optimizing your particular code.