Systems
Sapelo2
Sapelo2 is a Linux cluster that runs a 64-bit CentOS 7.5 operating system and is managed using Foreman and Puppet. Two physical login nodes are available, each with Intel Xeon E5-2680 v3 (Haswell) processors, 24 cores, and 128GB of RAM.
For a subset of compute nodes, internodal communication, as well as communication between these nodes and the storage systems serving the home and scratch directories, is provided by a QDR InfiniBand network (40Gbps). For another subset of compute nodes, these communications are provided by an EDR InfiniBand network.
The cluster currently comprises the following resources:
Regular nodes
- 106 compute nodes with AMD Opteron processors (48 cores and 128GB of RAM per node)
- 22 compute nodes with AMD EPYC (Rome) processors (64 cores and 128GB of RAM per node)
- 16 compute nodes with AMD EPYC processors (32 cores and 128GB of RAM per node)
- 42 compute nodes with Intel Xeon Skylake processors (32 cores and 192GB of RAM per node)
- 32 compute nodes with Intel Xeon Broadwell processors (28 cores and 64GB of RAM per node)
- 4 compute nodes with AMD Opteron processors (48 cores and 256GB of RAM per node)
High memory nodes (1TB/node)
- 4 compute nodes with AMD EPYC processors (64 cores and 1TB of RAM per node)
- 4 compute nodes with Intel Xeon Broadwell processors (28 cores and 1TB of RAM per node)
- 1 compute node with AMD Opteron processors (48 cores and 1TB of RAM per node)
High memory nodes (512GB/node)
- 16 compute nodes with AMD EPYC processors (32 cores and 512GB of RAM per node)
- 6 compute nodes with AMD Opteron processors (48 cores and 512GB of RAM per node)
GPU nodes
- 4 compute nodes with Intel Xeon Skylake processors (32 cores and 187GB of RAM) and 1 NVIDIA P100 GPU card per node
- 2 compute nodes with Intel Xeon processors (16 cores and 128GB of RAM) and 8 NVIDIA K40m GPU cards per node
- 4 compute nodes with Intel Xeon processors (12 cores and 96GB of RAM) and 7 NVIDIA K20Xm GPU cards per node
Buy-in nodes
- Various configurations
The queueing system on Sapelo2 is Torque/Moab.
Connecting to Sapelo2
Transferring Files
Disk Storage
Software Installed on Sapelo2
Code Compilation on Sapelo2
Running Jobs on Sapelo2
Monitoring Jobs on Sapelo2
Slurm Test Cluster (Sap2test)
GACRC is planning to switch the queueing system on Sapelo2 from Torque/Moab to Slurm later this year. At the same time, we will update the cluster OS from CentOS 7.5 to CentOS 7.8, along with the compiler toolchains and the application software packages. Older versions of applications currently installed on Sapelo2 will only be installed on the updated cluster if necessary, upon user request.
In preparation for implementing this major change in the Fall, we are deploying a Slurm development (dev) cluster that will be available ahead of time. The goal is to give users an environment in which to modify their workflow scripts to use Slurm, and possibly to use newer versions of the applications, prior to the major change. All job submission scripts will need to be changed, because Slurm uses a different syntax from Torque/Moab, as summarized in Migrating from Torque to Slurm. We strongly encourage everyone to fully test their ported workflow scripts on the Slurm dev cluster, to ensure a smooth transition to the new system later in the year.
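As a minimal illustration of the syntax differences (the partition, module, and application names below are placeholders for this sketch, not actual Sapelo2 or Sap2test settings; please see Migrating from Torque to Slurm and the sample script pages for the real options), a simple single-node Torque script and a possible Slurm equivalent might look like this:

 #!/bin/bash
 # Torque/Moab version (submitted with qsub):
 #PBS -q batch                 # queue name (placeholder)
 #PBS -N myjob                 # job name
 #PBS -l nodes=1:ppn=4         # 1 node, 4 cores
 #PBS -l walltime=2:00:00      # 2 hours of walltime
 #PBS -l mem=8gb               # memory request
 cd $PBS_O_WORKDIR
 module load SomeApp           # placeholder module name
 someapp input.txt             # placeholder application

 #!/bin/bash
 # Slurm version (submitted with sbatch):
 #SBATCH --partition=batch     # partition name (placeholder)
 #SBATCH --job-name=myjob      # job name
 #SBATCH --ntasks=1            # 1 task
 #SBATCH --cpus-per-task=4     # 4 cores
 #SBATCH --time=2:00:00        # 2 hours of walltime
 #SBATCH --mem=8G              # memory request
 cd $SLURM_SUBMIT_DIR
 module load SomeApp           # placeholder module name
 someapp input.txt             # placeholder application

The Slurm script is submitted with sbatch instead of qsub, and common monitoring commands change accordingly (for example, squeue instead of qstat).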
This dev cluster is intended to allow users to port their workflow scripts to Slurm; it is not a platform for running jobs extensively. This dev cluster currently has the following resources (a note on inspecting them with sinfo follows the list below):
Regular nodes
- 40 compute nodes with AMD Opteron processors (48 cores, 128GB RAM per node)
- 24 compute nodes with AMD EPYC processors (64 cores, 128GB RAM per node)
- 6 compute nodes with AMD EPYC processors (32 cores, 128GB RAM per node)
- 4 compute nodes with AMD Opteron processors (48 cores, 256GB RAM per node)
- 1 compute node with Intel Broadwell processors (28 cores, 64GB RAM per node)
- 1 compute node with Intel Skylake processors (32 cores, 192GB RAM per node)
High memory nodes (512GB)
- 2 compute nodes with AMD EPYC processors (32 cores, 512GB RAM per node)
- 4 compute nodes with AMD Opteron processors (48 cores, 512GB RAM per node)
GPU node
- 1 compute node with Intel Skylake processors (32 cores, 192GB RAM) and 1 NVIDIA P100 GPU card
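Once you are connected to the dev cluster, the standard Slurm sinfo command can be used to see the partitions and nodes that are currently available, for example:

 sinfo          # summary of partitions and node states
 sinfo -N -l    # one line per node, with CPU and memory details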
Storage
The user's home directory (/home), scratch directory (/scratch), and each group's work directory (/work) on the Slurm test cluster are the same file systems as on Sapelo2, so there is no need to transfer data between Sapelo2 and the Slurm test cluster. If you have Sapelo2-specific settings in your dotfiles (for example in .bashrc or in software-specific configuration files), those might need to be changed when you work on Sap2test (see the sketch at the end of this section).
However, Sapelo2's /usr/local file system, and therefore the applications installed on Sapelo2, are not available on the Slurm test cluster.
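If you prefer to keep both environments in one set of dotfiles, one possible approach is to branch on the hostname; the fragment below is only an illustrative sketch, and the hostname patterns in it are assumptions rather than the actual login node names:

 # Illustrative .bashrc fragment; the hostname patterns are assumptions.
 case "$(hostname)" in
   *sap2test*)
     # settings specific to the Slurm test cluster
     ;;
   *)
     # settings specific to Sapelo2
     ;;
 esac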
Connecting to the Slurm test cluster
Sapelo2 and Sap2test comparison
Software Installed on the Slurm test cluster
Code Compilation on the Slurm test cluster
Available Toolchains and Toolchain Compatibility
Running Jobs on the Slurm test cluster
Monitoring Jobs on the Slurm test cluster
Sample batch job submission scripts on the Slurm test cluster
Migrating from Torque to Slurm
Teaching cluster
The teaching cluster is a Linux cluster that runs a 64-bit CentOS 7.5 operating system. The physical login node has two 6-core Intel Xeon E5-2620 processors and 128GB of RAM, and it runs Red Hat EL 7.5. An Ethernet network (1Gbps) provides internodal communication among the compute nodes, and between the compute nodes and the storage systems serving the home directories and the work directories.
The cluster currently comprises the following resources:
- 37 compute nodes with Intel Xeon X5650 2.67GHz processors (12 cores and 48GB of RAM per node)
- 2 compute nodes with Intel Xeon E5504 2.00GHz processors (8 cores and 48GB of RAM per node)
- 3 compute nodes with Intel Xeon E5504 2.00GHz processors (8 cores and 192GB of RAM per node)
- 2 compute nodes with AMD Opteron 6174 processors (48 cores and 128GB of RAM per node)
- 3 compute nodes with AMD Opteron 6128 HE 2.00GHz processors (32 cores and 64GB of RAM per node)
- 6 NVIDIA Tesla (Fermi) M2070 GPU cards (6 x 448 = 2688 GPU cores). These cards are installed on one host that has dual 6-core Intel Xeon CPUs and 48GB of RAM
The queueing system on the teaching cluster is Slurm.
Connecting to the teaching cluster
Transferring Files
Disk Storage
Software Installed on the teaching cluster
The list of installed applications is available on the Software page.