Systems

From Research Computing Center Wiki
Latest revision as of 11:10, 1 August 2025



Sapelo2

Sapelo2 is a Linux cluster that runs the 64-bit Rocky Linux 9.5 operating system and is managed using Warewulf. Several virtual login nodes are available, each with Intel Xeon Gold 6230 processors, 16 cores, and 32GB of RAM. The queueing system on Sapelo2 is Slurm.

An EDR InfiniBand network (100Gbps) provides internodal communication among the compute nodes, and between these nodes and the storage systems serving the home and scratch directories.


The cluster currently comprises the following resources:

Regular nodes

  • 14 compute nodes with AMD EPYC (Genoa 4th gen) processors (128 cores and 745GB of RAM per node)
  • 120 compute nodes with AMD EPYC (Milan 3rd gen) processors (128 cores and 512GB of RAM per node)
  • 4 compute nodes with AMD EPYC (Milan 3rd gen) processors (64 cores and 256GB of RAM per node)
  • 2 compute nodes with AMD EPYC (Milan 3rd gen) processors (64 cores and 128GB of RAM per node)
  • 123 compute nodes with AMD EPYC (Rome 2nd gen) processors (64 cores and 128GB of RAM per node)
  • 50 compute nodes with AMD EPYC (Naples 1st gen) processors (32 cores and 128GB of RAM per node)
  • 42 compute nodes with Intel Xeon Skylake processors (32 cores and 192GB of RAM per node)
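
As a rough cross-check of the list above, the following Python sketch totals the regular nodes and their cores. The node counts and per-node figures are copied from the list; the variable names are illustrative only, and the high-memory, GPU, and buy-in nodes are deliberately left out.

```python
# Regular-node inventory copied from the list above:
# (processor label, node count, cores per node, RAM per node in GB)
REGULAR_NODES = [
    ("AMD EPYC Genoa", 14, 128, 745),
    ("AMD EPYC Milan", 120, 128, 512),
    ("AMD EPYC Milan", 4, 64, 256),
    ("AMD EPYC Milan", 2, 64, 128),
    ("AMD EPYC Rome", 123, 64, 128),
    ("AMD EPYC Naples", 50, 32, 128),
    ("Intel Xeon Skylake", 42, 32, 192),
]

# Sum the node counts, and the cores contributed by each group of nodes.
total_nodes = sum(count for _, count, _, _ in REGULAR_NODES)
total_cores = sum(count * cores for _, count, cores, _ in REGULAR_NODES)

print(f"regular nodes: {total_nodes}")  # 355
print(f"regular cores: {total_cores}")  # 28352
```

These totals describe only the configurations listed on this page and will drift as nodes are added or retired.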


High memory nodes (3TB/node)

  • 3 compute nodes with AMD EPYC (Genoa 4th gen) processors (48 cores and 3TB of RAM per node)


High memory nodes (2TB/node)

  • 2 compute nodes with AMD EPYC (Rome 2nd gen) processors (32 cores and 2TB of RAM per node)


High memory nodes (1TB/node)

  • 2 compute nodes with AMD EPYC (Milan 3rd gen) processors (128 cores and 1TB of RAM per node)
  • 12 compute nodes with AMD EPYC (Milan 3rd gen) processors (32 cores and 1TB of RAM per node)
  • 2 compute nodes with AMD EPYC (Naples 1st gen) processors (64 cores and 1TB of RAM per node)
  • 1 compute node with Intel Xeon Broadwell processors (28 cores and 1TB of RAM per node)


High memory nodes (512GB/node)

  • 18 compute nodes with AMD EPYC (Naples 1st gen) processors (32 cores and 512GB of RAM per node)


GPU nodes

  • 12 compute nodes with Intel Xeon Sapphire Rapids processors (64 cores and 1TB of RAM per node) and 4x NVIDIA H100 GPU cards per node
  • 12 compute nodes with AMD EPYC (Genoa 4th gen) processors (128 cores and 745GB of RAM per node) and 4x NVIDIA L4 GPU cards per node
  • 14 compute nodes with AMD EPYC (Milan 3rd gen) processors (64 cores and 1TB of RAM per node) and 4x NVIDIA A100 GPU cards per node
  • 2 compute nodes with Intel Xeon Skylake processors (32 cores and 187GB of RAM per node) and 1x NVIDIA P100 GPU card per node
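
The GPU counts in the list above can be tallied the same way. This is a minimal sketch that assumes "4x" means four cards in each node (the list states "per node" explicitly only for the P100 entry); the names used here are illustrative.

```python
# GPU-node inventory copied from the list above:
# (GPU model, node count, GPU cards per node).
# Assumption: "4x" in the list means four cards in every node of that type.
GPU_NODES = [
    ("H100", 12, 4),
    ("L4", 12, 4),
    ("A100", 14, 4),
    ("P100", 2, 1),
]

# Cards per model = number of nodes times cards in each node.
cards_by_model = {model: nodes * per_node for model, nodes, per_node in GPU_NODES}
total_cards = sum(cards_by_model.values())

for model, cards in cards_by_model.items():
    print(f"{model}: {cards} cards")
print(f"total: {total_cards} cards")
```

Under that assumption the listed nodes hold 48 H100, 48 L4, 56 A100, and 2 P100 cards, or 154 cards in total.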


Buy-in nodes

  • Various configurations


Connecting to Sapelo2

Transferring Files

Disk Storage

Software on Sapelo2

Available Toolchains and Toolchain Compatibility

Code Compilation on Sapelo2

Running Jobs on Sapelo2

Monitoring Jobs on Sapelo2

Migrating from Torque to Slurm

Training material

To help users familiarize themselves with Slurm and the test cluster environment, we have prepared training videos, available from the GACRC's Kaltura channel at https://kaltura.uga.edu/channel/GACRC/176125031 (login with MyID and password is required). Training sessions and slides are available at https://wiki.gacrc.uga.edu/wiki/Training





Teaching cluster

The teaching cluster is a Linux cluster that runs 64-bit Rocky Linux 8.8. The login node is a VM with 4 cores (Intel Xeon Gold 6230 processor) and 16GB of RAM. An EDR InfiniBand network (100Gbps) provides internodal communication among the compute nodes, and between the compute nodes and the storage systems serving the home and work directories.

The cluster currently comprises the following resources:

Regular nodes:

  • 10 compute nodes with AMD EPYC (Naples 1st gen) processors (32 cores and 128GB of RAM per node)

High-memory nodes:

  • 2 compute nodes with AMD EPYC (Naples 1st gen) processors (64 cores and 1TB of RAM per node)

GPU nodes:

  • 1 compute node with Intel Skylake processors (32 cores and 192GB of RAM) and 1x NVIDIA P100 GPU card

The queueing system on the teaching cluster is Slurm.

Connecting to the teaching cluster

Transferring Files

Software Installed on the teaching cluster

The teaching cluster has access to the same software stack installed on Sapelo2.

Code Compilation on the teaching cluster

Running Jobs on the teaching cluster

Monitoring Jobs on the teaching cluster