GPU Computing on Sapelo2

===Hardware===
For a description of the Graphics Processing Units (GPU) device specifications, please see GPU Hardware.

The following table summarizes the GPU devices available on Sapelo2:
{| class="wikitable"
|-
! scope="col" | Number of nodes
! scope="col" | CPU cores per node
! scope="col" | Host memory per node
! scope="col" | CPU processor
! scope="col" | GPU model
! scope="col" | GPU devices per node
! scope="col" | Device memory
! scope="col" | GPU compute capability
! scope="col" | Minimum CUDA version
! scope="col" | Partition Name
! scope="col" | Notes
|-
| 12 || 64 || 1TB || Intel Sapphire Rapids || H100 || 4 || 80GB || 9.0 || 11.8 || gpu_p, gpu_30d_p || Need to request --gres=gpu:H100, e.g.,
<nowiki>#</nowiki>SBATCH --partition=gpu_p

<nowiki>#</nowiki>SBATCH --gres=gpu:H100:1

<nowiki>#</nowiki>SBATCH --time=7-00:00:00
|-
| 14 || 64 || 1TB || AMD Milan || A100 || 4 || 80GB || 8.0 || 11.0 || gpu_p, gpu_30d_p || Need to request --gres=gpu:A100, e.g.,
<nowiki>#</nowiki>SBATCH --partition=gpu_p

<nowiki>#</nowiki>SBATCH --gres=gpu:A100:1

<nowiki>#</nowiki>SBATCH --time=7-00:00:00
|-
| 12 || 128 || 745GB || AMD Genoa || L4 || 4 || 24GB || 8.9 || 11.8 || gpu_p, gpu_30d_p || Need to request --gres=gpu:L4, e.g.,
<nowiki>#</nowiki>SBATCH --partition=gpu_p

<nowiki>#</nowiki>SBATCH --gres=gpu:L4:1

<nowiki>#</nowiki>SBATCH --time=7-00:00:00
|-
| 2 || 32 || 192GB || Intel Skylake || P100 || 1 || 16GB || 6.0 || 8.0 || gpu_p, gpu_30d_p || Need to request --gres=gpu:P100, e.g.,
<nowiki>#</nowiki>SBATCH --partition=gpu_p

<nowiki>#</nowiki>SBATCH --gres=gpu:P100:1

<nowiki>#</nowiki>SBATCH --time=7-00:00:00
|-
| 1 || 64 || 1TB || AMD Milan || A100 || 4 || 80GB || 8.0 || 11.0 || buyin partition || rowspan="8" | Available on '''batch''' for all users up to '''4 hours''', e.g.,
<nowiki>#</nowiki>SBATCH --partition=batch

<nowiki>#</nowiki>SBATCH --gres=gpu:A100:1 or

<nowiki>#</nowiki>SBATCH --gres=gpu:L4:1 or

<nowiki>#</nowiki>SBATCH --gres=gpu:V100:1 or

<nowiki>#</nowiki>SBATCH --gres=gpu:V100S:1

<nowiki>#</nowiki>SBATCH --time=4:00:00
|-
| 2 || 64 || 745GB || AMD Genoa || L4 || 4 || 24GB || 8.9 || 11.8 || buyin partition
|-
| 2 || 28 || 192GB || Intel Skylake || V100 || 1 || 16GB || 7.0 || 9.0 || buyin partition
|-
| 2 || 32 || 192GB || Intel Skylake || V100 || 1 || 16GB || 7.0 || 9.0 || buyin partition
|-
| 2 || 32 || 384GB || Intel Skylake || V100 || 1 || 32GB || 7.0 || 9.0 || buyin partition
|-
| 2 || 64 || 128GB || AMD Naples || V100 || 2 || 32GB || 7.0 || 9.0 || buyin partition
|-
| 1 || 64 || 128GB || AMD Naples || V100 || 1 || 32GB || 7.0 || 9.0 || buyin partition
|-
| 4 || 64 || 128GB || AMD Rome || V100S || 1 || 32GB || 7.0 || 9.0 || buyin partition
|}
'''Note:'''
1. The GPU compute capability of a device, also sometimes called its “SM version”, identifies the features supported by the GPU hardware. For more information, please see [https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capability NVIDIA compute capability].
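For reference, you can ask SLURM which GPU types are configured on the gpu_p nodes and, on a GPU node, ask the NVIDIA driver for the compute capability of each device. This is only a sketch; the compute_cap query field requires a relatively recent NVIDIA driver.
<pre class="gcommand">
# list the GPU generic resources (gres) configured on the gpu_p nodes
sinfo -p gpu_p -o "%N %G"

# on a GPU node, report each device's name and compute capability
nvidia-smi --query-gpu=name,compute_cap --format=csv
</pre>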


===Software===
Sapelo2 has several tools for GPU programming and many CUDA-enabled applications. For example:

'''1. NVIDIA CUDA toolkit'''

Several versions of the CUDA toolkit are available. Please see our [[CUDA-Sapelo2|CUDA]] page.
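As with the libraries below, the installed CUDA toolkit modules can be listed with the command
<pre class="gcommand">
ml spider CUDA
</pre>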
 
<!--
'''2. PGI/CUDA compilers'''

For information on versions of PGI compilers installed on Sapelo2, please see [[Code Compilation on Sapelo2]].
-->
'''2. cuDNN'''

The NVIDIA CUDA Deep Neural Network library (cuDNN) is a GPU-accelerated library of primitives for deep neural networks.

To see all cuDNN modules installed on Sapelo2, please use the command
<pre class="gcommand">
ml spider cuDNN
</pre>
'''3. NCCL'''

The NVIDIA Collective Communications Library (NCCL) implements multi-GPU and multi-node collective communication primitives that are performance optimized for NVIDIA GPUs.

To see all NCCL modules installed on Sapelo2, please use the command
<pre class="gcommand">
ml spider NCCL
</pre>
 
'''4. OpenACC'''

Using the NVIDIA HPC SDK compiler suite, provided by the NVHPC module on Sapelo2, programmers can accelerate applications on x64+accelerator platforms by adding OpenACC compiler directives to Fortran and C programs and then recompiling with appropriate compiler options. Please see https://developer.nvidia.com/hpc-sdk and http://www.pgroup.com/resources/accel.htm

OpenACC is also supported by GNU compilers, in particular the more recent versions installed on Sapelo2 (e.g. GNU 7.2.0 and later). For more information on OpenACC support by GNU compilers, please refer to https://gcc.gnu.org/wiki/OpenACC

For information on versions of compilers installed on Sapelo2, please see [[Code Compilation on Sapelo2]].
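As a minimal sketch of this workflow (the source and output file names below are just placeholders), a C program containing OpenACC directives could be compiled with the NVHPC compilers; the -acc option enables the OpenACC directives and -Minfo=accel prints a report of what was offloaded to the GPU:
<pre class="gcommand">
ml NVHPC

# compile an OpenACC-annotated C program for the GPU
nvc -acc -Minfo=accel my_openacc_code.c -o my_openacc_code
</pre>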
 
'''5. CUDA-enabled applications'''
 
CUDA-enabled applications typically have a version suffix in the module name to indicate the version of CUDA that they were built with.
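For example, to see which builds of an application are available and which CUDA version each one was built with, you can use (PyTorch is shown here purely as an illustration):
<pre class="gcommand">
ml spider PyTorch
</pre>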
 
==== Software modules that are supported on the H100 and L4 nodes ====
New modules that are installed centrally with CUDA version 12.1.1 or higher include support for GPU compute capability up to 9.0. Some examples are listed below, followed by a short usage check:
 
* PyTorch/2.1.2-foss-2023a-CUDA-12.1.1 (note that this version uses the foss-2023a toolchain)
 
* GROMACS/2023.3-foss-2023a-CUDA-12.1.1-PLUMED-2.9.0
       
* GROMACS/2023.4-foss-2023a-CUDA-12.1.1
 
* magma/2.7.2-foss-2023a-CUDA-12.1.1
 
* NCCL/2.18.3-GCCcore-12.3.0-CUDA-12.1.1
 
* torchvision/0.16.2-foss-2023a-CUDA-12.1.1
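
For instance, on an H100 or L4 node one of the modules above can be loaded and quickly checked as follows (a sketch; the Python one-liner only confirms that PyTorch can see the allocated GPU device):
<pre class="gcommand">
ml PyTorch/2.1.2-foss-2023a-CUDA-12.1.1

python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"
</pre>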
 
====Software modules not supported on the H100 and L4 nodes====
 
Some modules that use CUDA 12.1.1 were installed before the H100 and L4 devices were added to the cluster, and these only support GPU compute capability up to 8.0. Modules that do not work on the H100 and L4 nodes include:
 
* modules that use CUDA versions below 11.8.0
 
* PyTorch/2.1.2-foss-2022a-CUDA-12.1.1.lua (note that this version uses the foss-2022a toolchain)
 
* transformers/4.41.2-foss-2022a-PyTorch-2.1.2-CUDA-12.1.1.lua
 
* transformers/4.37.0-foss-2022a-PyTorch-2.1.2-CUDA-12.1.1.lua
 
* controlnet-aux/0.0.7-foss-2022a-PyTorch-2.1.2-CUDA-12.1.1.lua
 
* diffusers/0.25.1-foss-2022a-PyTorch-2.1.2-CUDA-12.1.1.lua
 
* torchvision/0.16.2-foss-2022a-CUDA-12.1.1.lua
 
* bitsandbytes/0.42.0-foss-2022a-PyTorch-2.1.2-CUDA-12.1.1.lua
 
* accelerate/0.26.1-foss-2022a-PyTorch-2.1.2-CUDA-12.1.1.lua
 
* timm/0.9.12-foss-2022a-CUDA-12.1.1.lua
 
* flash-attn/2.5.9.post1-foss-2022a-PyTorch-2.1.2-CUDA-12.1.1.lua
 
* magma/2.7.2-foss-2022a-CUDA-12.1.1
 
* NCCL/2.18.3-GCCcore-11.3.0-CUDA-12.1.1


===Running Jobs===
For information on how to run GPU jobs on Sapelo2, please refer to [[Running Jobs on Sapelo2]].

'''Important notes:'''
1. If a job requests
<pre class="gscript">
#SBATCH --partition=gpu_p
#SBATCH --gres=gpu:1
</pre>
then it can be allocated any GPU device type (i.e. P100, A100, L4, or H100). If you request a GPU device without specifying its type, please make sure that the application or code you are running works on all of these device types.
2. If the application that you are running uses an older version of CUDA, for example CUDA/11.4.1 or CUDA/11.7.0, please explicitly request a GPU device type that supports that CUDA version. For example, request an A100 device with
<pre class="gscript">
#SBATCH --partition=gpu_p
#SBATCH --gres=gpu:A100:1
</pre>
3. If the application or code that you are running does not need double-precision operations and does not need more than 24GB of GPU device memory, you may get faster job throughput by running it on an L4 device, which can be requested with
<pre class="gscript">
#SBATCH --partition=gpu_p
#SBATCH --gres=gpu:L4:1
</pre>
4. If the application or code that you are running is supported on the H100 device, you can request an H100 device with
<pre class="gscript">
#SBATCH --partition=gpu_p
#SBATCH --gres=gpu:H100:1
</pre>
5. If the application or code that you are running works on both an A100 and an H100 device, and you would like the job to be allocated to a node equipped with either of them, you can use these header lines for your job:
<pre class="gscript">
#SBATCH --partition=gpu_p
#SBATCH --gres=gpu:1
#SBATCH --constraint="SapphireRapids|Milan"
</pre>
By not specifying a GPU model when requesting a generic resource ("--gres"), you allow SLURM to allocate your job to any node in the GPU partition (assuming "--partition=gpu_p" is used) that has available resources. The GPU partition has nodes with GPUs other than A100s and H100s, so to prevent the job from running on a node with another GPU model (i.e. an L4 or P100), restrict your job to nodes with either a Milan or SapphireRapids processor. In the GPU partition, only nodes with an A100 or H100 GPU have Milan or SapphireRapids processors, respectively.
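
Putting these options together, a complete submission script for a GPU job might look like the following sketch; the job name, CPU core count, memory, walltime, module version, and program name are placeholders that should be adjusted to your own application:
<pre class="gscript">
#!/bin/bash
#SBATCH --job-name=gputest
#SBATCH --partition=gpu_p
#SBATCH --gres=gpu:A100:1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=40G
#SBATCH --time=02:00:00

# run from the directory the job was submitted from
cd $SLURM_SUBMIT_DIR

# load the modules your code needs (CUDA/11.7.0 is one of the versions mentioned above)
ml CUDA/11.7.0

# placeholder executable
./my_gpu_program
</pre>
The script can then be submitted with sbatch from the directory where it is saved.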
