Running Jobs on Sapelo2: Difference between revisions

From Research Computing Center Wiki
 
(157 intermediate revisions by 6 users not shown)
Line 1: Line 1:
[[Category:Sapelo2]]
[[Category:Sapelo2]]
'''Note: This page is for using the new queueing system on the Sapelo2 cluster. This page is still under development as of June 18, 2020.'''
If you are a current Sapelo2 user, please refer to [[Running Jobs on Sapelo2]] for instructions on how to run jobs on Sapelo2.


===Using the Queueing System===
===Using the Queueing System===
Line 15: Line 10:
[[#top|Back to Top]]
[[#top|Back to Top]]


===Batch Queues defined on the Sapelo2===
===Batch partitions (queues) defined on the Sapelo2===


There are different queues defined on Sapelo2. The Slurm queueing system refers to queues as partition. Users are required to specify, in the job submission script or as job submission command line arguments, the queue and the resources needed by the job in order for it to be assigned to compute node(s) that have enough available resources (such as number of cores, amount of memory, GPU cards, etc). Please note, Slurm will not allow a job to be submitted if there are no resources matching your request. Please refer to [[Migrating from Torque to Slurm]] for more info about Slurm queueing system.
Several partitions are defined on Sapelo2. The Slurm queueing system refers to queues as partitions. Users are required to specify, in the job submission script or as job submission command line arguments, the partition and the resources needed by the job in order for it to be assigned to compute node(s) that have enough available resources (such as number of cores, amount of memory, GPU cards, etc). Please note, Slurm will not allow a job to be submitted if there are no resources matching your request. Please refer to [[Migrating from Torque to Slurm]] for more info about the Slurm queueing system.
 
The following partitions are defined on the Sapelo2 cluster:
 
{|  width="100%" border="1"  cellspacing="0" cellpadding="2" align="center" class="wikitable unsortable"
|-
! scope="col" | Partition Name
! scope="col" | Time limit
! scope="col" | Max jobs
! scope="col" | Notes
|-
|-
| batch || 7 days ||  || Regular nodes.
|-
| batch_30d || 30 days || 2 || Regular nodes. A given user can have up to one job running at a time here, plus one pending, or two pending and none running. A user's attempt to submit a third job into this partition will be rejected.
|-
| highmem_p || 7 days ||  || For high memory jobs
|-
| highmem_30d_p || 30 days || 2 || For high memory jobs. A given user can have up to one job running at a time here, plus one pending, or two pending and none running. A user's attempt to submit a third job into this partition will be rejected.
|-
|hugemem_p || 7 days ||4 || For jobs needing up to 3TB of memory.
|-
|hugemem_30d_p || 30 days || 4 || For jobs needing up to 3TB of memory.
|-
| gpu_p || 7 days ||  || For GPU-enabled jobs.
|-
| gpu_30d_p || 30 days || 2 || For GPU-enabled jobs. A given user can have up to one job running at a time here, plus one pending, or two pending and none running. A user's attempt to submit a third job into this partition will be rejected.
|-
| inter_p || 2 days ||  || Regular nodes, for interactive jobs.
|-
| '''name'''_p || variable ||  || Partitions that target different groups' buy-in nodes. The '''name''' string is specific to each group.
|-
| scavenge_p || 2 hours ||  || Partition that targets the buy-in nodes. When there are no available resources in the batch partition, short jobs submitted there might be automatically transferred into scavenge_p, to run on idle buy-in resources. Jobs cannot be submitted directly to this partition.
|-
|}
For more detailed information about the partitions, please see [[Job Submission partitions on Sapelo2]].


The table below summarizes the partitions (queues) defined and the compute nodes that they target:
The table below summarizes the partitions (queues) defined and the compute nodes that they target:
{|  width="100%" border="1"  cellspacing="0" cellpadding="2" align="center" class="wikitable unsortable"
{|  width="100%" border="1"  cellspacing="0" cellpadding="2" align="center" class="wikitable unsortable"
|-
|-
! scope="col" | Queue Name
! scope="col" | Partition Name
! scope="col" | Node Type
! scope="col" | Node Features
! scope="col" | Node Number
! scope="col" | Node Number
! scope="col" | Description
! scope="col" | Description
! scope="col" | Memory for jobs
! scope="col" | Notes
! scope="col" | Notes
|-
|-
|-
|-
| (to be added)
| batch, batch_30d || AMD, Opteron, QDR ||  || 48-core, 128GB RAM, AMD Opteron, QDR IB interconnect || 122GB || Regular nodes.
|-
| batch, batch_30d  || AMD, EPYC, EDR ||  || 64-core, 128GB RAM, AMD EPYC, IB EDR interconnect || 120GB || Regular nodes
|-
| batch, batch_30d  || AMD, EPYC, EDR ||  || 32-core, 128GB RAM, AMD EPYC, IB EDR interconnect || 120GB || Regular nodes
|-
| batch, batch_30d  || AMD, Opteron, QDR ||  || 48-core, 256GB RAM, AMD Opteron, QDR IB interconnect || 250GB || Regular nodes.
|-
| batch, batch_30d  || Intel, Skylake, EDR ||  || 32-core, 192GB RAM, Intel Xeon Skylake, IB EDR interconnect || 180GB || Regular nodes
|-
| batch, batch_30d  || Intel, Broadwell, EDR ||  || 28-core, 64GB RAM, Intel Xeon Broadwell, IB EDR interconnect || 58GB || Regular nodes
|-
| highmem_p, highmem_30d_p || AMD, EPYC, EDR ||  || 64-core, 1TB RAM, AMD EPYC, IB EDR interconnect || 950GB || For high memory jobs
|-
| highmem_p, highmem_30d_p || Intel, EDR ||  || 32-core, 1TB RAM, Intel, IB EDR interconnect || 950GB || For high memory jobs
|-
| highmem_p, highmem_30d_p || AMD, Opteron, EDR ||  || 48-core, 1TB RAM, AMD Opteron, IB EDR interconnect || 950GB || For high memory jobs
|-
| highmem_p, highmem_30d_p || AMD, Opteron, QDR ||  || 48-core, 512GB RAM, AMD Opteron, IB QDR interconnect || 500GB || For high memory jobs
|-
| highmem_p, highmem_30d_p || AMD, EPYC, EDR ||  || 32-core, 512GB RAM, AMD EPYC, IB EDR interconnect || 490GB || For high memory jobs
|-
| hugemem_p, hugemem_30d_p || AMD, EPYC, EDR ||  || 32-core, 2TB RAM, AMD EPYC, IB EDR interconnect || 2000GB || For high memory jobs
|-
|hugemem_p, hugemem_30d_p
|AMD, EPYC, EDR
|
|48-core, 3TB RAM, AMD EPYC, IB EDR interconnect
|3000GB
|For high memory jobs
|-
|-
| gpu_p, gpu_30d_p || GPU, A100, EDR ||  || 64-core, 1000GB RAM, AMD EPYC, 4 NVIDIA A100 GPUs, EDR IB interconnect  || 1000GB || For GPU-enabled jobs.
|-
| gpu_p, gpu_30d_p || GPU, P100, EDR ||  || 32-core, 192GB RAM, Intel Xeon Skylake, 1 NVIDIA P100 GPU, EDR IB interconnect  || 180GB || For GPU-enabled jobs.
|-
| gpu_p, gpu_30d_p || GPU, K40, QDR ||  || 16-core, 128GB RAM, Intel Xeon, 8 NVIDIA K40 GPUs, QDR IB interconnect  || 120GB || For GPU-enabled jobs.
|-
| gpu_p, gpu_30d_p || GPU, K20, QDR ||  || 12-core, 96GB RAM, Intel Xeon, 7 NVIDIA K20Xm GPUs, QDR IB interconnect  || 70GB || For GPU-enabled jobs.
|-
|}
|}


Line 43: Line 113:
===Job submission Scripts===
===Job submission Scripts===


Users are required to specify the number of cores, the amount of memory, the queue name, and the maximum wallclock time needed by the job.
Users are required to specify the number of cores, the amount of memory, the partition (queue) name, and the maximum wallclock time needed by the job.


====Header lines====
====Header lines====
Line 55: Line 125:
#SBATCH --job-name=test
#SBATCH --job-name=test
#SBATCH --ntasks=1
#SBATCH --ntasks=1
#SBATCH --time=48:00:00
#SBATCH --time=4:00:00
#SBATCH --mem=10gb
#SBATCH --mem=10G
</pre>
</pre>


Line 63: Line 133:
'''Header lines explained:'''
'''Header lines explained:'''


* '''#!/bin/bash''': used to specify Linux default bash shell
* '''#!/bin/bash''': specify bash, the default Linux shell, as the interpreter for the script
* '''#SBATCH --partition=batch''' : used to specify the partition (queue) name, e.g. ''batch''
* '''#SBATCH --partition=batch''' : specify the partition (queue) to run on, e.g. ''batch''
* '''#SBATCH --job-name=test''' : used to specify the name of the job, e.g. ''test''
* '''#SBATCH --job-name=test''' : specify the job name, e.g. ''test''
* '''#SBATCH --ntasks=1''' : used to specify the number of tasks (e.g. 1).
* '''#SBATCH --ntasks=1''' : specify the number of tasks (e.g. 1)
* '''#SBATCH --time=48:00:00''' : used to specify the maximum allowed wall clock time in dd:hh:mm:ss format for the job (e.g 48 hours).
* '''#SBATCH --time=4:00:00''' : specify the maximum walltime of the job in the format D-HH:MM:SS (e.g. --time=1- for one day or --time=4:00:00 for 4 hours)
* '''#SBATCH --mem=10gb''' : used to specify the maximum memory allowed for the job (e.g. 10GB)
* '''#SBATCH --mem=10G''' : specify the maximum memory per node required by the job (e.g. 10GB)
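To double-check a walltime request before submitting, the value given to --time can be converted into seconds. The function below is a minimal sketch and not part of Slurm: the name <code>slurm_time_to_seconds</code> is hypothetical, and it assumes the input is already in one of the accepted forms (minutes, minutes:seconds, hours:minutes:seconds, days-hours, days-hours:minutes, days-hours:minutes:seconds).

```shell
#!/bin/bash
# Hypothetical helper (not a Slurm command): convert a --time value to seconds.
slurm_time_to_seconds() {
    echo "$1" | awk '{
        days = 0; rest = $0
        dash = index($0, "-")
        if (dash > 0) {
            # Split "D-..." into the day count and the remainder
            days = substr($0, 1, dash - 1)
            rest = substr($0, dash + 1)
        }
        n = split(rest, f, ":")
        if (dash > 0)    { h = f[1]; m = f[2]; s = f[3] }   # D-H, D-H:M or D-H:M:S
        else if (n == 3) { h = f[1]; m = f[2]; s = f[3] }   # H:M:S
        else if (n == 2) { h = 0;    m = f[1]; s = f[2] }   # M:S
        else             { h = 0;    m = f[1]; s = 0 }      # M
        # Missing fields are empty strings, which awk treats as 0
        print days * 86400 + h * 3600 + m * 60 + s
    }'
}

slurm_time_to_seconds 4:00:00      # the 4-hour example from the header lines
slurm_time_to_seconds 2-00:00:00   # two days
```

The result can be compared against the time limit of the target partition (for example, 7 days is 604800 seconds).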
 


Below are some of the most commonly used queueing system options to configure the job.
Below are some of the most commonly used queueing system options to configure the job.
Line 76: Line 145:


* -t, --time=time
* -t, --time=time
     Wall clock time limit of a job running on cluster. Acceptable formats include "minutes", "minutes:seconds", "hours:minutes:seconds", "days-hours", "days-hours:minutes", and "days-hours:minutes:seconds".
     Wall clock time limit of a job running on cluster. Acceptable formats include "minutes", "minutes:seconds", "hours:minutes:seconds", "days-hours", "days-hours:minutes", and "days-hours:minutes:seconds". '''This is a required option.'''


* --mem=num
* --mem=num
     Maximum amount of memory in MegaBytes per node required by the job.
     Maximum amount of memory in MegaBytes per node required by the job. Different units can be specified using the suffix [K|M|G|T].


* --mem-per-cpu=num
* --mem-per-cpu=num
     Minimum amount of memory in MegaBytes per allocated CPU.
     Minimum amount of memory in MegaBytes per allocated CPU. Different units can be specified using the suffix [K|M|G|T].


* -n, --ntasks=num
* -n, --ntasks=num
     Number of tasks to run. The default is one task per node.
     Number of tasks to run. The default is one task per node. For use with distributed parallelism. See below.


* -N, --nodes=num
* -N, --nodes=num
     Number of nodes be allocated to the job. Default is one node.
     Number of nodes allocated to the job. Default is one node.  


* --ntasks-per-node=ntasks
* --ntasks-per-node=num
     Request that ntasks be invoked on each node. Meant to be used with the --nodes option.
     Number of tasks invoked on each node. Meant to be used with the --nodes option. For use with distributed parallelism. See below.


* -c, --cpus-per-task=ncpus
* -c, --cpus-per-task=ncpus
     Request that ncpus be allocated per process. This may be useful if the job is multithreaded and requires more than one CPU per task for optimal performance. The default is one CPU per process.
     Number of CPUs allocated to each task. For use with shared memory parallelism. See below.
 
* -C, --constraint=<list>
    List of node features required by the job. Only nodes having features matching the job constraints will be used to satisfy the request. Multiple constraints may be specified with AND, OR, matching OR, resource counts, etc.
 
* --gres=<list>
    A comma delimited list of generic consumable resources. For example, to request one P100 GPU card: --gres=gpu:P100:1
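As an illustration of the last two options, the header fragments below show one way they might be written. This is a sketch only: the feature names (EPYC, EDR) are taken from the Node Features column of the partition table, and the <code>&</code> operator is Slurm's syntax for requiring both features (AND). Check the features actually defined on the nodes before relying on them.

```shell
# Fragment 1: restrict a job in the batch partition to nodes that have
# both the EPYC and the EDR features (AND is written with "&").
#SBATCH --partition=batch
#SBATCH --constraint="EPYC&EDR"

# Fragment 2: request one P100 GPU card in the gpu_p partition.
#SBATCH --partition=gpu_p
#SBATCH --gres=gpu:P100:1
```

These lines are two separate header fragments, not a complete job script; a real script also needs the required options such as --time and --mem.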




Line 115: Line 190:
* --mail-type=type
* --mail-type=type
     Notify user by email when certain event types occur. Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL, TIME_LIMIT, TIME_LIMIT_90 (reached 90 percent of time limit), TIME_LIMIT_80 and TIME_LIMIT_50.
     Notify user by email when certain event types occur. Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL, TIME_LIMIT, TIME_LIMIT_90 (reached 90 percent of time limit), TIME_LIMIT_80 and TIME_LIMIT_50.
By default, email notifications set for an array job will generate one email message for the array job. If you would like to receive an email message for individual array job elements (up to a certain limit), please add ARRAY_TASKS to the --mail-type option.


====Options to set Array Jobs====
====Options to set Array Jobs====
Line 126: Line 203:
</pre>
</pre>


The ID of each element in an array job, i.e., job array index value, is stored in '''SLURM_ARRAY_TASK_ID'''. '''SLURM_ARRAY_JOB_ID''' will be set to the first job ID of the array. '''SLURM_ARRAY_TASK_COUNT''' will be set to the number of tasks in the job array. '''SLURM_ARRAY_TASK_MAX''' will be set to the highest job array index value. '''SLURM_ARRAY_TASK_MIN''' will be set to the lowest job array index value. Each array job element runs as an independent job, so multiple array elements can run concurrently, if resources are available. For example:
Each array job element runs as an independent job, so multiple elements can run concurrently if resources are available. For this reason, the job ID stored in SLURM_JOB_ID is different and unique for each element of an array job. The ID of each element in an array job, i.e., the array element index value, is stored in SLURM_ARRAY_TASK_ID. The ID of the array job as a whole is stored in SLURM_ARRAY_JOB_ID, so it is the same for all elements of the array job. The JobID reported by the sq command is a combination of SLURM_ARRAY_JOB_ID and SLURM_ARRAY_TASK_ID connected by "_". 


<pre class="gscript">
<pre class="gscript">
Line 158: Line 235:


Most Slurm commands recognize the SLURM_ARRAY_JOB_ID plus SLURM_ARRAY_TASK_ID values separated by an underscore as identifying an element of a job array. For example, 36_2 identifies the element with index 2 of array job 36.
Most Slurm commands recognize the SLURM_ARRAY_JOB_ID plus SLURM_ARRAY_TASK_ID values separated by an underscore as identifying an element of a job array. For example, 36_2 identifies the element with index 2 of array job 36.
For more information, please see [[Array Jobs]].
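The combined identifier can also be built inside a job script, for instance to label log messages. The snippet below is a sketch: the function name <code>array_element_id</code> is hypothetical, and the placeholder values 36 and 2 are used only when the script runs outside of an array job, where the Slurm variables are unset.

```shell
#!/bin/bash
# Hypothetical helper: print the "<array job ID>_<element index>" identifier.
# The defaults (36 and 2) are placeholders for illustration only.
array_element_id() {
    echo "${SLURM_ARRAY_JOB_ID:-36}_${SLURM_ARRAY_TASK_ID:-2}"
}

echo "Running array element $(array_element_id)"
```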


====Option to set job dependency====
====Option to set job dependency====
Line 169: Line 248:
<pre class="gscript">
<pre class="gscript">
#SBATCH --dependency=after:1236:1237
#SBATCH --dependency=after:1236:1237
</pre>
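When the job IDs are only known at submission time, the value passed to --dependency can be assembled in a small wrapper script. The helper below is an illustrative sketch; the function name <code>dependency_spec</code> is hypothetical, and the job IDs are the ones from the example above.

```shell
#!/bin/bash
# Hypothetical helper: join a dependency type and one or more job IDs
# with ":" to form the value expected by --dependency.
dependency_spec() {
    dep_type=$1
    shift
    spec=$dep_type
    for job_id in "$@"; do
        spec="${spec}:${job_id}"
    done
    echo "$spec"
}

# Prints the same header line as in the example above
echo "#SBATCH --dependency=$(dependency_spec after 1236 1237)"
```

The same string could presumably be passed on the command line instead, e.g. <code>sbatch --dependency="$(dependency_spec after 1236 1237)" sub.sh</code>.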
====Options to requeue or not requeue a job when a node crashes====
If a job is running and one or more nodes that it is using crash, the job will stop running and, by default, it will get requeued. When resources become available, the job will start running again, from the beginning, unless the program saves intermediate results and it is able to automatically pick up from where it stopped. The files with the standard error and standard output of the job will get rewritten once the job restarts. Often other output files will get rewritten as well.
If you are running a program that cannot restart, e.g. the program will fail if a certain output file or directory has already been created, or if you would like to preserve the partial results, you can use the following option to prevent the job from being requeued:
<pre class="gscript">
#SBATCH --no-requeue
</pre>
When this option is used, the job will simply stop if a node crashes; it will not be requeued. In this case partial results and the standard error and output of the job will not get overwritten.
Although requeueing jobs is enabled by default now, you can also add the option below if you would like to ensure a job is requeued in the event of a node crash:
<pre class="gscript">
#SBATCH --requeue
</pre>
</pre>
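A job script can also check whether it is running for the first time or after a requeue, and act accordingly (for example, by keeping earlier output files). The sketch below relies on the SLURM_RESTART_COUNT variable, which Slurm sets for jobs that have been requeued or restarted; the function name <code>job_start_mode</code> is hypothetical, and what to do on a restart depends on the program being run.

```shell
#!/bin/bash
# Hypothetical helper: report whether this is the first run of the job
# or a run after a requeue. SLURM_RESTART_COUNT is unset on the first run,
# so it defaults to 0 here.
job_start_mode() {
    if [ "${SLURM_RESTART_COUNT:-0}" -gt 0 ]; then
        echo "restart"
    else
        echo "first-run"
    fi
}

if [ "$(job_start_mode)" = "restart" ]; then
    echo "Job was requeued ${SLURM_RESTART_COUNT} time(s); keep or rename earlier results here."
else
    echo "First run of this job."
fi
```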


Line 177: Line 271:
cd $SLURM_SUBMIT_DIR
cd $SLURM_SUBMIT_DIR
</pre>
</pre>
(Note that Slurm jobs start from the submit directory by default, so adding the line above might not be necessary.)
You can then load the needed modules. For example, if you are running an R program, then include the line
You can then load the needed modules. For example, if you are running an R program, then include the line
<pre class="gscript">
<pre class="gscript">
module load R/3.6.2-foss-2019b
module load R/4.3.1-foss-2022a
</pre>
</pre>


Line 260: Line 356:
#SBATCH --time=02:00:00              # Time limit hrs:min:sec
#SBATCH --time=02:00:00              # Time limit hrs:min:sec
#SBATCH --output=testserial.%j.out    # Standard output log
#SBATCH --output=testserial.%j.out    # Standard output log
#SBATCH --error=testserial.%j.err   # Standard error log
#SBATCH --error=testserial.%j.err     # Standard error log


#SBATCH --mail-type=END,FAIL          # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-type=END,FAIL          # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=username@uga.edu  # Where to send mail
#SBATCH --mail-user=username@uga.edu  # Where to send mail (change username@uga.edu to your email address)


cd $SLURM_SUBMIT_DIR
cd $SLURM_SUBMIT_DIR


module load R/3.6.2-foss-2019b
module load R/4.3.1-foss-2022a


R CMD BATCH add.R
R CMD BATCH add.R
Line 273: Line 369:


In this sample script, the standard output and standard error of the job will be saved into files called testserial.%j.out and testserial.%j.err, respectively, where %j will be automatically replaced by the job id of the job.
In this sample script, the standard output and standard error of the job will be saved into files called testserial.%j.out and testserial.%j.err, respectively, where %j will be automatically replaced by the job id of the job.
====Serial (single-processor) Job on an AMD EPYC Milan processor====
Sample job submission script (sub.sh) to run an R program called add.R using a single core:
<pre class="gscript">
#!/bin/bash
#SBATCH --job-name=testserial        # Job name
#SBATCH --partition=batch            # Partition (queue) name
#SBATCH --constraint=Milan            # node feature
#SBATCH --ntasks=1                    # Run on a single CPU
#SBATCH --mem=1gb                    # Job memory request
#SBATCH --time=02:00:00              # Time limit hrs:min:sec
#SBATCH --output=testserial.%j.out    # Standard output log
#SBATCH --error=testserial.%j.err    # Standard error log
#SBATCH --mail-type=END,FAIL          # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=username@uga.edu  # Where to send mail (change username@uga.edu to your email address)
cd $SLURM_SUBMIT_DIR
module load R/4.3.1-foss-2022a
R CMD BATCH add.R
</pre>
In this sample script, the standard output and standard error of the job will be saved into files called testserial.%j.out and testserial.%j.err, respectively, where %j will be automatically replaced by the job id of the job.


====MPI Job====
====MPI Job====
Line 280: Line 403:
<pre class="gscript">
<pre class="gscript">
#!/bin/bash
#!/bin/bash
#SBATCH --job-name=mpitest     # Job name
#SBATCH --job-name=mpitest           # Job name
#SBATCH --partition=batch            # Partition (queue) name
#SBATCH --nodes=2                    # Number of nodes
#SBATCH --ntasks=16                  # Number of MPI ranks
#SBATCH --ntasks-per-node=8          # How many tasks on each node
#SBATCH --cpus-per-task=1            # Number of cores per MPI rank
#SBATCH --mem-per-cpu=600mb          # Memory per processor
#SBATCH --time=02:00:00              # Time limit hrs:min:sec
#SBATCH --output=mpitest.%j.out      # Standard output log
#SBATCH --error=mpitest.%j.err        # Standard error log
 
#SBATCH --mail-type=END,FAIL          # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=username@uga.edu  # Where to send mail (change username@uga.edu to your email address)
 
cd $SLURM_SUBMIT_DIR
 
module load OpenMPI/4.1.4-GCC-11.3.0
 
srun ./mympi.exe
</pre>
 
Please note that you need to start the application with '''srun''' and not with '''mpirun''' or '''mpiexec'''.
 
'''Important note:''' MPI jobs need to be submitted from a Sapelo2 login node, not from an interactive session, in order to get the correct core allocation for the MPI processes.
 
====MPI Job on nodes connected via the EDR IB fabric====
 
Sample job submission script (sub.sh) to run an OpenMPI application. In this example the job requests 16 cores and further specifies that these 16 cores need to be divided equally on 2 nodes (8 cores per node) and the binary is called mympi.exe:
 
<pre class="gscript">
#!/bin/bash
#SBATCH --job-name=mpitest            # Job name
#SBATCH --partition=batch            # Partition (queue) name
#SBATCH --partition=batch            # Partition (queue) name
#SBATCH --nodes=2               # Number of nodes
#SBATCH --constraint=EDR              # node feature
#SBATCH --ntasks=16                 # Number of MPI ranks
#SBATCH --nodes=2                     # Number of nodes
#SBATCH --ntasks-per-node=8         # How many tasks on each node
#SBATCH --ntasks=16                   # Number of MPI ranks
#SBATCH --cpus-per-task=1           # Number of cores per MPI rank  
#SBATCH --ntasks-per-node=8           # How many tasks on each node
#SBATCH --mem-per-cpu=600mb         # Memory per processor
#SBATCH --cpus-per-task=1             # Number of cores per MPI rank  
#SBATCH --time=02:00:00             # Time limit hrs:min:sec
#SBATCH --mem-per-cpu=600mb           # Memory per processor
#SBATCH --output=mpitest.%j.out         # Standard output log
#SBATCH --time=02:00:00               # Time limit hrs:min:sec
#SBATCH --error=mpitest.%j.err         # Standard error log
#SBATCH --output=mpitest.%j.out       # Standard output log
#SBATCH --error=mpitest.%j.err       # Standard error log


#SBATCH --mail-type=END,FAIL          # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-type=END,FAIL          # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=username@uga.edu  # Where to send mail
#SBATCH --mail-user=username@uga.edu  # Where to send mail (change username@uga.edu to your email address)


cd $SLURM_SUBMIT_DIR
cd $SLURM_SUBMIT_DIR


module load OpenMPI/3.1.4-GCC-8.3.0
module load OpenMPI/4.1.4-GCC-11.3.0


mpirun ./mympi.exe
srun ./mympi.exe
</pre>
</pre>
Please note that you need to start the application with '''srun''' and not with '''mpirun''' or '''mpiexec'''.
'''Important note:''' MPI jobs need to be submitted from a Sapelo2 login node, not from an interactive session, in order to get the correct core allocation for the MPI processes.


====OpenMP (Multi-Thread) Job====
====OpenMP (Multi-Thread) Job====
Line 307: Line 466:
<pre class="gscript">
<pre class="gscript">
#!/bin/bash
#!/bin/bash
#SBATCH --job-name=mctest     # Job name
#SBATCH --job-name=mctest             # Job name
#SBATCH --partition=batch            # Partition (queue) name
#SBATCH --partition=batch            # Partition (queue) name
#SBATCH --ntasks=1                   # Run a single task
#SBATCH --ntasks=1                   # Run a single task
#SBATCH --cpus-per-task=6           # Number of CPU cores per task
#SBATCH --cpus-per-task=6             # Number of CPU cores per task
#SBATCH --mem=4gb                   # Job memory request
#SBATCH --mem=4gb                     # Job memory request
#SBATCH --time=02:00:00             # Time limit hrs:min:sec
#SBATCH --time=02:00:00               # Time limit hrs:min:sec
#SBATCH --output=mctest.%j.out         # Standard output log
#SBATCH --output=mctest.%j.out       # Standard output log
#SBATCH --error=mctest.%j.err         # Standard error log
#SBATCH --error=mctest.%j.err         # Standard error log


#SBATCH --mail-type=END,FAIL          # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-type=END,FAIL          # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=username@uga.edu  # Where to send mail
#SBATCH --mail-user=username@uga.edu  # Where to send mail (change username@uga.edu to your email address)


cd $SLURM_SUBMIT_DIR
cd $SLURM_SUBMIT_DIR
Line 323: Line 482:
export OMP_NUM_THREADS=6   
export OMP_NUM_THREADS=6   


module load foss/2019b # load the appropriate module file, e.g. foss/2019b
module load foss/2022a # load the appropriate module file, e.g. foss/2022a


time ./a.out
time ./a.out
</pre>
</pre>


====High Memory Job====
====High Memory Job====


Sample job submission script (sub.sh) to run a velvet application that needs to use 50GB of memory and 4 threads:
Sample job submission script (sub.sh) to run a velvet application that needs to use 200GB of memory and 4 threads:


<pre class="gscript">
<pre class="gscript">
#!/bin/bash
#!/bin/bash
#SBATCH --job-name=highmemtest     # Job name
#SBATCH --job-name=highmemtest       # Job name
#SBATCH --partition=highmem            # Partition (queue) name
#SBATCH --partition=highmem_p        # Partition (queue) name
#SBATCH --ntasks=1                   # Run a single task
#SBATCH --ntasks=1                   # Run a single task
#SBATCH --cpus-per-task=4         # Number of CPU cores per task
#SBATCH --cpus-per-task=4             # Number of CPU cores per task
#SBATCH --mem=50gb                    # Job memory request
#SBATCH --mem=200gb                  # Job memory request
#SBATCH --time=02:00:00             # Time limit hrs:min:sec
#SBATCH --time=02:00:00               # Time limit hrs:min:sec
#SBATCH --output=highmemtest.%j.out     # Standard output log
#SBATCH --output=highmemtest.%j.out   # Standard output log
#SBATCH --error=highmemtest.%j.err     # Standard error log
#SBATCH --error=highmemtest.%j.err   # Standard error log


#SBATCH --mail-type=END,FAIL          # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-type=END,FAIL          # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=username@uga.edu  # Where to send mail
#SBATCH --mail-user=username@uga.edu  # Where to send mail (change username@uga.edu to your email address)
 


cd $SLURM_SUBMIT_DIR
cd $SLURM_SUBMIT_DIR
Line 364: Line 523:
#SBATCH --job-name=hybridtest
#SBATCH --job-name=hybridtest
#SBATCH --partition=batch            # Partition (queue) name
#SBATCH --partition=batch            # Partition (queue) name
#SBATCH --nodes=2             # Number of nodes
#SBATCH --nodes=2                     # Number of nodes
#SBATCH --ntasks=8             # Number of MPI ranks
#SBATCH --ntasks=8                   # Number of MPI ranks
#SBATCH --ntasks-per-node=4   # Number of MPI ranks per node
#SBATCH --ntasks-per-node=4           # Number of MPI ranks per node
#SBATCH --cpus-per-task=3     # Number of OpenMP threads for each MPI process/rank
#SBATCH --cpus-per-task=3             # Number of OpenMP threads for each MPI process/rank
#SBATCH --mem-per-cpu=2000mb   # Per processor memory request
#SBATCH --mem-per-cpu=2000mb         # Per processor memory request
#SBATCH --time=2-00:00:00     # Walltime in hh:mm:ss or d-hh:mm:ss (2 days in the example)
#SBATCH --time=2-00:00:00             # Walltime in hh:mm:ss or d-hh:mm:ss (2 days in the example)
#SBATCH --output=hybridtest.%j.out # Standard output log
#SBATCH --output=hybridtest.%j.out   # Standard output log
#SBATCH --error=hybridtest.%j.err   # Standard error log
#SBATCH --error=hybridtest.%j.err     # Standard error log
   
   
#SBATCH --mail-type=END,FAIL          # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-type=END,FAIL          # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=username@uga.edu  # Where to send mail
#SBATCH --mail-user=username@uga.edu  # Where to send mail (change username@uga.edu to your email address)
 


cd $SLURM_SUBMIT_DIR
cd $SLURM_SUBMIT_DIR
 
module load OpenMPI/4.1.4-GCC-11.3.0
 
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK


mpirun ./myhybridprog.exe
srun ./myhybridprog.exe


</pre>
</pre>
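It can help to work out what the hybrid request above adds up to before submitting. The arithmetic below is a sketch using the values from the sample script (2 nodes, 8 MPI ranks, 4 ranks per node, 3 threads per rank, 2000 MB per core); the function name <code>hybrid_request_summary</code> is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: summarize the resources implied by a hybrid MPI/OpenMP request.
# Arguments: nodes, ntasks, ntasks-per-node, cpus-per-task, mem-per-cpu (MB)
hybrid_request_summary() {
    nodes=$1; ntasks=$2; ntasks_per_node=$3; cpus_per_task=$4; mem_per_cpu_mb=$5

    # The per-node layout should account for every MPI rank
    if [ $(( nodes * ntasks_per_node )) -ne "$ntasks" ]; then
        echo "Inconsistent request: nodes x ntasks-per-node does not equal ntasks"
        return 1
    fi

    total_cores=$(( ntasks * cpus_per_task ))
    cores_per_node=$(( ntasks_per_node * cpus_per_task ))
    mem_per_node_mb=$(( cores_per_node * mem_per_cpu_mb ))

    echo "Total cores: ${total_cores}"
    echo "Cores per node: ${cores_per_node}"
    echo "Memory per node: ${mem_per_node_mb} MB"
}

hybrid_request_summary 2 8 4 3 2000
```

For the sample script this gives 24 cores in total, with 12 cores and 24000 MB of memory on each of the two nodes.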
Line 389: Line 551:
<pre class="gscript">
<pre class="gscript">
#!/bin/bash
#!/bin/bash
#SBATCH --job-name=arrayjobtest   # Job name
#SBATCH --job-name=arrayjobtest       # Job name
#SBATCH --partition=batch            # Partition (queue) name
#SBATCH --partition=batch            # Partition (queue) name
#SBATCH --ntasks=1                 # Run a single task
#SBATCH --ntasks=1                   # Run a single task
#SBATCH --mem=1gb                   # Job Memory
#SBATCH --mem=1gb                     # Job Memory
#SBATCH --time=10:00:00             # Time limit hrs:min:sec
#SBATCH --time=10:00:00               # Time limit hrs:min:sec
#SBATCH --output=array_%A-%a.out   # Standard output log
#SBATCH --output=array_%A-%a.out     # Standard output log
#SBATCH --error=array_%A-%a.err   # Standard error log
#SBATCH --error=array_%A-%a.err       # Standard error log
#SBATCH --array=0-9                 # Array range
#SBATCH --array=0-9                   # Array range


cd $SLURM_SUBMIT_DIR
cd $SLURM_SUBMIT_DIR


module load foss/2019b # load any needed module files, e.g. foss/2019b
module load foss/2022a # load any needed module files, e.g. foss/2022a


time ./a.out < input_$SLURM_ARRAY_TASK_ID
time ./a.out < input_$SLURM_ARRAY_TASK_ID
Line 406: Line 568:
</pre>
</pre>


For more information, please see [[Array Jobs]].


====GPU/CUDA====
====GPU/CUDA====


To be added (as of June 18, 2020).
Sample script to run Amber on a GPU node using one node, 2 CPU cores, and 1 GPU card:
<pre class="gscript">
#!/bin/bash
#SBATCH --job-name=amber              # Job name
#SBATCH --partition=gpu_p            # Partition (queue) name
#SBATCH --gres=gpu:A100:1                  # Requests one GPU device
#SBATCH --ntasks=1                    # Run a single task
#SBATCH --cpus-per-task=2            # Number of CPU cores per task
#SBATCH --mem=40gb                    # Job memory request
#SBATCH --time=10:00:00              # Time limit hrs:min:sec
#SBATCH --output=amber.%j.out        # Standard output log
#SBATCH --error=amber.%j.err          # Standard error log
 
#SBATCH --mail-type=END,FAIL          # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=username@uga.edu  # Where to send mail (change username@uga.edu to your email address)
 
cd $SLURM_SUBMIT_DIR
 
ml Amber/22.0-foss-2021b-AmberTools-22.3-CUDA-11.4.1
 
$AMBERHOME/bin/pmemd.cuda -O -i ./prod.in -o prod.out  -p ./dimerFBP_GOL.prmtop -c ./restart.rst -r prod.rst -x prod.mdcrd
</pre>
You can explicitly request a GPU device type for your job. For example:
 
*To request an A100 device, use <code>#SBATCH --gres=gpu:A100:1</code>
 
*To request an H100 device, use <code>#SBATCH --gres=gpu:H100:1</code>
 
*To request an L4 device, use <code>#SBATCH --gres=gpu:L4:1</code>
 
*To request a P100 device, use <code>#SBATCH --gres=gpu:P100:1</code>
 
Jobs that request a GPU, but that do not specify the device type (that is, jobs that use <code>#SBATCH --gres=gpu:1</code>) will get allocated any device type, some of which might not work for the application that you are running. Please check which GPU device is supported by the application or code your job is running and request the corresponding GPU device type. For more information about the GPU resources available on Sapelo2, please see https://wiki.gacrc.uga.edu/wiki/GPU and https://wiki.gacrc.uga.edu/wiki/GPU_Hardware.
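To confirm which device a job actually received, the job script can log the GPUs visible to it. The snippet below is a sketch that assumes Slurm exports CUDA_VISIBLE_DEVICES for jobs that request a GPU with --gres, which is the usual behavior but depends on the cluster configuration; the function name <code>allocated_gpus</code> is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print the GPU indices visible to the job,
# or "none" when CUDA_VISIBLE_DEVICES is not set (e.g. outside of a GPU job).
allocated_gpus() {
    echo "${CUDA_VISIBLE_DEVICES:-none}"
}

echo "GPU device(s) visible to this job: $(allocated_gpus)"
```

If the <code>nvidia-smi</code> tool is available on the GPU node, running <code>nvidia-smi --query-gpu=name --format=csv,noheader</code> in the job script would additionally record the device model in the job's standard output.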
 
====Singularity job====
 
Sample job submission script (sub.sh) to run a program (e.g. sortmerna) using a singularity container:
 
<pre class="gscript">
#!/bin/bash
#SBATCH --job-name=j_sortmerna        # Job name
#SBATCH --partition=batch            # Partition (queue) name
#SBATCH --ntasks=1                    # Run a single task
#SBATCH --mem=1gb                    # Job memory request
#SBATCH --time=02:00:00              # Time limit hrs:min:sec
#SBATCH --output=sortmerna.%j.out    # Standard output log
#SBATCH --error=sortmerna.%j.err      # Standard error log
#SBATCH --cpus-per-task=4            # Number of CPU cores per task
#SBATCH --mail-type=END,FAIL          # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=username@uga.edu  # Where to send mail (change username@uga.edu to your email address)
 
cd $SLURM_SUBMIT_DIR
 
singularity exec /apps/singularity-images/sortmerna-3.0.3.simg sortmerna \
--threads 4 --ref db.fasta,db.idx --reads file.fa --aligned base_name_output
</pre>
For more information about software installed as singularity containers on the cluster, please see [[Software_on_Sapelo2#Singularity_Containers]]
 
To run a GPU-enabled singularity container on the GPU, please submit the job to the gpu_p partition, request a GPU device and add the '''--nv''' option to the singularity command.
 
Sample job submission script (sub.sh) to run a program using a singularity container (e.g. gpuapp.sif) on the GPU device:
 
<pre class="gscript">
#!/bin/bash
#SBATCH --job-name=myjobname          # Job name
#SBATCH --partition=gpu_p            # Partition (queue) name
#SBATCH --gres=gpu:1                  # Requests one GPU device
#SBATCH --ntasks=1                    # Run on a single CPU
#SBATCH --mem=10gb                    # Job memory request
#SBATCH --time=02:00:00              # Time limit hrs:min:sec
#SBATCH --cpus-per-task=1            # Number of CPU cores per task
 
cd $SLURM_SUBMIT_DIR
 
singularity exec --nv /apps/singularity-images/gpuapp.sif prog.x 
</pre>
For more information about software installed as singularity containers on the cluster, please see [[Software_on_Sapelo2#Singularity_Containers]]


----
----
[[#top|Back to Top]]
[[#top|Back to Top]]


===How to submit a job to the batch queue===
===How to submit a batch job===


With the resource requirements specified in the job submission script (sub.sh), submit your job with
With the resource requirements specified in the job submission script (sub.sh), submit your job with
Line 438: Line 677:


<pre class="gcommand">
<pre class="gcommand">
To be added (as June 18, 2020).
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
batch*      up  7-00:00:00      1 drain* ra4-2
batch*      up  7-00:00:00      3  down* d4-7,ra3-19,ra4-12
batch*      up  7-00:00:00      1    mix b1-2
batch*      up  7-00:00:00      1  alloc b1-3
batch*      up  7-00:00:00    53  idle b1-[4-24],c1-3,c5-19,d4-[5-6,8-12],ra3-[1-18,20-24]
gpu_p        up  7-00:00:00      1    mix c4-23
highmem_p    up  7-00:00:00      6  idle d4-[11-12],ra4-[21-24]
inter_p      up  2-00:00:00      2  idle ra4-[16-17]
</pre>
</pre>
where some common values of STATE are:
where some common values of STATE are:
Line 444: Line 691:
*STATE=mix indicates that some cores on those nodes are in use (and some are free).
*STATE=mix indicates that some cores on those nodes are in use (and some are free).
*STATE=alloc indicates that all cores on those nodes are in use.
*STATE=alloc indicates that all cores on those nodes are in use.
*STATE=drain indicates that nodes are draining, not accepting new jobs
*STATE=down indicates that nodes are not running or accepting new jobs
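To get a quick summary of how many nodes are completely free in each partition, the output of <code>sinfo</code> can be post-processed. The sketch below assumes the default column layout shown in the sample output above (PARTITION, AVAIL, TIMELIMIT, NODES, STATE, NODELIST); the function name <code>idle_nodes_per_partition</code> is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: read sinfo output on stdin and print, for each
# partition, the number of nodes whose STATE is exactly "idle".
idle_nodes_per_partition() {
    awk 'NR > 1 && $5 == "idle" {
             sub(/\*$/, "", $1)   # drop the "*" that marks the default partition
             idle[$1] += $4       # column 4 is the NODES count
         }
         END { for (p in idle) print p, idle[p] }' | LC_ALL=C sort
}
```

On the cluster this would be used as <code>sinfo | idle_nodes_per_partition</code>. Note that <code>sinfo</code> also accepts the <code>--states=idle</code> option, which restricts its output to idle nodes.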
This command can be used with many options. We have configured one option that shows some quantities that are commonly of interest, including the node features defined for each node. This command is
<pre class="gcommand">
sinfo-gacrc
</pre>
You can also specify the number of characters displayed in the NODELIST column (e.g. 40) and in the AVAIL_FEATURES column (e.g. 50), with
<pre class="gcommand">
sinfo-gacrc 40 50
</pre>
Sample output of the '''sinfo-gacrc''' command:
<pre class="gcommand">
PARTITION      NODELIST          STATE      CPUS  MEMORY  AVAIL_FEATURES        GRES     
batch*          ra4-2              drained*  32    126000  AMD,Opteron,QDR      lscratch:230       
batch*          ra3-19            down*      32    126000  AMD,Opteron,QDR      lscratch:230 
batch*          ra4-12            down*      32    126000  AMD,Opteron,QDR      lscratch:230
batch*          b1-3              mixed      64    126976  AMD,EPYC,Rome,EDR    lscratch:890   
batch*          b1-2              allocated  64    126976  AMD,EPYC,Rome,EDR    lscratch:890
batch*          b1-[4-24]          idle      64    126976  AMD,EPYC,Rome,EDR    lscratch:890   
batch*          c1-3              idle      28    59127    Intel,Broadwell,EDR  lscratch:890   
batch*          c5-19              idle      32    187868  Intel,Skylake,EDR    lscratch:890   
batch*          d4-[5-6]          idle      32    126976  AMD,EPYC,Naples,EDR  lscratch:890   
batch*          d4-[8-12]          idle      32    126976+  AMD,EPYC,Naples,EDR  lscratch:890   
batch*          ra3-[1-18,20-24]  idle      32    126000  AMD,Opteron,QDR      lscratch:230       
gpu_p          c4-23              idle      32    187868  Intel,Skylake,EDR    gpu:P100:1,lscratch:890
highmem_p      d4-[11-12]        idle      32    514048  AMD,EPYC,Naples,EDR  lscratch:890 
highmem_p      ra4-[21-24]        idle      32    126000  AMD,Opteron,QDR      lscratch:230
inter_p        ra4-[16-17]        idle      32    126000  AMD,Opteron,QDR      lscratch:230
scavenge_p      rb7-18            idle      28    515780  Intel,Broadwell,QDR  lscratch:180
</pre>
----
[[#top|Back to Top]]
===What is the scavenge_p partition===
A portion of the Sapelo2 compute nodes were purchased by UGA PIs and their group members have priority in using those resources (also referred to as buyin nodes). The GACRC purchased the rest on UGA's behalf. The agreement for the PI-owned nodes allows "other users" to also run jobs on owned nodes, as long as those jobs don't cause that lab group to wait over two hours for access to its nodes. We have implemented a partition called scavenge_p and short jobs (for example, jobs that request less than 4h) submitted to the 'batch' partition might be automatically moved into the scavenge_p partition if the 'batch' partition is busy. This is a way to reduce the wait time of the short jobs, while making use of the buyin nodes that are not in use. Jobs running on the buyin nodes (or any nodes) cannot be dynamically migrated to other nodes, so buyin-group users might have to wait up to 4h to access their nodes, if there are jobs running in the scavenge_p partition.
Users cannot submit jobs directly to the scavenge_p partition, but if you submitted short jobs to the batch partition, you might see them running on the scavenge_p partition.
----
[[#top|Back to Top]]
===How to request a specific node feature===
Each compute node has a set of features, as shown by the sinfo-gacrc command above. Common features are Intel (if the node has Intel processors), AMD (if the node has AMD processors), EPYC (if the node has AMD EPYC processors), or specific EPYC processor types, such as Rome, Milan, etc. You can request nodes with a specific feature by adding the following header line in your job submission script:
<pre class="gscript">
#SBATCH --constraint=featurename
</pre>
where '''featurename''' needs to be replaced by the feature you want to use. For example, to request that the job goes to a node that has a Milan processor, use
<pre class="gscript">
#SBATCH --constraint=Milan
</pre>
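Before choosing a feature name, it can help to check which nodes actually advertise it. The sketch below filters the output of <code>sinfo-gacrc</code>, assuming the column layout shown in the sample output earlier on this page (NODELIST in column 2, AVAIL_FEATURES in column 6); the function name <code>nodes_with_feature</code> is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: read sinfo-gacrc output on stdin and print the
# NODELIST entries whose AVAIL_FEATURES include the feature given as $1.
nodes_with_feature() {
    awk -v want="$1" 'NR > 1 {
        n = split($6, feats, ",")          # AVAIL_FEATURES is a comma-separated list
        for (i = 1; i <= n; i++)
            if (feats[i] == want) { print $2; break }
    }'
}
```

For example, <code>sinfo-gacrc | nodes_with_feature Milan</code> would list the node names that have the Milan feature. If the AVAIL_FEATURES column is truncated, a wider column can be requested as described above (e.g. <code>sinfo-gacrc 40 50</code>).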
----
[[#top|Back to Top]]
===How to run Intel- or AMD-specific applications===
Most of the applications that GACRC installs centrally can be run on Intel and on AMD processors, but some exceptions do exist. Also, some third-party applications that you are using might have been pre-compiled for a given processor type and would fail if run on a different processor architecture. If an application that you are using is only compatible with one type of processor (e.g. Intel), you can request that node feature by adding a line such as one of the following in your job submission script:
<pre class="gscript">
#SBATCH --constraint=Intel
</pre>
or
<pre class="gscript">
#SBATCH --constraint=EPYC
</pre>
or
<pre class="gscript">
#SBATCH --constraint=Milan
</pre>
----
[[#top|Back to Top]]
=== How to run a job using the local scratch /lscratch on a compute node ===
The IO performance of the local scratch file system /lscratch is much faster than the IO performance of the network file system /scratch. '''Please note''' that the local scratch file system can only be used for running single-node jobs, i.e., single-core jobs or multi-thread jobs. In general, MPI parallel jobs that use more than one node cannot use the local scratch file system. Detailed information and instructions about /lscratch can be found at [[Disk_Storage#lscratch_file_system]] .
To use /lscratch to run a batch job, you need a few additional steps in your job submission script to ask your job to:
# Create a job working folder in /lscratch on the compute node where your job is dispatched
# Copy any input files required to run the job from your current working space, e.g., /scratch/MyID, to the folder created in step 1
# Change directory from your current working space /scratch/MyID to the folder created in step 1 and run the software from there, i.e. from the local scratch file system /lscratch
# Copy output results from /lscratch back to your /scratch/MyID, before the job finishes and exits from the node
# Clean up in /lscratch, before the job finishes and exits from the node
To use /lscratch to run a batch job, you also need to:
1. Make sure that your job will use a single node by using the following line in your job submission script:
<pre class="gscript">
#SBATCH --nodes=1
</pre>
2. Request an appropriate amount of disk storage from the local scratch file system by adding the following line in your job submission script:
<pre class="gscript">
#SBATCH --gres=lscratch:200
</pre>
The above header requests 200GB of local storage on the compute node where your job is dispatched.
Below is a sample job submission script (sub.sh) to run a batch job using /lscratch:
<pre class="gscript">
#!/bin/bash
#SBATCH --job-name=RM_job
#SBATCH --partition=batch
#SBATCH --nodes=1
#SBATCH --gres=lscratch:200
#SBATCH --ntasks=12
#SBATCH --mem=36G
#SBATCH --time=7-00:00:00
#SBATCH --output=log.%j.out
#SBATCH --error=log.%j.err
cd $SLURM_SUBMIT_DIR
# Step 1
mkdir -p /lscratch/${USER}/${SLURM_JOB_ID}
# Step 2
cp ./Hawaii_H3_Final.fa /lscratch/${USER}/${SLURM_JOB_ID}
# Step 3
cd /lscratch/${USER}/${SLURM_JOB_ID}
ml RepeatModeler/2.0.4-foss-2022a
BuildDatabase -name E4 -engine ncbi Hawaii_H3_Final.fa
RepeatModeler -engine ncbi -pa 3 -database E4 > E4-repeat.out
# Step 4
cp ./E4* ${SLURM_SUBMIT_DIR}
cp -r ./RM_* ${SLURM_SUBMIT_DIR}
# Step 5
rm -rf /lscratch/${USER}/${SLURM_JOB_ID}
</pre>
Then submit sub.sh from your current working space /scratch/MyID with:
<pre class="gcommand">
sbatch sub.sh
</pre>
Since you submit the job from /scratch/MyID, the value stored in SLURM_SUBMIT_DIR in the above sub.sh will be /scratch/MyID.
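In the sample script above, the clean-up in step 5 only runs if the script reaches its last line. A possible variation, shown here only as a sketch, is to register the clean-up with a shell <code>trap</code> so that it also runs when an earlier command fails. The function name <code>run_in_lscratch</code> and the <code>LSCRATCH_ROOT</code> variable are hypothetical; they are used here so that the same code can be tried outside of a job.

```shell
#!/bin/bash
# Hypothetical helper: stage one input file into a per-job folder in local
# scratch, run a command there, copy the results back, and always clean up.
# Usage: run_in_lscratch INPUT_FILE COMMAND [ARGS...]
run_in_lscratch() {
    local root="${LSCRATCH_ROOT:-/lscratch}"          # assumption: overridable for testing
    local submit_dir="${SLURM_SUBMIT_DIR:-$PWD}"
    local workdir="${root}/${USER:-$(id -un)}/${SLURM_JOB_ID:-manual.$$}"
    local input="$1"
    shift
    (
        # Step 5, registered first: remove the job folder when this subshell exits
        trap 'rm -rf "$workdir"' EXIT
        mkdir -p "$workdir"                 || exit 1   # Step 1
        cp "$submit_dir/$input" "$workdir"/ || exit 1   # Step 2
        cd "$workdir"                       || exit 1   # Step 3
        "$@"                                            # run the application
        status=$?
        cp -r ./* "$submit_dir"/                        # Step 4 (also copies the input back)
        exit "$status"
    )
}
```

Inside a job submission script, the application would then be started with, for example, <code>run_in_lscratch Hawaii_H3_Final.fa bash my_steps.sh</code>, where my_steps.sh is a hypothetical script that contains the application commands.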
To learn the total amount of local disk storage installed in compute nodes on Sapelo2, you can use the '''sinfo-gacrc''' command. The '''GRES''' column reports the total amount of local disk storage in GB; for example, '''lscratch:890''' means that a total of 890GB of local disk storage is installed in the compute node(s). Detailed instructions about sinfo-gacrc can be found at [[Running_Jobs_on_Sapelo2#Discovering_if_a_partition_.28queue.29_is_busy]]


----
----
Line 452: Line 849:
An interactive session on a compute node can be started with the command
An interactive session on a compute node can be started with the command
<pre class="gcommand">
<pre class="gcommand">
qlogin (to be added as of June 18, 2020)
interact
</pre>
This command, invoked without any arguments, will start an interactive session with one core on one of the interactive nodes, and allocate 2GB of memory for a maximum walltime of 12h. It is equivalent to the <code>qlogin</code> command that we used previously, and it runs
<pre class="gcommand">
srun --pty  --cpus-per-task=1 --job-name=interact --ntasks=1 --nodes=1 --partition=inter_p --time=12:00:00 --mem=2GB /bin/bash -l
</pre>
When the <code>interact</code> command is run, it will echo the equivalent srun command, so you can easily check the resources associated with your interactive session.
 
The <code>interact</code> command takes arguments that allow you to request cores, memory, walltime limit, specific node features, or a different partition and other resources.
 
The options that can be used with <code>interact</code> are displayed when this command is run with the -h or --help option:
 
<pre class="gcomment">
[shtsai@ss-sub2 ~]$ interact -h
 
Usage: interact [OPTIONS]
 
Description: Start an interactive job
 
    -c, --cpus-per-task        CPU cores per task (default: 1)
    -J, --job-name              Job name (default: interact)
    -n, --ntasks                Number of tasks (default: 1)
    -N, --nodes            Number of nodes (default: 1)
    -p, --partition            Partition for interactive job (default: inter_p)
    -q, --qos              Request a quality of service for the job.
    -t, --time              Maximum run time for interactive job (default: 12:00:00)
    -w, --nodelist              List of node name(s) on which your job should run
    --constraint                Job constraints
    --gres                  Generic consumable resources
    --mem                  Memory per node (default 2GB)
    --shell                Absolute path to the shell to be used in your interactive job (default: /bin/bash)
    --wckey                Wckey to be used with job
    --x11                  Start an interactive job with X Forwarding
    -h, --help              Display this help output
</pre>
 
'''Examples:'''
 
To start an interactive session with 4 cores and 10GB of memory:
<pre class="gcommand">
interact -c 4 --mem=10G
</pre>
</pre>
This command will start an interactive session with one core on one of the interactive nodes, and allocate 2GB of memory for a maximum walltime of 12h.


The '''qlogin''' command is an alias for
To start an interactive session with 1 core, 10GB of memory and a walltime limit of 18 hours:
<pre class="gcommand">
<pre class="gcommand">
srun --pty -p interq --time=12:00:00 --mem=2gb  bash
interact --mem=10G --time=18:00:00
</pre>
 
To start an interactive session with 1 core, 2GB of memory, on a node that has an AMD EPYC Milan processor in the batch partition:
<pre class="gcommand">
interact --constraint=Milan -p batch
</pre>
 
To start an interactive session with 1 core, 50GB of memory, and an A100 GPU device:
<pre class="gcommand">
interact -p gpu_p --gres=gpu:A100:1 --mem=50G
</pre>
</pre>


Line 466: Line 912:
===How to run an interactive job with Graphical User Interface capabilities===
===How to run an interactive job with Graphical User Interface capabilities===


If you want to run an application as an interactive job and have its graphical  
A number of software packages installed on GACRC clusters have X Window (GUI) front ends. Examples of such applications are Matlab, Mathematica, some text editors and debuggers, etc. The best way to run such applications is using the Open OnDemand (OOD) interface to Sapelo2, either by running an interactive application in OOD or by starting an X Desktop session on the cluster and running the application therein. More information is available at [[OnDemand]].
 
If using OnDemand is not an option, and you want to run an application as an interactive job and have its graphical  
user interface displayed on the terminal of your local machine, you need to  
user interface displayed on the terminal of your local machine, you need to  
enable X-forwarding when you ssh into the login node. For information on how  
enable X-forwarding when you ssh into the login node. For information on how  
to do this, please see questions 10 and 11 in the [[Frequently Asked Questions]]  
to do this on Windows and Mac, please see the instructions within questions 5.4 and 5.5 in the [[Frequently Asked Questions]] 
page.
page. This can be done on a Linux machine
 
by simply adding the -X option when ssh-ing into Sapelo2.
On the teaching cluster, X-forwarding does not work from any of the compute nodes,
including the interactive nodes. Please feel free to run X windows applications
directly on the teaching cluster login node.
 
<!--
'''NOTE: X-forwarding is not working on Sapelo2 yet, sorry for the inconvenience'''


After setting up an X-forwarding terminal on your local machine, start an interactive session, but add the option --x11 to the <code>interact</code> command.


If you want to run an application as an interactive job and have its graphical
An interactive session on a compute node, with X forwarding enabled, can be started with the command
user interface displayed on the terminal of your local machine, you need to
enable X-forwarding when you ssh into the login node. For information on how
to do this, please see questions 10 and 11 in the [[Frequently Asked Questions]]
page.
 
Then start an interactive session, but add the option -X to the qsub command.
For example:
<pre class="gcommand">
<pre class="gcommand">
qsub -I -X -q s_interq -l walltime=12:00:00 -l nodes=1:ppn=1 -l mem=2gb
interact --x11
</pre>
</pre>
where the walltime has been set to 12h, the memory set to 2gb (choose appropriately), and the
This command will start an interactive session, with X forwarding enabled, with one core on one of the interactive nodes, and allocate 2GB of memory for a maximum walltime of 12h.
queue selected was s_interq, which targets interactive nodes with either Intel or AMD feature.  


The '''xqlogin''' command is an alias for  
The <code>interact --x11</code> command is an alias for  
<pre class="gcommand">
<pre class="gcommand">
qsub -I -X -q s_interq -l walltime=12:00:00 -l nodes=1:ppn=1 -l mem=2gb
srun --pty --x11 --cpus-per-task=1 --job-name=interact --ntasks=1 --nodes=1 --partition=inter_p --time=12:00:00 --mem=2GB /bin/bash -l
</pre>
</pre>
so it can be used to start an interactive session with X-forwarding enabled and with a walltime of 12h.


Once a shell prompt on an interactive node is returned, you can invoke the application.
The options available to <code>interact</code>, described in the previous section, can be used along with the <code>--x11</code> option.
If it has a GUI, that should be displayed on your local machine (laptop or
desktop).


-->
----
----
[[#top|Back to Top]]
[[#top|Back to Top]]
Line 511: Line 942:
===How to run a singularity application===
===How to run a singularity application===


There are applications installed as singularity containers under /usr/local/singularity-images.  
There are applications installed as singularity containers under /apps/singularity-images.  


File names begin with the application name and version; for example, /usr/local/singularity-images/trinity-2.5.1--0.simg is the image for Trinity version 2.5.1.
File names begin with the application name and version; for example, /apps/singularity-images/trinity-2.5.1--0.simg is the image for Trinity version 2.5.1.


For information on Singularity please visit: http://singularity.lbl.gov/
For information on Singularity please visit: http://singularity.lbl.gov/


Singularity containers have been configured to access the user's home directory ($HOME), lustre1 directory (/lustre1), and lscratch directory (/lscratch). The temp directory (/tmp) is inside the container.
Singularity containers have been configured to access the user's home directory ($HOME), scratch directory (/scratch), and lscratch directory (/lscratch). The temp directory (/tmp) is inside the container.


All environment variables set before executing the singularity command are available inside the container.
All environment variables set before executing the singularity command are available inside the container.
Line 527: Line 958:


<pre class="gcommand">
<pre class="gcommand">
singularity exec /usr/local/singularity-images/trinity-2.5.1--0.simg which Trinity
singularity exec /apps/singularity-images/trinity-2.5.1--0.simg which Trinity
/usr/local/bin/Trinity
/usr/local/bin/Trinity
singularity exec /usr/local/singularity-images/trinity-2.5.1--0.simg ls -al /usr/local/bin/Trinity
singularity exec /apps/singularity-images/trinity-2.5.1--0.simg ls -al /usr/local/bin/Trinity
lrwxrwxrwx    1 root    root            28 Dec  9 04:04 /usr/local/bin/Trinity -> ../opt/trinity-2.5.1/Trinity
lrwxrwxrwx    1 root    root            28 Dec  9 04:04 /usr/local/bin/Trinity -> ../opt/trinity-2.5.1/Trinity
</pre>
</pre>
Line 536: Line 967:


<pre class="gcommand">
<pre class="gcommand">
singularity exec /usr/local/singularity-images/trinity-2.5.1--0.simg ls /usr/local/opt/trinity-2.5.1  
singularity exec /apps/singularity-images/trinity-2.5.1--0.simg ls /usr/local/opt/trinity-2.5.1  
</pre>
</pre>


Line 550: Line 981:
   
   
cd $PBS_O_WORKDIR
cd $PBS_O_WORKDIR
singularity exec /usr/local/singularity-images/trinity-2.5.1--0.simg COMMAND OPTION
singularity exec /apps/singularity-images/trinity-2.5.1--0.simg COMMAND OPTION
</pre>
</pre>


Line 565: Line 996:
cd $PBS_O_WORKDIR
cd $PBS_O_WORKDIR


singularity exec /usr/local/singularity-images/trinity-2.5.1--0.simg Trinity --seqType <string> --max_memory 100G --CPU 8 --no_version_check 1>job.out 2>job.err   
singularity exec /apps/singularity-images/trinity-2.5.1--0.simg Trinity --seqType <string> --max_memory 100G --CPU 8 --no_version_check 1>job.out 2>job.err   
</pre>
</pre>


Line 660: Line 1,091:
<pre class="gcommand">
<pre class="gcommand">
squeue -l
squeue -l
</pre>
This command can be used with many options. We have a wrapper for this command, called <code>sq</code>, that shows some quantities that are commonly of interest. To use the <code>sq</code> command to list all of your running and pending jobs, use
<pre class="gcommand">
sq --me
</pre>
</pre>


 
For detailed information on how to monitor your jobs, please see [[Monitoring Jobs on Sapelo2]].
For detailed information on how to monitor your jobs, please see [[Monitoring Jobs on the teaching cluster]].


----
----
[[#top|Back to Top]]
[[#top|Back to Top]]


===How to delete a running or pending job===
===How to cancel (delete) a running or pending job===


To delete one of your running or pending jobs, use the command
To cancel one of your running or pending jobs, use the command
<pre class="gcommand">
<pre class="gcommand">
scancel <jobid>
scancel <jobid>
</pre>
</pre>
For example, to delete a job with Job ID 12345 use
For example, to cancel a job with Job ID 12345 use
<pre class="gcommand">
<pre class="gcommand">
scancel 12345
scancel 12345
</pre>
To cancel all of your jobs, use the command
<pre class="gcommand">
scancel -u MyID
</pre>
To cancel all of your pending jobs, use the command
<pre class="gcommand">
scancel -t PENDING -u MyID
</pre>
To cancel one or more jobs by job name, use the command
<pre class="gcommand">
scancel --name <myJobName>
</pre>
To cancel an element (index) of an array job
<pre class="gcommand">
scancel <jobid>_<index>
</pre>
For example, to cancel array job element 4 of an array job whose Job ID is 12345 use
<pre class="gcommand">
scancel 12345_4
</pre>
</pre>


Line 691: Line 1,149:
This command can be used with many options. We have configured one option that shows some quantities that are commonly of interest, including the amount of memory used and the cputime used by the jobs:
This command can be used with many options. We have configured one option that shows some quantities that are commonly of interest, including the amount of memory used and the cputime used by the jobs:
<pre class="gcommand">
<pre class="gcommand">
sacct_zh (to be added as of June 18, 2020)
sacct-gacrc
</pre>
 
For detailed information on how to monitor your jobs, please see [[Monitoring Jobs on the teaching cluster]].
<!--
'''1.''' You can request that an email be sent to you when the job finishes, by adding these two header lines to the job submission script:
<pre class="gscript">
#SBATCH --mail-type=END,FAIL 
#SBATCH --mail-user=username@uga.edu
</pre>
where ''username@uga.edu'' should be replaced by your email address (not necessarily a UGAMail address).
 
The email message will include the resource utilization of the job. 
 
'''2.''' Within 24 hours of a job completion, you can use the command
<pre class="gcommand">
qstat -f jobid
</pre>
</pre>
to check on the resource utilization (such as wall clock time, amount of memory, etc).


For detailed information on how to monitor your jobs, please see [[Monitoring Jobs on Sapelo2]].


'''3.''' For jobs that completed over one hour ago, but no longer than 7 days ago, you can use the command
<pre class="gcommand">
showjobs jobid
</pre>
to check on the resource utilization (such as wall clock time, amount of memory, etc).
-->
----
----
[[#top|Back to Top]]
[[#top|Back to Top]]

Latest revision as of 15:35, 18 October 2024


Using the Queueing System

The login node for the Sapelo2 cluster should be used for text editing, and job submissions. No jobs should be run directly on the login node. Processes that use too much CPU or RAM on the login node may be terminated by GACRC staff, or automatically, in order to keep the cluster running properly. Jobs should be run using the Slurm queueing system. The queueing system should be used to run both interactive and batch jobs.


Back to Top

Batch partitions (queues) defined on the Sapelo2

There are different partitions defined on Sapelo2. The Slurm queueing system refers to queues as partitions. Users are required to specify, in the job submission script or as job submission command line arguments, the partition and the resources needed by the job in order for it to be assigned to compute node(s) that have enough available resources (such as number of cores, amount of memory, GPU cards, etc). Please note, Slurm will not allow a job to be submitted if there are no resources matching your request. Please refer to Migrating from Torque to Slurm for more information about the Slurm queueing system.

The following partitions are defined on the Sapelo2 cluster:

Partition Name | Time limit | Max jobs | Notes
batch | 7 days | | Regular nodes.
batch_30d | 30 days | 2 | Regular nodes. A given user can have up to one job running at a time here, plus one pending, or two pending and none running. A user's attempt to submit a third job into this partition will be rejected.
highmem_p | 7 days | | For high memory jobs
highmem_30d_p | 30 days | 2 | For high memory jobs. A given user can have up to one job running at a time here, plus one pending, or two pending and none running. A user's attempt to submit a third job into this partition will be rejected.
hugemem_p | 7 days | 4 | For jobs needing up to 3TB of memory.
hugemem_30d_p | 30 days | 4 | For jobs needing up to 3TB of memory.
gpu_p | 7 days | | For GPU-enabled jobs.
gpu_30d_p | 30 days | 2 | For GPU-enabled jobs. A given user can have up to one job running at a time here, plus one pending, or two pending and none running. A user's attempt to submit a third job into this partition will be rejected.
inter_p | 2 days | | Regular nodes, for interactive jobs.
name_p | variable | | Partitions that target different groups' buy-in nodes. The name string is specific to each group.
scavenge_p | 2 hours | | Partition that targets the buy-in nodes. When there are no available resources in the batch partition, short jobs submitted there might be automatically transferred into scavenge_p, to run on idle buy-in resources. Jobs cannot be submitted directly to this partition.

For more detailed information about the partitions, please see Job Submission partitions on Sapelo2.


The table below summarizes the partitions (queues) defined and the compute nodes that they target:

Partition Name | Node Features | Node Number | Description | Memory for jobs | Notes
batch, batch_30d | AMD, Opteron, QDR | | 48-core, 128GB RAM, AMD Opteron, QDR IB interconnect | 122GB | Regular nodes.
batch, batch_30d | AMD, EPYC, EDR | | 64-core, 128GB RAM, AMD EPYC, IB EDR interconnect | 120GB | Regular nodes
batch, batch_30d | AMD, EPYC, EDR | | 32-core, 128GB RAM, AMD EPYC, IB EDR interconnect | 120GB | Regular nodes
batch, batch_30d | AMD, Opteron, QDR | | 48-core, 256GB RAM, AMD Opteron, QDR IB interconnect | 250GB | Regular nodes.
batch, batch_30d | Intel, Skylake, EDR | | 32-core, 192GB RAM, Intel Xeon Skylake, IB EDR interconnect | 180GB | Regular nodes
batch, batch_30d | Intel, Broadwell, EDR | | 28-core, 64GB RAM, Intel Xeon Broadwell, IB EDR interconnect | 58GB | Regular nodes
highmem_p, highmem_30d_p | AMD, EPYC, EDR | | 64-core, 1TB RAM, AMD EPYC, IB EDR interconnect | 950GB | For high memory jobs
highmem_p, highmem_30d_p | Intel, EDR | | 32-core, 1TB RAM, Intel, IB EDR interconnect | 950GB | For high memory jobs
highmem_p, highmem_30d_p | AMD, Opteron, EDR | | 48-core, 1TB RAM, AMD Opteron, IB EDR interconnect | 950GB | For high memory jobs
highmem_p, highmem_30d_p | AMD, Opteron, QDR | | 48-core, 512GB, AMD Opteron, IB QDR interconnect | 500GB | For high memory jobs
highmem_p, highmem_30d_p | AMD, EPYC, EDR | | 32-core, 512GB RAM, AMD EPYC, IB EDR interconnect | 490GB | For high memory jobs
hugemem_p, hugemem_30d_p | AMD, EPYC, EDR | | 32-core, 2TB RAM, AMD EPYC, IB EDR interconnect | 2000GB | For high memory jobs
hugemem_p, hugemem_30d_p | AMD, EPYC, EDR | | 48-core, 3TB RAM, AMD EPYC, IB EDR interconnect | 3000GB | For high memory jobs
gpu_p, gpu_30d_p | GPU, A100, EDR | | 64-core, 1000GB RAM, AMD EPYC, 4 NVIDIA A100 GPUs, EDR IB interconnect | 1000GB | For GPU-enabled jobs.
gpu_p, gpu_30d_p | GPU, P100, EDR | | 32-core, 192GB RAM, Intel Xeon Skylake, 1 NVIDIA P100 GPUs, EDR IB interconnect | 180GB | For GPU-enabled jobs.
gpu_p, gpu_30d_p | GPU, K40, QDR | | 16-core, 128GB RAM, Intel Xeon, 8 NVIDIA K40 GPUs, QDR IB interconnect | 120GB | For GPU-enabled jobs.
gpu_p, gpu_30d_p | GPU, K20, QDR | | 12-core, 96GB RAM, Intel Xeon, 7 NVIDIA K20Xm GPUs, QDR IB interconnect | 70GB | For GPU-enabled jobs.

You can check all partitions (queues) defined in the cluster with the command

sinfo

Back to Top

Job submission Scripts

Users are required to specify the number of cores, the amount of memory, the partition (queue) name, and the maximum wallclock time needed by the job.

Header lines

Basic job submission script

At a minimum, the job submission script needs to have the following header lines:

#!/bin/bash
#SBATCH --partition=batch
#SBATCH --job-name=test
#SBATCH --ntasks=1
#SBATCH --time=4:00:00
#SBATCH --mem=10G

Commands to run your application should be added after these header lines.

Header lines explained:

  • #!/bin/bash: specify Linux default shell bash
  • #SBATCH --partition=batch : specify the partition (queue) to run on, e.g. batch
  • #SBATCH --job-name=test : specify the job name, e.g. test
  • #SBATCH --ntasks=1 : specify the number of tasks (e.g. 1)
  • #SBATCH --time=4:00:00 : specify the maximum walltime of the job in the format D-HH:MM:SS (e.g. --time=1- for one day or --time=4:00:00 for 4 hours)
  • #SBATCH --mem=10G : specify the maximum memory per node required by the job (e.g. 10GB)

Below are some of the most commonly used queueing system options to configure the job.

Options to request resources for the job

  • -t, --time=time
   Wall clock time limit of a job running on cluster. Acceptable formats include "minutes", "minutes:seconds", "hours:minutes:seconds", "days-hours", "days-hours:minutes", and "days-hours:minutes:seconds". This is a required option.
  • --mem=num
   Maximum amount of memory in MegaBytes per node required by the job. Different units can be specified using the suffix [K|M|G|T].
  • --mem-per-cpu=num
   Minimum amount of memory in MegaBytes per allocated CPU. Different units can be specified using the suffix [K|M|G|T].
  • -n, --ntasks=num
   Number of tasks to run. The default is one task per node. For use with distributed parallelism. See below.
  • -N, --nodes=num
   Number of nodes allocated to the job. Default is one node. 
  • --ntasks-per-node=num
   Number of tasks invoked on each node. Meant to be used with the --nodes option. For use with distributed parallelism. See below.
  • -c, --cpus-per-task=ncpus
   Number of CPUs allocated to each task. For use with shared memory parallelism. See below.
  • -C, --constraint=<list>
   List of node features required by the job.  Only nodes having features matching the job constraints will be used to satisfy the request.  Multiple constraints may be specified with AND, OR, matching OR, resource  counts,  etc. 
  • --gres=<list>
   A comma  delimited  list  of  generic  consumable  resources. For example, to request one P100 GPU card: --gres=gpu:P100:1 


Please try to request resources for your job as accurately as possible, because this allows your job to be dispatched to run at the earliest opportunity and it helps the system allocate resources efficiently to start as many jobs as possible, benefiting all users.

Options to manage job notification and output

  • -J, --job-name jobname
   Specify a name for the job. The specified name will appear along with the job id number when querying running jobs on the system. The default is the supplied executable program's name. Within the job, $SLURM_JOB_NAME expands to the job name.
  • -o, --output=path/for/stdout
   Send stdout to path/for/stdout. The default filename is slurm-${SLURM_JOB_ID}.out, e.g. slurm-12345.out, in the directory from which the job was submitted.
  • -e, --error=path/for/stderr
   Send stderr to path/for/stderr. If --error is not specified, both stdout and stderr will be directed to the file specified by --output.
  • --mail-user=username@uga.edu
   Send email notification to the address you specified when certain events occur.
  • --mail-type=type
   Notify user by email when certain event types occur. Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL, TIME_LIMIT, TIME_LIMIT_90 (reached 90 percent of time limit), TIME_LIMIT_80 and TIME_LIMIT_50.

By default, email notifications set for an array job will generate one email message for the array job. If you would like to receive an email message for individual array job elements (up to a certain limit), please add ARRAY_TASKS to the --mail-type option.

Options to set Array Jobs

If you wish to run an application binary or script using e.g. different input files, then you might find it convenient to use an array job. To create an array job with e.g. 10 elements, use

#SBATCH -a 0-9

or

#SBATCH --array=0-9

Each array job element runs as an independent job, so multiple elements can run concurrently if resources are available. For this reason, the job ID stored in SLURM_JOB_ID is different and unique for each element of an array job. The ID of each element in an array job, i.e., the array element index value, is stored in SLURM_ARRAY_TASK_ID. The ID of the array job as a whole is stored in SLURM_ARRAY_JOB_ID, so it is the same for all elements of an array job. The JobID reported by the sq command is a combination of SLURM_ARRAY_JOB_ID and SLURM_ARRAY_TASK_ID connected by "_".

sbatch --array=1-3 -N1 sub.sh

will generate a job array containing three jobs. If the sbatch command responds
Submitted batch job 36
then the environment variables will be set as follows:

SLURM_JOB_ID=36
SLURM_ARRAY_JOB_ID=36
SLURM_ARRAY_TASK_ID=1
SLURM_ARRAY_TASK_COUNT=3
SLURM_ARRAY_TASK_MAX=3
SLURM_ARRAY_TASK_MIN=1

SLURM_JOB_ID=37
SLURM_ARRAY_JOB_ID=36
SLURM_ARRAY_TASK_ID=2
SLURM_ARRAY_TASK_COUNT=3
SLURM_ARRAY_TASK_MAX=3
SLURM_ARRAY_TASK_MIN=1

SLURM_JOB_ID=38
SLURM_ARRAY_JOB_ID=36
SLURM_ARRAY_TASK_ID=3
SLURM_ARRAY_TASK_COUNT=3
SLURM_ARRAY_TASK_MAX=3
SLURM_ARRAY_TASK_MIN=1

Most Slurm commands recognize the SLURM_ARRAY_JOB_ID plus SLURM_ARRAY_TASK_ID values separated by an underscore as identifying an element of a job array. For example, 36_2 identifies the second array element of array job 36 (the element with SLURM_JOB_ID=37 in the example above).
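Inside a job script, the same identifier can be assembled from the environment variables. This is a minimal sketch; the fallback values are taken from the example above and are only there so that the lines can also be tried outside a job:

```shell
#!/bin/bash
# Fallback values (from the example above) let this sketch run outside a job;
# inside an array job Slurm sets all three variables itself.
: "${SLURM_JOB_ID:=37}"
: "${SLURM_ARRAY_JOB_ID:=36}"
: "${SLURM_ARRAY_TASK_ID:=2}"

# Array-style identifier, as understood by most Slurm commands
element_id="${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}"

echo "Array element ${element_id} has job ID ${SLURM_JOB_ID}"
```

The resulting identifier can be passed to Slurm commands, for example scontrol show job 36_2.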

For more information, please see Array Jobs.

Option to set job dependency

You can set a job dependency with the option -d or --dependency=dependency-list. For example, if you want to specify that one job starts to run only after jobs 1234 and 1235 have successfully executed (ran to completion with an exit code of zero), you can add the following header line in the job submission script of the job:

#SBATCH --dependency=afterok:1234:1235

Having this header line in the job submission script will ensure that the job is only dispatched to run after jobs 1234 and 1235 have completed successfully.

You can also use the following header line to specify that one job starts to run after jobs 1236 and 1237 start or are cancelled:

#SBATCH --dependency=after:1236:1237
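When jobs are submitted from a script, the dependency list can be assembled from job IDs captured at submission time. The sketch below assumes a helper named build_dependency and three job scripts (step1.sh, step2.sh, final.sh), all of which are hypothetical names. It uses the --parsable option of sbatch, which prints only the job ID (followed by a semicolon and the cluster name on multi-cluster setups):

```shell
#!/bin/bash
# Build a dependency specification such as "afterok:1234:1235".
# build_dependency is an illustrative helper, not a Slurm command.
build_dependency() {
    local type="$1"
    shift
    local IFS=:
    echo "${type}:$*"      # "$*" joins the job IDs with ":"
}

build_dependency afterok 1234 1235     # prints afterok:1234:1235

# Attempt a real submission only where sbatch and the (hypothetical)
# job scripts are available.
if command -v sbatch >/dev/null 2>&1 \
   && [ -f step1.sh ] && [ -f step2.sh ] && [ -f final.sh ]; then
    jid1=$(sbatch --parsable step1.sh)
    jid2=$(sbatch --parsable step2.sh)
    sbatch --dependency="$(build_dependency afterok "$jid1" "$jid2")" final.sh
fi
```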

Options to requeue or not requeue a job when a node crashes

If a job is running and one or more nodes that it is using crash, the job will stop running and, by default, it will get requeued. When resources become available, the job will start running again, from the beginning, unless the program saves intermediate results and it is able to automatically pick up from where it stopped. The files with the standard error and standard output of the job will get rewritten once the job restarts. Often other output files will get rewritten as well.

If you are running a program that cannot restart, e.g. the program will fail if a certain output file or directory has already been created, or if you would like to preserve the partial results, you can use the following option to prevent the job from being requeued:

#SBATCH --no-requeue

When this option is used, the job will simply stop if a node crashes; it will not be requeued. In this case, partial results and the standard error and output of the job will not get overwritten.

Although requeueing jobs is enabled by default now, you can also add the option below if you would like to ensure a job is requeued in the event of a node crash:

#SBATCH --requeue
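If your program can resume from saved intermediate results, the job script can decide whether to start fresh or to resume. The sketch below assumes a checkpoint file named checkpoint.dat and a helper named choose_start_mode (both hypothetical). It relies on SLURM_RESTART_COUNT, which Slurm sets to the number of times a job has been restarted or requeued:

```shell
#!/bin/bash
# SLURM_RESTART_COUNT is set by Slurm when a job has been requeued or
# restarted; the fallback of 0 lets the sketch run outside a job.
restart_count="${SLURM_RESTART_COUNT:-0}"

# checkpoint.dat is a hypothetical file that your program would write periodically.
checkpoint_file="${CHECKPOINT_FILE:-checkpoint.dat}"

# Illustrative helper: print "resume" or "fresh".
choose_start_mode() {
    # Resume only if this is a restart and a checkpoint actually exists.
    if [ "$1" -gt 0 ] && [ -f "$2" ]; then
        echo "resume"
    else
        echo "fresh"
    fi
}

mode=$(choose_start_mode "$restart_count" "$checkpoint_file")
echo "Start mode: ${mode}"
```

How the program is actually told to resume depends on the application, so that part is left out here.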

Other content of the script

Following the header lines, users can include commands to change to the working directory, to load the modules needed to run the application, and to invoke the application. For example, to use the directory from which the job is submitted as the working directory (where to find input files or binaries), add the line

cd $SLURM_SUBMIT_DIR

(Note that Slurm jobs start from the submit directory by default, so adding the line above might not be necessary.)

You can then load the needed modules. For example, if you are running an R program, then include the line

module load R/4.3.1-foss-2022a

Then invoke your application. For example, if you are running an R program called add.R which is in your job submission directory, use

R CMD BATCH add.R

Environment Variables exported by batch jobs

When a batch job is started, a number of variables are introduced into the job's environment that can be used by the batch script in making decisions, creating output files, and so forth. Some of these variables are listed in the following table:

Variable Description
SLURM_ARRAY_JOB_ID Job array's master job ID number, i.e., the first Slurm job id of a job array
SLURM_ARRAY_TASK_COUNT Total number of tasks (elements) in a job array
SLURM_ARRAY_TASK_ID Job array ID (index) number
SLURM_ARRAY_TASK_MAX Job array's maximum ID (index) number
SLURM_ARRAY_TASK_MIN Job array's minimum ID (index) number
SLURM_CPUS_ON_NODE Number of CPUS on the allocated node
SLURM_CPUS_PER_TASK Number of cpus requested per task. Only set if the --cpus-per-task option is specified
SLURM_JOB_ID Unique Slurm job id
SLURM_JOB_NAME Job name
SLURM_JOB_CPUS_PER_NODE Count of processors available to the job on this node
SLURM_JOB_NODELIST List of nodes allocated to the job
SLURM_JOB_NUM_NODES Total number of nodes in the job's resource allocation
SLURM_JOB_PARTITION Name of the partition (i.e. queue) in which the job is running
SLURM_MEM_PER_NODE Same as --mem
SLURM_MEM_PER_CPU Same as --mem-per-cpu
SLURM_NTASKS Same as -n, --ntasks
SLURM_NTASKS_PER_NODE Number of tasks requested per node. Only set if the --ntasks-per-node option is specified
SLURM_SUBMIT_DIR The directory from which sbatch was invoked
SLURM_SUBMIT_HOST The hostname of the computer from which sbatch was invoked
SLURM_TASK_PID The process ID of the task being started
SLURMD_NODENAME Name of the node running the job script
CUDA_VISIBLE_DEVICES GPU device ID(s) assigned to the job
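As a minimal sketch of how these variables can be used, the lines below print a short summary at the start of a job, which can help when reading the output log later. The function name job_summary is illustrative, and the fallback values are only there so that the sketch also runs outside a job:

```shell
#!/bin/bash
# Print a short summary of the job environment. The ${VAR:-default}
# fallbacks are only there so that the sketch also runs outside a job.
job_summary() {
    echo "Job ${SLURM_JOB_ID:-unknown} (${SLURM_JOB_NAME:-unnamed})"
    echo "Partition: ${SLURM_JOB_PARTITION:-unknown}"
    echo "Nodes: ${SLURM_JOB_NUM_NODES:-1} (${SLURM_JOB_NODELIST:-unknown})"
    echo "Submitted from: ${SLURM_SUBMIT_DIR:-$PWD}"
}

job_summary
```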



Back to Top

Sample job submission scripts

Serial (single-processor) Job

Sample job submission script (sub.sh) to run an R program called add.R using a single core:

#!/bin/bash
#SBATCH --job-name=testserial         # Job name
#SBATCH --partition=batch             # Partition (queue) name
#SBATCH --ntasks=1                    # Run on a single CPU
#SBATCH --mem=1gb                     # Job memory request
#SBATCH --time=02:00:00               # Time limit hrs:min:sec
#SBATCH --output=testserial.%j.out    # Standard output log
#SBATCH --error=testserial.%j.err     # Standard error log

#SBATCH --mail-type=END,FAIL          # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=username@uga.edu  # Where to send mail (change username@uga.edu to your email address)

cd $SLURM_SUBMIT_DIR

module load R/4.3.1-foss-2022a

R CMD BATCH add.R

In this sample script, the standard output and error of the job will be saved into files called testserial.%j.out and testserial.%j.err, where %j will be automatically replaced by the job id of the job.

Serial (single-processor) Job on an AMD EPYC Milan processor

Sample job submission script (sub.sh) to run an R program called add.R using a single core:

#!/bin/bash
#SBATCH --job-name=testserial         # Job name
#SBATCH --partition=batch             # Partition (queue) name
#SBATCH --constraint=Milan            # node feature
#SBATCH --ntasks=1                    # Run on a single CPU
#SBATCH --mem=1gb                     # Job memory request
#SBATCH --time=02:00:00               # Time limit hrs:min:sec
#SBATCH --output=testserial.%j.out    # Standard output log
#SBATCH --error=testserial.%j.err     # Standard error log

#SBATCH --mail-type=END,FAIL          # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=username@uga.edu  # Where to send mail (change username@uga.edu to your email address)

cd $SLURM_SUBMIT_DIR

module load R/4.3.1-foss-2022a

R CMD BATCH add.R

In this sample script, the standard output and error of the job will be saved into files called testserial.%j.out and testserial.%j.err, where %j will be automatically replaced by the job id of the job.

MPI Job

Sample job submission script (sub.sh) to run an OpenMPI application. In this example the job requests 16 cores and further specifies that these 16 cores need to be divided equally on 2 nodes (8 cores per node) and the binary is called mympi.exe:

#!/bin/bash
#SBATCH --job-name=mpitest            # Job name
#SBATCH --partition=batch             # Partition (queue) name
#SBATCH --nodes=2                     # Number of nodes
#SBATCH --ntasks=16                   # Number of MPI ranks
#SBATCH --ntasks-per-node=8           # How many tasks on each node
#SBATCH --cpus-per-task=1             # Number of cores per MPI rank 
#SBATCH --mem-per-cpu=600mb           # Memory per processor
#SBATCH --time=02:00:00               # Time limit hrs:min:sec
#SBATCH --output=mpitest.%j.out       # Standard output log
#SBATCH --error=mpitest.%j.err        # Standard error log

#SBATCH --mail-type=END,FAIL          # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=username@uga.edu  # Where to send mail (change username@uga.edu to your email address)

cd $SLURM_SUBMIT_DIR

module load OpenMPI/4.1.4-GCC-11.3.0

srun ./mympi.exe

Please note that you need to start the application with srun and not with mpirun or mpiexec.

Important note: MPI jobs need to be submitted from a Sapelo2 login node, not from an interactive session, in order to get the correct core allocation for the MPI processes.

MPI Job on nodes connected via the EDR IB fabric

Sample job submission script (sub.sh) to run an OpenMPI application. In this example the job requests 16 cores and further specifies that these 16 cores need to be divided equally on 2 nodes (8 cores per node) and the binary is called mympi.exe:

#!/bin/bash
#SBATCH --job-name=mpitest            # Job name
#SBATCH --partition=batch             # Partition (queue) name
#SBATCH --constraint=EDR              # node feature
#SBATCH --nodes=2                     # Number of nodes
#SBATCH --ntasks=16                   # Number of MPI ranks
#SBATCH --ntasks-per-node=8           # How many tasks on each node
#SBATCH --cpus-per-task=1             # Number of cores per MPI rank 
#SBATCH --mem-per-cpu=600mb           # Memory per processor
#SBATCH --time=02:00:00               # Time limit hrs:min:sec
#SBATCH --output=mpitest.%j.out       # Standard output log
#SBATCH --error=mpitest.%j.err        # Standard error log

#SBATCH --mail-type=END,FAIL          # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=username@uga.edu  # Where to send mail (change username@uga.edu to your email address)

cd $SLURM_SUBMIT_DIR

module load OpenMPI/4.1.4-GCC-11.3.0

srun ./mympi.exe

Please note that you need to start the application with srun and not with mpirun or mpiexec.

Important note: MPI jobs need to be submitted from a Sapelo2 login node, not from an interactive session, in order to get the correct core allocation for the MPI processes.

OpenMP (Multi-Thread) Job

Sample job submission script (sub.sh) to run a program that uses OpenMP with 6 threads. Please set --ntasks=1 and set --cpus-per-task to the number of threads you wish to use. The name of the binary in this example is a.out.

#!/bin/bash
#SBATCH --job-name=mctest             # Job name
#SBATCH --partition=batch             # Partition (queue) name
#SBATCH --ntasks=1                    # Run a single task	
#SBATCH --cpus-per-task=6             # Number of CPU cores per task
#SBATCH --mem=4gb                     # Job memory request
#SBATCH --time=02:00:00               # Time limit hrs:min:sec
#SBATCH --output=mctest.%j.out        # Standard output log
#SBATCH --error=mctest.%j.err         # Standard error log

#SBATCH --mail-type=END,FAIL          # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=username@uga.edu  # Where to send mail (change username@uga.edu to your email address)

cd $SLURM_SUBMIT_DIR

export OMP_NUM_THREADS=6  

module load foss/2022a  # load the appropriate module file, e.g. foss/2022a

time ./a.out
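To avoid a mismatch between the number of threads and the number of cores requested, the thread count can be taken from the job environment instead of being typed twice. A minimal sketch, with a fallback of 1 thread for the case where --cpus-per-task was not specified:

```shell
#!/bin/bash
# Inside a job, SLURM_CPUS_PER_TASK is set when --cpus-per-task is given;
# fall back to 1 thread if it is not set.
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"

echo "Running with ${OMP_NUM_THREADS} OpenMP thread(s)"
```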

High Memory Job

Sample job submission script (sub.sh) to run a velvet application that needs to use 200GB of memory and 4 threads:

#!/bin/bash
#SBATCH --job-name=highmemtest        # Job name
#SBATCH --partition=highmem_p         # Partition (queue) name
#SBATCH --ntasks=1                    # Run a single task	
#SBATCH --cpus-per-task=4             # Number of CPU cores per task
#SBATCH --mem=200gb                   # Job memory request
#SBATCH --time=02:00:00               # Time limit hrs:min:sec
#SBATCH --output=highmemtest.%j.out   # Standard output log
#SBATCH --error=highmemtest.%j.err    # Standard error log

#SBATCH --mail-type=END,FAIL          # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=username@uga.edu  # Where to send mail (change username@uga.edu to your email address)


cd $SLURM_SUBMIT_DIR

export OMP_NUM_THREADS=4

module load Velvet

velvetg [options]

Hybrid MPI/shared-memory using OpenMPI

Sample job submission script (sub.sh) to run a parallel job that uses 8 MPI processes with OpenMPI (4 on each of 2 nodes), where each MPI process runs with 3 threads:

#!/bin/bash
#SBATCH --job-name=hybridtest
#SBATCH --partition=batch             # Partition (queue) name
#SBATCH --nodes=2                     # Number of nodes
#SBATCH --ntasks=8                    # Number of MPI ranks
#SBATCH --ntasks-per-node=4           # Number of MPI ranks per node
#SBATCH --cpus-per-task=3             # Number of OpenMP threads for each MPI process/rank
#SBATCH --mem-per-cpu=2000mb          # Per processor memory request
#SBATCH --time=2-00:00:00             # Walltime in hh:mm:ss or d-hh:mm:ss (2 days in the example)
#SBATCH --output=hybridtest.%j.out    # Standard output log
#SBATCH --error=hybridtest.%j.err     # Standard error log
 
#SBATCH --mail-type=END,FAIL          # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=username@uga.edu  # Where to send mail (change username@uga.edu to your email address)


cd $SLURM_SUBMIT_DIR

module load OpenMPI/4.1.4-GCC-11.3.0

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

srun ./myhybridprog.exe
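The total resources requested by the header lines of this sample script can be worked out explicitly. The arithmetic below copies the values from the script above:

```shell
#!/bin/bash
# Values copied from the header lines of the sample script above
nodes=2
ntasks_per_node=4
cpus_per_task=3
mem_per_cpu_mb=2000

total_tasks=$(( nodes * ntasks_per_node ))               # MPI ranks
total_cores=$(( total_tasks * cpus_per_task ))           # cores over all nodes
cores_per_node=$(( ntasks_per_node * cpus_per_task ))
mem_per_node_mb=$(( cores_per_node * mem_per_cpu_mb ))

echo "MPI ranks: ${total_tasks}"                              # prints MPI ranks: 8
echo "Cores: ${total_cores} (${cores_per_node} per node)"     # prints Cores: 24 (12 per node)
echo "Memory per node: ${mem_per_node_mb} MB"                 # prints Memory per node: 24000 MB
```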

Array job

Sample job submission script (sub.sh) to submit an array job with 10 elements. In this example, each array job element will run the a.out binary using an input file called input_0, input_1, ..., input_9.

#!/bin/bash
#SBATCH --job-name=arrayjobtest       # Job name
#SBATCH --partition=batch             # Partition (queue) name
#SBATCH --ntasks=1                    # Run a single task
#SBATCH --mem=1gb                     # Job Memory
#SBATCH --time=10:00:00               # Time limit hrs:min:sec
#SBATCH --output=array_%A-%a.out      # Standard output log
#SBATCH --error=array_%A-%a.err       # Standard error log
#SBATCH --array=0-9                   # Array range

cd $SLURM_SUBMIT_DIR

module load foss/2022a # load any needed module files, e.g. foss/2022a

time ./a.out < input_$SLURM_ARRAY_TASK_ID

For more information, please see Array Jobs.
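If your input files do not follow a numbered naming pattern, one common approach is to keep their names in a list file and let each array element pick one line. The sketch below assumes a helper named pick_input and sample file names that are purely illustrative:

```shell
#!/bin/bash
# Print the line of a list file that corresponds to an array index.
# Array indices in the sample script start at 0, file lines start at 1.
# pick_input is an illustrative helper.
pick_input() {
    local list_file="$1" index="$2"
    sed -n "$(( index + 1 ))p" "$list_file"
}

# Small demonstration list; in practice this would be your own file
# (e.g. a hypothetical inputs.txt) with one input file name per line.
list_file=$(mktemp)
printf '%s\n' sampleA.fa sampleB.fa sampleC.fa > "$list_file"

# The fallback index 0 lets the sketch run outside an array job
input_file=$(pick_input "$list_file" "${SLURM_ARRAY_TASK_ID:-0}")
echo "This array element would process ${input_file}"

rm -f "$list_file"
```

With this approach, the range given to --array should match the number of lines in the list file.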

GPU/CUDA

Sample script to run Amber on a GPU node using one node, 2 CPU cores, and 1 GPU card:

#!/bin/bash
#SBATCH --job-name=amber              # Job name
#SBATCH --partition=gpu_p             # Partition (queue) name
#SBATCH --gres=gpu:A100:1             # Requests one A100 GPU device
#SBATCH --ntasks=1                    # Run a single task	
#SBATCH --cpus-per-task=2             # Number of CPU cores per task
#SBATCH --mem=40gb                    # Job memory request
#SBATCH --time=10:00:00               # Time limit hrs:min:sec
#SBATCH --output=amber.%j.out         # Standard output log
#SBATCH --error=amber.%j.err          # Standard error log

#SBATCH --mail-type=END,FAIL          # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=username@uga.edu  # Where to send mail (change username@uga.edu to your email address)

cd $SLURM_SUBMIT_DIR

ml Amber/22.0-foss-2021b-AmberTools-22.3-CUDA-11.4.1

$AMBERHOME/bin/pmemd.cuda -O -i ./prod.in -o prod.out  -p ./dimerFBP_GOL.prmtop -c ./restart.rst -r prod.rst -x prod.mdcrd

You can explicitly request a GPU device type for your job. For example:

  • To request an A100 device, use #SBATCH --gres=gpu:A100:1
  • To request an H100 device, use #SBATCH --gres=gpu:H100:1
  • To request an L4 device, use #SBATCH --gres=gpu:L4:1
  • To request a P100 device, use #SBATCH --gres=gpu:P100:1

Jobs that request a GPU, but that do not specify the device type (that is, jobs that use #SBATCH --gres=gpu:1) will get allocated any device type, some of which might not work for the application that you are running. Please check which GPU device is supported by the application or code your job is running and request the corresponding GPU device type. For more information about the GPU resources available on Sapelo2, please see https://wiki.gacrc.uga.edu/wiki/GPU and https://wiki.gacrc.uga.edu/wiki/GPU_Hardware.
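Within a job, you can check which GPU device(s) were actually allocated. This is a minimal sketch, assuming a helper named count_visible_gpus (an illustrative name) that counts the entries in CUDA_VISIBLE_DEVICES; where nvidia-smi is installed, it also prints the device model:

```shell
#!/bin/bash
# Count the GPU devices listed in a CUDA_VISIBLE_DEVICES-style string,
# e.g. "0" or "0,1". count_visible_gpus is an illustrative helper.
count_visible_gpus() {
    local devices="$1" ids
    if [ -z "$devices" ]; then
        echo 0
        return
    fi
    IFS=, read -r -a ids <<< "$devices"
    echo "${#ids[@]}"
}

echo "GPUs visible to this job: $(count_visible_gpus "${CUDA_VISIBLE_DEVICES:-}")"

# nvidia-smi reports the device model on nodes where it is installed
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=name --format=csv,noheader || true
fi
```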

Singularity job

Sample job submission script (sub.sh) to run a program (e.g. sortmerna) using a singularity container:

#!/bin/bash
#SBATCH --job-name=j_sortmerna        # Job name
#SBATCH --partition=batch             # Partition (queue) name
#SBATCH --ntasks=1                    # Run a single task
#SBATCH --mem=1gb                     # Job memory request
#SBATCH --time=02:00:00               # Time limit hrs:min:sec
#SBATCH --output=sortmerna.%j.out     # Standard output log
#SBATCH --error=sortmerna.%j.err      # Standard error log
#SBATCH --cpus-per-task=4             # Number of CPU cores per task
#SBATCH --mail-type=END,FAIL          # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=username@uga.edu  # Where to send mail (change username@uga.edu to your email address)

cd $SLURM_SUBMIT_DIR

singularity exec /apps/singularity-images/sortmerna-3.0.3.simg sortmerna \
--threads 4 --ref db.fasta,db.idx --reads file.fa --aligned base_name_output

For more information about software installed as singularity containers on the cluster, please see Software_on_Sapelo2#Singularity_Containers

To run a GPU-enabled singularity container on the GPU, please submit the job to the gpu_p partition, request a GPU device and add the --nv option to the singularity command.

Sample job submission script (sub.sh) to run a program using a singularity container (e.g. gpuapp.sif) on the GPU device:

#!/bin/bash
#SBATCH --job-name=myjobname          # Job name
#SBATCH --partition=gpu_p             # Partition (queue) name
#SBATCH --gres=gpu:1                  # Requests one GPU device 
#SBATCH --ntasks=1                    # Run on a single CPU
#SBATCH --mem=10gb                    # Job memory request
#SBATCH --time=02:00:00               # Time limit hrs:min:sec
#SBATCH --cpus-per-task=1             # Number of CPU cores per task

cd $SLURM_SUBMIT_DIR

singularity exec --nv /apps/singularity-images/gpuapp.sif prog.x  

For more information about software installed as singularity containers on the cluster, please see Software_on_Sapelo2#Singularity_Containers


Back to Top

How to submit a batch job

With the resource requirements specified in the job submission script (sub.sh), submit your job with

sbatch <scriptname>

For example

sbatch sub.sh

Once the job is submitted, the Job ID of the job (e.g. 12345) will be printed on the screen.
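If you submit jobs from a script, it can be convenient to capture the Job ID for later use. Below is a minimal sketch, assuming that sbatch prints a message of the form "Submitted batch job 12345"; job_id_from_message is an illustrative helper name:

```shell
#!/bin/bash
# Extract the job ID from the message that sbatch prints,
# e.g. "Submitted batch job 12345".
job_id_from_message() {
    echo "${1##* }"     # keep everything after the last space
}

job_id_from_message "Submitted batch job 12345"    # prints 12345

# Real submission, attempted only where sbatch and sub.sh are available
if command -v sbatch >/dev/null 2>&1 && [ -f sub.sh ]; then
    jobid=$(job_id_from_message "$(sbatch sub.sh)")
    echo "Submitted job ${jobid}"
fi
```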


Back to Top

Discovering if a partition (queue) is busy

The nodes allocated to each partition (queue) and their state can be viewed with the command

sinfo

Sample output of the sinfo command:

PARTITION AVAIL  TIMELIMIT   NODES  STATE NODELIST 
batch*       up  7-00:00:00      1 drain* ra4-2 
batch*       up  7-00:00:00      3  down* d4-7,ra3-19,ra4-12 
batch*       up  7-00:00:00      1    mix b1-2 
batch*       up  7-00:00:00      1  alloc b1-3 
batch*       up  7-00:00:00     53   idle b1-[4-24],c1-3,c5-19,d4-[5-6,8-12],ra3-[1-18,20-24]
gpu_p        up  7-00:00:00      1    mix c4-23 
highmem_p    up  7-00:00:00      6   idle d4-[11-12],ra4-[21-24] 
inter_p      up  2-00:00:00      2   idle ra4-[16-17] 

where some common values of STATE are:

  • STATE=idle indicates that those nodes are completely free.
  • STATE=mix indicates that some cores on those nodes are in use (and some are free).
  • STATE=alloc indicates that all cores on those nodes are in use.
  • STATE=drain indicates that nodes are draining, not accepting new jobs
  • STATE=down indicates that nodes are not running or accepting new jobs
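To get a quick count of free nodes, the sinfo output can be summarized with awk. The sketch below assumes the default column layout shown in the sample output above; idle_nodes_per_partition is an illustrative helper name, and the demonstration input is a few lines copied from that sample:

```shell
#!/bin/bash
# Sum the idle nodes per partition from the default sinfo output
# (columns: PARTITION AVAIL TIMELIMIT NODES STATE NODELIST).
idle_nodes_per_partition() {
    awk 'NR > 1 && $5 == "idle" {
             sub(/\*$/, "", $1)      # drop the * that marks the default partition
             idle[$1] += $4
         }
         END { for (p in idle) print p, idle[p] }' | sort
}

# Demonstration on a few lines of the sample output above
idle_nodes_per_partition <<'EOF'
PARTITION AVAIL  TIMELIMIT   NODES  STATE NODELIST
batch*       up  7-00:00:00      1    mix b1-2
batch*       up  7-00:00:00     53   idle b1-[4-24],c1-3,c5-19,d4-[5-6,8-12],ra3-[1-18,20-24]
highmem_p    up  7-00:00:00      6   idle d4-[11-12],ra4-[21-24]
inter_p      up  2-00:00:00      2   idle ra4-[16-17]
EOF
```

On the cluster, the same function can be used as sinfo | idle_nodes_per_partition.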

This command can be used with many options. We have configured one option that shows some quantities that are commonly of interest, including the node features defined for each node. This command is

sinfo-gacrc

You can also specify the number of characters displayed in the NODELIST column (e.g. 40) and in the AVAIL_FEATURES column (e.g. 50), with

sinfo-gacrc 40 50

Sample output of the sinfo-gacrc command:

PARTITION       NODELIST           STATE      CPUS  MEMORY   AVAIL_FEATURES        GRES       
batch*          ra4-2              drained*   32    126000   AMD,Opteron,QDR      lscratch:230         
batch*          ra3-19             down*      32    126000   AMD,Opteron,QDR      lscratch:230   
batch*          ra4-12             down*      32    126000   AMD,Opteron,QDR      lscratch:230
batch*          b1-3               mixed      64    126976   AMD,EPYC,Rome,EDR    lscratch:890     
batch*          b1-2               allocated  64    126976   AMD,EPYC,Rome,EDR    lscratch:890
batch*          b1-[4-24]          idle       64    126976   AMD,EPYC,Rome,EDR    lscratch:890    
batch*          c1-3               idle       28    59127    Intel,Broadwell,EDR  lscratch:890     
batch*          c5-19              idle       32    187868   Intel,Skylake,EDR    lscratch:890    
batch*          d4-[5-6]           idle       32    126976   AMD,EPYC,Naples,EDR  lscratch:890    
batch*          d4-[8-12]          idle       32    126976+  AMD,EPYC,Naples,EDR  lscratch:890     
batch*          ra3-[1-18,20-24]   idle       32    126000   AMD,Opteron,QDR      lscratch:230        
gpu_p           c4-23              idle       32    187868   Intel,Skylake,EDR    gpu:P100:1,lscratch:890 
highmem_p       d4-[11-12]         idle       32    514048   AMD,EPYC,Naples,EDR  lscratch:890   
highmem_p       ra4-[21-24]        idle       32    126000   AMD,Opteron,QDR      lscratch:230
inter_p         ra4-[16-17]        idle       32    126000   AMD,Opteron,QDR      lscratch:230
scavenge_p      rb7-18             idle       28    515780   Intel,Broadwell,QDR  lscratch:180

Back to Top

What is the scavenge_p partition

A portion of the Sapelo2 compute nodes were purchased by UGA PIs and their group members have priority in using those resources (also referred to as buyin nodes). The GACRC purchased the rest on UGA's behalf. The agreement for the PI-owned nodes allows "other users" to also run jobs on owned nodes, as long as those jobs don't cause that lab group to wait over two hours for access to its nodes. We have implemented a partition called scavenge_p and short jobs (for example, jobs that request less than 4h) submitted to the 'batch' partition might be automatically moved into the scavenge_p partition if the 'batch' partition is busy. This is a way to reduce the wait time of the short jobs, while making use of the buyin nodes that are not in use. Jobs running on the buyin nodes (or any nodes) cannot be dynamically migrated to other nodes, so buyin-group users might have to wait up to 4h to access their nodes, if there are jobs running in the scavenge_p partition.

Users cannot submit jobs directly to the scavenge_p partition, but if you submitted short jobs to the batch partition, you might see them running on the scavenge_p partition.


Back to Top

How to request a specific node feature

Each compute node has a set of features, such as shown with the sinfo-gacrc command above. Common features are Intel (if the node has Intel processors), AMD (if the node has AMD processors), EPYC (if the node has AMD EPYC processors), or specific EPYC processor types, such as Rome, Milan, etc. You can request using nodes with a specific feature by adding the following header line in your job submission script:

#SBATCH --constraint=featurename

where featurename needs to be replaced by the feature you want to use. For example, to request that the job goes to a node that has a Milan processor, use

#SBATCH --constraint=Milan

Back to Top

How to run Intel- or AMD-specific applications

Most of the applications that GACRC installs centrally can be run on Intel and on AMD processors, but some exceptions do exist. Also, some third-party applications that you are using might have been pre-compiled for a given processor type and would fail if run on a different processor architecture. If an application that you are using is only compatible with one type of processor (e.g. Intel), you can request that node feature by adding the following line in your job submission script

#SBATCH --constraint=Intel

or

#SBATCH --constraint=EPYC

or

#SBATCH --constraint=Milan

Back to Top

How to run a job using the local scratch /lscratch on a compute node

The local scratch file system /lscratch offers much faster IO than the network file system /scratch. Please note that the local scratch file system can only be used for running single-node jobs, i.e., single-core jobs or multi-thread jobs. In general, MPI parallel jobs that use more than one node cannot use the local scratch file system. Detailed information and instructions about /lscratch can be found at Disk_Storage#lscratch_file_system.

To use /lscratch to run a batch job, you need a few additional steps in your job submission script to ask your job to:

  1. Create a job working folder in /lscratch on the compute node where your job is dispatched
  2. Copy any input files required to run the job from your current working space, e.g., /scratch/MyID, to the folder created in step 1
  3. Change directory from your current working space /scratch/MyID to the folder created in step 1 and run the software from there, i.e. from the local scratch file system /lscratch
  4. Copy output results from /lscratch back to your /scratch/MyID, before the job finishes and exits from the node
  5. Clean up in /lscratch, before the job finishes and exits from the node

To use /lscratch to run a batch job, you also need to:

1. Make sure that your job will use a single node by using the following line in your job submission script:

#SBATCH --nodes=1

2. Request an appropriate amount of disk storage from the local scratch file system by adding the following line in your job submission script:

#SBATCH --gres=lscratch:200

The above header requests 200GB local storage on the compute node where your job is dispatched.

Below is a sample job submission script (sub.sh) to run a batch job using /lscratch:

#!/bin/bash
#SBATCH --job-name=RM_job
#SBATCH --partition=batch
#SBATCH --nodes=1
#SBATCH --gres=lscratch:200
#SBATCH --ntasks=12
#SBATCH --mem=36G
#SBATCH --time=7-00:00:00
#SBATCH --output=log.%j.out
#SBATCH --error=log.%j.err

cd $SLURM_SUBMIT_DIR

# Step 1
mkdir -p /lscratch/${USER}/${SLURM_JOB_ID}

# Step 2
cp ./Hawaii_H3_Final.fa /lscratch/${USER}/${SLURM_JOB_ID}

# Step 3
cd /lscratch/${USER}/${SLURM_JOB_ID}

ml RepeatModeler/2.0.4-foss-2022a

BuildDatabase -name E4 -engine ncbi Hawaii_H3_Final.fa
RepeatModeler -engine ncbi -pa 3 -database E4 > E4-repeat.out

# Step 4
cp ./E4* ${SLURM_SUBMIT_DIR}
cp -r ./RM_* ${SLURM_SUBMIT_DIR}
 
# Step 5
rm -rf /lscratch/${USER}/${SLURM_JOB_ID}

Then submit sub.sh from your current working space /scratch/MyID with:

sbatch sub.sh 

Since you submit the job from /scratch/MyID, the value stored in SLURM_SUBMIT_DIR in the above sub.sh will be /scratch/MyID.
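As a variant of the five steps, the cleanup can be registered with trap, so that it also runs if one of the commands fails and the script exits early. The sketch below is only an illustration: it uses temporary directories and a trivial stand-in command instead of /lscratch and a real application, and LSCRATCH_BASE is a hypothetical variable name:

```shell
#!/bin/bash
# Stand-ins so that the sketch can be tried outside a job: in a real job,
# set lscratch_base=/lscratch and let Slurm provide the SLURM_* variables.
submit_dir="${SLURM_SUBMIT_DIR:-$(mktemp -d)}"
lscratch_base="${LSCRATCH_BASE:-$(mktemp -d)}"      # LSCRATCH_BASE is hypothetical
jobdir="${lscratch_base}/${USER:-nobody}/${SLURM_JOB_ID:-0}"

# Step 5 as a function, registered to run whenever the script exits
cleanup() {
    cd "$submit_dir" && rm -rf "$jobdir"
}
trap cleanup EXIT

echo "acgt" > "${submit_dir}/input.txt"             # demonstration input only

mkdir -p "$jobdir"                                  # Step 1
cp "${submit_dir}/input.txt" "$jobdir"              # Step 2
cd "$jobdir" || exit 1                              # Step 3
tr 'a-z' 'A-Z' < input.txt > result.txt             # stand-in for the application
cp result.txt "$submit_dir"                         # Step 4
```

Please note that a trap does not run if the job is killed abruptly, so copying results back (step 4) should still happen as early as practical.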

To find the total amount of local disk storage installed in the compute nodes on Sapelo2, you can use the sinfo-gacrc command. The GRES column reports the total amount of local disk storage in GB; for example, lscratch:890 means that a total of 890GB of local disk storage is installed in the compute node(s). Detailed instructions about sinfo-gacrc can be found at Running_Jobs_on_Sapelo2#Discovering_if_a_partition_.28queue.29_is_busy


Back to Top

How to open an interactive session

An interactive session on a compute node can be started with the command

interact

This command, invoked without any arguments, will start an interactive session with one core on one of the interactive nodes, and allocate 2GB of memory for a maximum walltime of 12h. It is equivalent to the qlogin command that we used previously, and it runs

srun --pty  --cpus-per-task=1 --job-name=interact --ntasks=1 --nodes=1 --partition=inter_p --time=12:00:00 --mem=2GB /bin/bash -l

When the interact command is run, it will echo the equivalent srun command, so you can easily check the resources associated to your interactive session.

The interact command takes arguments that allow you to request cores, memory, walltime limit, specific node features, or a different partition and other resources.

The options that can be used with interact are displayed when this command is run with the -h or --help option:

[shtsai@ss-sub2 ~]$ interact -h

Usage: interact [OPTIONS]

Description: Start an interactive job

    -c, --cpus-per-task         CPU cores per task (default: 1)
    -J, --job-name              Job name (default: interact)
    -n, --ntasks                Number of tasks (default: 1)
    -N, --nodes                 Number of nodes (default: 1)
    -p, --partition             Partition for interactive job (default: inter_p)
    -q, --qos                   Request a quality of service for the job.
    -t, --time                  Maximum run time for interactive job (default: 12:00:00)
    -w, --nodelist              List of node name(s) on which your job should run
    --constraint                Job constraints
    --gres                      Generic consumable resources
    --mem                       Memory per node (default 2GB)
    --shell                     Absolute path to the shell to be used in your interactive job (default: /bin/bash)
    --wckey                     Wckey to be used with job
    --x11                       Start an interactive job with X Forwarding
    -h, --help                  Display this help output

Examples:

To start an interactive session with 4 cores and 10GB of memory:

interact -c 4 --mem=10G

To start an interactive session with 1 core, 10GB of memory and a walltime limit of 18 hours:

interact --mem=10G --time=18:00:00

To start an interactive session with 1 core, 2GB of memory, on a node that has an AMD EPYC Milan processor in the batch partition:

interact --constraint=Milan -p batch

To start an interactive session with 1 core, 50GB of memory, and an A100 GPU device:

interact -p gpu_p --gres=gpu:A100:1 --mem=50G

Back to Top

How to run an interactive job with Graphical User Interface capabilities

A number of software packages installed on GACRC clusters have X Window (GUI) front ends. Examples of such applications are Matlab, Mathematica, some text editors and debuggers, etc. The best way to run such applications is using the Open OnDemand (OOD) interface to Sapelo2, either by running an interactive application in OOD or by starting an X Desktop session on the cluster and running the application therein. More information is available at OnDemand.

If using OnDemand is not an option, and you want to run an application as an interactive job and have its graphical user interface displayed on your local machine, you need to enable X-forwarding when you ssh into the login node. For instructions on how to do this on Windows and macOS, please see questions 5.4 and 5.5 in the Frequently Asked Questions page. On a Linux machine, simply add the -X option when ssh-ing into Sapelo2.

After setting up an X-forwarding terminal on your local machine, start an interactive session, but add the option --x11 to the interact command.
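Before starting the session, it can help to confirm that X forwarding is actually active in your login shell. The sketch below assumes that ssh with X forwarding sets the DISPLAY environment variable, which is the usual behavior; the function name x11_ready is only illustrative and is not part of the cluster tooling.

```shell
# Return success if this shell has an X display to forward to.
# ssh -X normally sets DISPLAY (for example to localhost:10.0).
x11_ready() {
    [ -n "${DISPLAY:-}" ]
}

if x11_ready; then
    echo "X forwarding looks active (DISPLAY=${DISPLAY}); interact --x11 should be able to open windows."
else
    echo "DISPLAY is not set; reconnect with X forwarding enabled (e.g. ssh -X) before running interact --x11."
fi
```

An empty DISPLAY is the most common reason a GUI application fails to open a window from an interactive job.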

An interactive session on a compute node, with X forwarding enabled, can be started with the command

interact --x11

This command will start an interactive session, with X forwarding enabled, with one core on one of the interactive nodes, and allocate 2GB of memory for a maximum walltime of 12h.

The interact --x11 command is an alias for

srun --pty --x11 --cpus-per-task=1 --job-name=interact --ntasks=1 --nodes=1 --partition=inter_p --time=12:00:00 --mem=2GB /bin/bash -l

The options available to interact, described in the previous section, can be used along with the --x11 option.
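As a rough illustration of how option values map onto the srun command shown above, the sketch below prints the srun line for a few overridden values. It is a hypothetical helper (x11_srun_cmd is not a cluster command) and assumes that interact passes the values straight through to srun; the command echoed by interact itself remains the authoritative version.

```shell
# Illustrative helper: print the srun command corresponding to
# `interact --x11`, optionally overriding cores, memory and walltime.
# Defaults are the ones listed in the interact help output.
x11_srun_cmd() {
    local cpus="${1:-1}"
    local mem="${2:-2GB}"
    local walltime="${3:-12:00:00}"
    echo "srun --pty --x11 --cpus-per-task=${cpus} --job-name=interact --ntasks=1 --nodes=1 --partition=inter_p --time=${walltime} --mem=${mem} /bin/bash -l"
}

x11_srun_cmd                    # default values
x11_srun_cmd 4 10G 18:00:00     # 4 cores, 10GB of memory, 18 hours
```

The helper only prints the command; nothing is submitted to the queueing system.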


Back to Top


How to check on running or pending jobs

To list all running and pending jobs (by all users), use the command

squeue

or

squeue -l

This command can be used with many options. We have a wrapper for this command, called sq, that shows some quantities that are commonly of interest. To use the sq command to list all of your running and pending jobs, use

sq --me
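The plain squeue command can also be combined with standard shell tools. As a minimal sketch, the snippet below counts your own jobs by state. It uses the squeue options -u (user), -h (no header) and -o "%T" (print only the job state); the function name count_job_states is illustrative and not part of the cluster tooling.

```shell
# Summarize job states read from standard input, one state per line.
count_job_states() {
    awk 'NF { n[$1]++ } END { for (s in n) print s, n[s] }' | sort
}

# Only query the scheduler where squeue is available (i.e. on the cluster).
if command -v squeue >/dev/null 2>&1; then
    # -h drops the header line; -o "%T" prints only the job state.
    squeue -u "$(id -un)" -h -o "%T" | count_job_states
fi
```

Each output line holds a state name followed by the number of jobs in that state, for example one line for RUNNING and one for PENDING.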

For detailed information on how to monitor your jobs, please see Monitoring Jobs on Sapelo2.


Back to Top

How to cancel (delete) a running or pending job

To cancel one of your running or pending jobs, use the command

scancel <jobid>

For example, to cancel a job with Job ID 12345, use

scancel 12345

To cancel all of your jobs, use the command

scancel -u MyID

To cancel all of your pending jobs, use the command

scancel -t PENDING -u MyID

To cancel one or more jobs by job name, use the command

scancel --name <myJobName>

To cancel an element (index) of an array job, use the command

scancel <jobid>_<index>

For example, to cancel element 4 of an array job whose Job ID is 12345, use

scancel 12345_4
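Several elements of an array job can be cancelled in one call with the bracket range syntax described in the Slurm job array documentation, for example scancel 12345_[4-6]. The sketch below wraps this in a hypothetical helper, cancel_array_range, that by default only prints the command it would run; the DRY_RUN variable is an assumption of this example, not a Slurm feature.

```shell
# Hypothetical helper: cancel elements FIRST..LAST of array job JOBID.
# With DRY_RUN=1 (the default here) the command is only printed.
cancel_array_range() {
    local jobid="$1" first="$2" last="$3"
    local target="${jobid}_[${first}-${last}]"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "scancel ${target}"
    else
        scancel "${target}"
    fi
}

cancel_array_range 12345 4 6    # prints: scancel 12345_[4-6]
```

Setting DRY_RUN=0 before the call makes the helper run scancel for real, so check the printed command first.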

Back to Top

How to check resource utilization of a running or finished job

The following command can be used to show resource utilization by a running job or a job that has already completed:

sacct

This command can be used with many options. We have configured a command that shows some quantities that are commonly of interest, including the amount of memory used and the CPU time used by the jobs:

sacct-gacrc
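The plain sacct command can also be scripted. As a minimal sketch, the snippet below reports the peak memory use (MaxRSS) across the steps of one job, which can help when choosing a --mem value for later runs. It uses the sacct options -j, --noheader, --parsable2, --units=M and --format=JobID,MaxRSS, and assumes that MaxRSS values are then printed in megabytes with a trailing unit letter. The helper name peak_maxrss and the Job ID are illustrative.

```shell
# Print the largest MaxRSS value found in "JobID|MaxRSS" lines on standard input.
# Assumes all values share one unit (as with sacct --units=M); the unit suffix is ignored.
peak_maxrss() {
    awk -F'|' '{ v = $2 + 0; if (v > max) max = v } END { printf "%.2f\n", max + 0 }'
}

JOBID=12345   # hypothetical Job ID, replace with your own

# Only query the accounting database where sacct is available (i.e. on the cluster).
if command -v sacct >/dev/null 2>&1; then
    sacct -j "$JOBID" --noheader --parsable2 --units=M --format=JobID,MaxRSS | peak_maxrss
fi
```

Job steps that report no MaxRSS value are treated as zero, so the result reflects the step with the largest recorded memory use.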

For detailed information on how to monitor your jobs, please see Monitoring Jobs on Sapelo2.


Back to Top