Running Jobs on Sapelo2: Difference between revisions

Revision as of 14:34, 20 October 2020

This page is being written in preparation for switching the queueing system on Sapelo2 to Slurm. It is not applicable to Sapelo2 yet.

This page is applicable to the Slurm development cluster (Sap2test).

If you are a current Sapelo2 user, please refer to Running Jobs on Sapelo2 for instructions on how to run jobs on Sapelo2.

We expect all current users to spend some time getting familiar with the new environment so as to reduce the number of support queries after the switchover. Once Sapelo2 migrates to Slurm and the new cluster environment, job submission scripts and workflows based on Torque/Moab and the current software packages installed on Sapelo2 will no longer work.

Using the Queueing System

The login node for the Sap2test cluster should be used for text editing and job submission. No jobs should be run directly on the login node. Processes that use too much CPU or RAM on the login node may be terminated by GACRC staff, or automatically, in order to keep the cluster running properly. Jobs should be run using the Slurm queueing system, which should be used for both interactive and batch jobs.

Batch partitions (queues) defined on the Sap2test

There are different partitions defined on Sap2test. The Slurm queueing system refers to queues as partitions. Users are required to specify, in the job submission script or as job submission command line arguments, the partition and the resources needed by the job in order for it to be assigned to compute node(s) that have enough available resources (such as number of cores, amount of memory, GPU cards, etc). Please note that Slurm will not allow a job to be submitted if there are no resources matching your request. Please refer to Migrating from Torque to Slurm for more information about the Slurm queueing system.

The following partitions are defined on the Sap2test cluster:

Partition Name	Time limit	Max jobs	Notes
batch	7 days		Regular nodes.
batch-30d	30 days	2	Regular nodes. A given user can have up to one job running at a time here, plus one pending, or two pending and none running. A user's attempt to submit a third job into this partition will be rejected.
highmem_p			For high memory jobs
gpu_p	7 days		For GPU-enabled jobs.
gpu_30d_p	30 days	2	For GPU-enabled jobs. A given user can have up to one job running at a time here, plus one pending, or two pending and none running. A user's attempt to submit a third job into this partition will be rejected.
inter_p			Regular nodes, for interactive jobs.
name_p			Partitions that target different groups' buy-in nodes. The name string is specific to each group.

The table below summarizes the partitions (queues) defined and the compute nodes that they target:

Partition Name	Node Features	Node Number	Description	Memory for jobs	Notes
batch, batch_30d	AMD, Opteron, QDR	40	48-core, 128GB RAM, AMD Opteron, QDR IB interconnect	122GB	Regular nodes.
batch, batch_30d	AMD, EPYC, EDR	24	64-core, 128GB RAM, AMD EPYC, IB EDR interconnect	120GB	Regular nodes
batch, batch_30d	AMD, EPYC, EDR	6	32-core, 128GB RAM, AMD EPYC, IB EDR interconnect	120GB	Regular nodes
batch, batch_30d	AMD, Opteron, QDR	4	48-core, 256GB RAM, AMD Opteron, QDR IB interconnect	250GB	Regular nodes.
batch, batch_30d	Intel, Skylake, EDR	1	32-core, 192GB RAM, Intel Xeon Skylake, IB EDR interconnect	180GB	Regular nodes
batch, batch_30d	Intel, Broadwell, EDR	1	28-core, 64GB RAM, Intel Xeon Broadwell, IB EDR interconnect	58GB	Regular nodes
highmem_p	AMD, Opteron, QDR	4	48-core, 512GB RAM, AMD Opteron, IB QDR interconnect	500GB	For high memory jobs
highmem_p	AMD, EPYC, EDR	2	32-core, 512GB RAM, AMD EPYC, IB EDR interconnect	490GB	For high memory jobs
gpu_p, gpu_30d_p	GPU, P100, EDR	1	32-core, 192GB RAM, Intel Xeon Skylake, 1 NVIDIA P100 GPU, EDR IB interconnect	180GB	For GPU-enabled jobs.

You can check all partitions (queues) defined in the cluster with the command

sinfo

Job submission Scripts

Users are required to specify the number of cores, the amount of memory, the partition (queue) name, and the maximum wallclock time needed by the job.

Header lines

Basic job submission script

At a minimum, the job submission script needs to have the following header lines:

#!/bin/bash
#SBATCH --partition=batch
#SBATCH --job-name=test
#SBATCH --ntasks=1
#SBATCH --time=4:00:00
#SBATCH --mem=10G

Commands to run your application should be added after these header lines.
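As a quick sanity check before submitting, a small shell helper can confirm that a script specifies the partition, number of tasks, walltime, and memory. This is only a sketch: the function name check_headers is hypothetical (it is not a Slurm or GACRC tool), and it looks only for the long option forms used in the minimal example above, so a script that uses -n or --mem-per-cpu instead would be reported as incomplete.

```shell
#!/bin/bash
# check_headers SCRIPT
# Report which of the minimal #SBATCH header lines are absent from SCRIPT.
# Hypothetical helper for illustration; it is not part of Slurm.
check_headers() {
    local status=0 opt
    for opt in partition ntasks time mem; do
        # Look for a line such as "#SBATCH --partition=batch"
        if ! grep -Eq "^#SBATCH[[:space:]]+--${opt}=" "$1"; then
            echo "missing --${opt}"
            status=1
        fi
    done
    return "$status"
}
```

One possible use is to submit only when the check passes, e.g. check_headers sub.sh && sbatch sub.sh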

Header lines explained:

#!/bin/bash : use bash, the default Linux shell, to interpret the script
#SBATCH --partition=batch : specify the partition (queue) to run on, e.g. batch
#SBATCH --job-name=test : specify the job name, e.g. test
#SBATCH --ntasks=1 : specify the number of tasks (e.g. 1)
#SBATCH --time=4:00:00 : specify the maximum walltime of the job in the format D-HH:MM:SS (e.g. --time=1-00:00:00 for one day or --time=4:00:00 for 4 hours)
#SBATCH --mem=10G : specify the maximum memory per node required by the job (e.g. 10GB)
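To make the walltime formats concrete, the sketch below converts a --time value into seconds. It covers the forms "minutes", "minutes:seconds", "hours:minutes:seconds", "days-hours", "days-hours:minutes", and "days-hours:minutes:seconds". The function name slurm_time_to_seconds is hypothetical, and the helper is only an illustration of how the fields are read, not how Slurm itself parses them.

```shell
#!/bin/bash
# slurm_time_to_seconds SPEC
# Convert a Slurm-style walltime string to a number of seconds.
# Hypothetical helper for illustration; it is not part of Slurm.
slurm_time_to_seconds() {
    echo "$1" | awk '{
        days = 0; h = 0; m = 0; s = 0; rest = $0
        dash = index(rest, "-")
        if (dash) {                      # a "days-" prefix is present
            days = substr(rest, 1, dash - 1)
            rest = substr(rest, dash + 1)
        }
        n = split(rest, f, ":")
        if (dash || n == 3) {            # hours[:minutes[:seconds]]
            h = f[1]; m = f[2]; s = f[3]
        } else if (n == 2) {             # minutes:seconds
            m = f[1]; s = f[2]
        } else {                         # minutes only
            m = f[1]
        }
        print days * 86400 + h * 3600 + m * 60 + s
    }'
}

slurm_time_to_seconds 4:00:00      # four hours
slurm_time_to_seconds 1-00:00:00   # one day
```

Note that a bare number is read as minutes, so --time=30 requests 30 minutes, not 30 hours.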

Below are some of the most commonly used queueing system options to configure the job.

Options to request resources for the job

-t, --time=time

   Wall clock time limit of a job running on cluster. Acceptable formats include "minutes", "minutes:seconds", "hours:minutes:seconds", "days-hours", "days-hours:minutes", and "days-hours:minutes:seconds".

--mem=num

   Maximum amount of memory in MegaBytes per node required by the job. Different units can be specified using the suffix [K|M|G|T].

--mem-per-cpu=num

   Minimum amount of memory in MegaBytes per allocated CPU. Different units can be specified using the suffix [K|M|G|T].

-n, --ntasks=num

   Number of tasks to run. The default is one task per node. For use with distributed parallelism. See below.

-N, --nodes=num

   Number of nodes allocated to the job. Default is one node.

--ntasks-per-node=num

   Number of tasks invoked on each node. Meant to be used with the --nodes option. For use with distributed parallelism. See below.

-c, --cpus-per-task=ncpus

   Number of CPUs allocated to each task. For use with shared memory parallelism. See below.

-C, --constraint=<list>

   List of node features required by the job. Only nodes having features matching the job constraints will be used to satisfy the request. Multiple constraints may be specified with AND, OR, matching OR, resource counts, etc.

--gres=<list>

   A comma-delimited list of generic consumable resources. For example, to request one P100 GPU card: --gres=gpu:P100:1

Please try to request resources for your job as accurately as possible, because this allows your job to be dispatched to run at the earliest opportunity and it helps the system allocate resources efficiently to start as many jobs as possible, benefiting all users.
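As an illustration of the --constraint and --gres options, the header fragment below combines node features from the partition table above. It is a sketch rather than a recommended configuration: Slurm uses & for AND and | for OR between features, and the particular feature combinations chosen here are just examples.

```shell
# AND: only nodes that have both the EPYC and the EDR feature
#SBATCH --constraint=EPYC&EDR

# Alternative (use instead of the line above, not together with it).
# OR: nodes that have either the EPYC or the Skylake feature
#   #SBATCH --constraint=EPYC|Skylake

# One P100 GPU card; intended for use with a GPU partition such as gpu_p
#SBATCH --gres=gpu:P100:1

# On the command line the same constraint needs quotes, because & and |
# are special characters for the shell:
#   sbatch --constraint="EPYC&EDR" sub.sh
```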

Options to manage job notification and output

-J, --job-name=jobname

   Specify a name for the job. The specified name will appear along with the job id number when querying running jobs on the system. The default is the name of the batch script. Within the job, $SLURM_JOB_NAME expands to the job name.

-o, --output=path/for/stdout

   Send stdout to path/for/stdout. The default filename is slurm-${SLURM_JOB_ID}.out, e.g. slurm-12345.out, in the directory from which the job was submitted.

-e, --error=path/for/stderr

   Send stderr to path/for/stderr. If --error is not specified, both stdout and stderr will be directed to the file specified by --output.

--mail-user=username@uga.edu

   Send email notification to the address you specified when certain events occur.

--mail-type=type

   Notify user by email when certain event types occur. Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL, TIME_LIMIT, TIME_LIMIT_90 (reached 90 percent of time limit), TIME_LIMIT_80 and TIME_LIMIT_50.

Options to set Array Jobs

If you wish to run an application binary or script using e.g. different input files, then you might find it convenient to use an array job. To create an array job with e.g. 10 elements, use

#SBATCH -a 0-9

or

#SBATCH --array=0-9

The ID of each element in an array job, i.e., job array index value, is stored in SLURM_ARRAY_TASK_ID. SLURM_ARRAY_JOB_ID will be set to the first job ID of the array. SLURM_ARRAY_TASK_COUNT will be set to the number of tasks in the job array. SLURM_ARRAY_TASK_MAX will be set to the highest job array index value. SLURM_ARRAY_TASK_MIN will be set to the lowest job array index value. Each array job element runs as an independent job, so multiple array elements can run concurrently, if resources are available. For example:

sbatch --array=1-3 -N1 sub.sh

will generate a job array containing three jobs. If the sbatch command responds
Submitted batch job 36
then the environment variables will be set as follows:

SLURM_JOB_ID=36
SLURM_ARRAY_JOB_ID=36
SLURM_ARRAY_TASK_ID=1
SLURM_ARRAY_TASK_COUNT=3
SLURM_ARRAY_TASK_MAX=3
SLURM_ARRAY_TASK_MIN=1

SLURM_JOB_ID=37
SLURM_ARRAY_JOB_ID=36
SLURM_ARRAY_TASK_ID=2
SLURM_ARRAY_TASK_COUNT=3
SLURM_ARRAY_TASK_MAX=3
SLURM_ARRAY_TASK_MIN=1

SLURM_JOB_ID=38
SLURM_ARRAY_JOB_ID=36
SLURM_ARRAY_TASK_ID=3
SLURM_ARRAY_TASK_COUNT=3
SLURM_ARRAY_TASK_MAX=3
SLURM_ARRAY_TASK_MIN=1

Most Slurm commands recognize the SLURM_ARRAY_JOB_ID plus SLURM_ARRAY_TASK_ID values separated by an underscore as identifying an element of a job array. In the example above, 36_2 and 37 would be equivalent ways to identify the second array element of array job 36.
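When the input files do not follow a numbered naming scheme, one common approach is to list them in a text file and let each array element pick the line that matches its index. The sketch below assumes a list file with one path per line; the names pick_input and filelist.txt are hypothetical.

```shell
#!/bin/bash
# pick_input LIST INDEX
# Print line number INDEX (1-based) of the text file LIST.
# Hypothetical helper for illustration.
pick_input() {
    sed -n "${2}p" "$1"
}

# Inside a job submitted with, e.g., --array=1-3 one might write:
#   input=$(pick_input filelist.txt "$SLURM_ARRAY_TASK_ID")
#   ./a.out < "$input"
```

Because the index is 1-based, the array range should start at 1 and end at the number of lines in the list.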

Option to set job dependency

You can set a job dependency with the option -d or --dependency=dependency-list. For example, if you want a job to start only after jobs 1234 and 1235 have successfully executed (ran to completion with an exit code of zero), you can add the following header line to its job submission script:

#SBATCH --dependency=afterok:1234:1235

Having this header line in the job submission script will ensure that the job is only dispatched to run after jobs 1234 and 1235 have completed successfully.

You can also use the following header line to specify that a job starts to run after jobs 1236 and 1237 have started or been cancelled:

#SBATCH --dependency=after:1236:1237
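The dependency can also be given on the sbatch command line, which is convenient when the job IDs are only known at submission time. The sketch below builds an afterok dependency string from a list of job IDs. The function name afterok_list and the script names step1.sh and step2.sh are hypothetical; the commented usage relies on sbatch's --parsable option, which prints just the job ID.

```shell
#!/bin/bash
# afterok_list JOBID [JOBID ...]
# Print a dependency string such as "afterok:1234:1235".
# Hypothetical helper for illustration.
afterok_list() {
    local dep="afterok" id
    for id in "$@"; do
        dep="$dep:$id"
    done
    echo "$dep"
}

# Possible use on the cluster (not run here):
#   first=$(sbatch --parsable step1.sh)
#   sbatch --dependency="$(afterok_list "$first")" step2.sh
```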

Environment Variables exported by batch jobs

When a batch job is started, a number of variables are introduced into the job's environment that can be used by the batch script in making decisions, creating output files, and so forth. Some of these variables are listed in the following table:

Variable	Description
SLURM_ARRAY_JOB_ID	Job array's master job ID number, i.e., the first Slurm job id of a job array
SLURM_ARRAY_TASK_COUNT	Total number of tasks (elements) in a job array
SLURM_ARRAY_TASK_ID	Job array ID (index) number
SLURM_ARRAY_TASK_MAX	Job array's maximum ID (index) number
SLURM_ARRAY_TASK_MIN	Job array's minimum ID (index) number
SLURM_CPUS_ON_NODE	Number of CPUs on the allocated node
SLURM_CPUS_PER_TASK	Number of CPUs requested per task. Only set if the --cpus-per-task option is specified
SLURM_JOB_ID	Unique Slurm job id
SLURM_JOB_NAME	Job name
SLURM_JOB_CPUS_PER_NODE	Count of processors available to the job on this node
SLURM_JOB_NODELIST	List of nodes allocated to the job
SLURM_JOB_NUM_NODES	Total number of nodes in the job's resource allocation
SLURM_JOB_PARTITION	Name of the partition (i.e. queue) in which the job is running
SLURM_MEM_PER_NODE	Same as --mem
SLURM_MEM_PER_CPU	Same as --mem-per-cpu
SLURM_NTASKS	Same as -n, --ntasks
SLURM_NTASKS_PER_NODE	Number of tasks requested per node. Only set if the --ntasks-per-node option is specified
SLURM_SUBMIT_DIR	The directory from which sbatch was invoked
SLURM_SUBMIT_HOST	The hostname of the computer from which sbatch was invoked
SLURM_TASK_PID	The process ID of the task being started
SLURMD_NODENAME	Name of the node running the job script
CUDA_VISIBLE_DEVICES	GPU device ID(s) assigned to the job
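As a brief illustration of how these variables might be used inside a job script, the sketch below sets the thread count from SLURM_CPUS_PER_TASK and prints a one-line job summary. The function name threads_for_job is hypothetical. Since SLURM_CPUS_PER_TASK is only set when --cpus-per-task is specified, the helper falls back to 1, and the other variables fall back to N/A so that the snippet also runs outside of a job.

```shell
#!/bin/bash
# Number of threads to use: the --cpus-per-task value when it was requested,
# otherwise 1 (SLURM_CPUS_PER_TASK is only set if --cpus-per-task is given).
# Hypothetical helper for illustration.
threads_for_job() {
    echo "${SLURM_CPUS_PER_TASK:-1}"
}

export OMP_NUM_THREADS=$(threads_for_job)

# Log where the job is running; useful when reading output files later.
echo "Job ${SLURM_JOB_ID:-N/A} on node ${SLURMD_NODENAME:-N/A}," \
     "partition ${SLURM_JOB_PARTITION:-N/A}, threads: $OMP_NUM_THREADS"
```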

Sample job submission scripts

Serial (single-processor) Job

Sample job submission script (sub.sh) to run an R program called add.R using a single core:

#!/bin/bash
#SBATCH --job-name=testserial         # Job name
#SBATCH --partition=batch             # Partition (queue) name
#SBATCH --ntasks=1                    # Run on a single CPU
#SBATCH --mem=1gb                     # Job memory request
#SBATCH --time=02:00:00               # Time limit hrs:min:sec
#SBATCH --output=testserial.%j.out    # Standard output log
#SBATCH --error=testserial.%j.err     # Standard error log

#SBATCH --mail-type=END,FAIL          # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=username@uga.edu  # Where to send mail	

cd $SLURM_SUBMIT_DIR

module load R/3.6.2-foss-2019b

R CMD BATCH add.R

In this sample script, the standard output and standard error of the job will be saved into files called testserial.%j.out and testserial.%j.err, respectively, where %j will be automatically replaced by the job id of the job.

Serial (single-processor) Job on an AMD EPYC processor

Sample job submission script (sub.sh) to run an R program called add.R using a single core:

#!/bin/bash
#SBATCH --job-name=testserial         # Job name
#SBATCH --partition=batch             # Partition (queue) name
#SBATCH --constraint=EPYC             # node feature
#SBATCH --ntasks=1                    # Run on a single CPU
#SBATCH --mem=1gb                     # Job memory request
#SBATCH --time=02:00:00               # Time limit hrs:min:sec
#SBATCH --output=testserial.%j.out    # Standard output log
#SBATCH --error=testserial.%j.err     # Standard error log

#SBATCH --mail-type=END,FAIL          # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=username@uga.edu  # Where to send mail	

cd $SLURM_SUBMIT_DIR

module load R/3.6.2-foss-2019b

R CMD BATCH add.R

In this sample script, the standard output and standard error of the job will be saved into files called testserial.%j.out and testserial.%j.err, respectively, where %j will be automatically replaced by the job id of the job.

MPI Job

Sample job submission script (sub.sh) to run an OpenMPI application. In this example the job requests 16 cores and further specifies that these 16 cores need to be divided equally on 2 nodes (8 cores per node) and the binary is called mympi.exe:

#!/bin/bash
#SBATCH --job-name=mpitest            # Job name
#SBATCH --partition=batch             # Partition (queue) name
#SBATCH --nodes=2                     # Number of nodes
#SBATCH --ntasks=16                   # Number of MPI ranks
#SBATCH --ntasks-per-node=8           # How many tasks on each node
#SBATCH --cpus-per-task=1             # Number of cores per MPI rank 
#SBATCH --mem-per-cpu=600mb           # Memory per processor
#SBATCH --time=02:00:00               # Time limit hrs:min:sec
#SBATCH --output=mpitest.%j.out       # Standard output log
#SBATCH --error=mpitest.%j.err        # Standard error log

#SBATCH --mail-type=END,FAIL          # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=username@uga.edu  # Where to send mail

cd $SLURM_SUBMIT_DIR

module load OpenMPI/3.1.4-GCC-8.3.0

mpirun ./mympi.exe

Please note that you need to start the application with mpirun or mpiexec, and not with srun.
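When --mem-per-cpu is used, the memory the job takes up on each node depends on how many cores land on that node. A rough estimate, assuming the tasks are spread evenly as requested with --ntasks-per-node, is tasks per node times cores per task times memory per core. The helper name mem_per_node_mb below is hypothetical; the result can be compared against the "Memory for jobs" column in the node table.

```shell
#!/bin/bash
# mem_per_node_mb TASKS_PER_NODE CPUS_PER_TASK MEM_PER_CPU_MB
# Estimate the memory (in MB) a job requests on each node.
# Hypothetical helper for illustration.
mem_per_node_mb() {
    echo $(( $1 * $2 * $3 ))
}

mem_per_node_mb 8 1 600     # MPI sample above: 8 ranks x 1 core x 600 MB
mem_per_node_mb 4 3 2000    # hybrid sample further below: 4 ranks x 3 cores x 2000 MB
```

Both estimates are well below the 120GB available for jobs on the regular EPYC nodes.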

MPI Job on nodes connected via the EDR IB fabric

Sample job submission script (sub.sh) to run an OpenMPI application. In this example the job requests 16 cores and further specifies that these 16 cores need to be divided equally on 2 nodes (8 cores per node) and the binary is called mympi.exe:

#!/bin/bash
#SBATCH --job-name=mpitest            # Job name
#SBATCH --partition=batch             # Partition (queue) name
#SBATCH --constraint=EDR              # node feature
#SBATCH --nodes=2                     # Number of nodes
#SBATCH --ntasks=16                   # Number of MPI ranks
#SBATCH --ntasks-per-node=8           # How many tasks on each node
#SBATCH --cpus-per-task=1             # Number of cores per MPI rank 
#SBATCH --mem-per-cpu=600mb           # Memory per processor
#SBATCH --time=02:00:00               # Time limit hrs:min:sec
#SBATCH --output=mpitest.%j.out       # Standard output log
#SBATCH --error=mpitest.%j.err        # Standard error log

#SBATCH --mail-type=END,FAIL          # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=username@uga.edu  # Where to send mail

cd $SLURM_SUBMIT_DIR

module load OpenMPI/3.1.4-GCC-8.3.0

mpirun ./mympi.exe

Please note that you need to start the application with mpirun or mpiexec, and not with srun.

OpenMP (Multi-Thread) Job

Sample job submission script (sub.sh) to run a program that uses OpenMP with 6 threads. Please set --ntasks=1 and set --cpus-per-task to the number of threads you wish to use. The name of the binary in this example is a.out.

#!/bin/bash
#SBATCH --job-name=mctest             # Job name
#SBATCH --partition=batch             # Partition (queue) name
#SBATCH --ntasks=1                    # Run a single task	
#SBATCH --cpus-per-task=6             # Number of CPU cores per task
#SBATCH --mem=4gb                     # Job memory request
#SBATCH --time=02:00:00               # Time limit hrs:min:sec
#SBATCH --output=mctest.%j.out        # Standard output log
#SBATCH --error=mctest.%j.err         # Standard error log

#SBATCH --mail-type=END,FAIL          # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=username@uga.edu  # Where to send mail	

cd $SLURM_SUBMIT_DIR

export OMP_NUM_THREADS=6  

module load foss/2019b  # load the appropriate module file, e.g. foss/2019b

time ./a.out

High Memory Job

Sample job submission script (sub.sh) to run a velvet application that needs to use 200GB of memory and 4 threads:

#!/bin/bash
#SBATCH --job-name=highmemtest        # Job name
#SBATCH --partition=highmem_p         # Partition (queue) name
#SBATCH --ntasks=1                    # Run a single task	
#SBATCH --cpus-per-task=4             # Number of CPU cores per task
#SBATCH --mem=200gb                   # Job memory request
#SBATCH --time=02:00:00               # Time limit hrs:min:sec
#SBATCH --output=highmemtest.%j.out   # Standard output log
#SBATCH --error=highmemtest.%j.err    # Standard error log

#SBATCH --mail-type=END,FAIL          # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=username@uga.edu  # Where to send mail	

cd $SLURM_SUBMIT_DIR

export OMP_NUM_THREADS=4

module load Velvet

velvetg [options]

Hybrid MPI/shared-memory using OpenMPI

Sample job submission script (sub.sh) to run a parallel job that uses 8 MPI processes with OpenMPI (4 on each of 2 nodes), where each MPI process runs with 3 threads:

#!/bin/bash
#SBATCH --job-name=hybridtest
#SBATCH --partition=batch             # Partition (queue) name
#SBATCH --nodes=2                     # Number of nodes
#SBATCH --ntasks=8                    # Number of MPI ranks
#SBATCH --ntasks-per-node=4           # Number of MPI ranks per node
#SBATCH --cpus-per-task=3             # Number of OpenMP threads for each MPI process/rank
#SBATCH --mem-per-cpu=2000mb          # Per processor memory request
#SBATCH --time=2-00:00:00             # Walltime in hh:mm:ss or d-hh:mm:ss (2 days in the example)
#SBATCH --output=hybridtest.%j.out    # Standard output log
#SBATCH --error=hybridtest.%j.err     # Standard error log
 
#SBATCH --mail-type=END,FAIL          # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=username@uga.edu  # Where to send mail	

cd $SLURM_SUBMIT_DIR
 
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

mpirun ./myhybridprog.exe

Array job

Sample job submission script (sub.sh) to submit an array job with 10 elements. In this example, each array job element will run the a.out binary using an input file called input_0, input_1, ..., input_9.

#!/bin/bash
#SBATCH --job-name=arrayjobtest       # Job name
#SBATCH --partition=batch             # Partition (queue) name
#SBATCH --ntasks=1                    # Run a single task
#SBATCH --mem=1gb                     # Job Memory
#SBATCH --time=10:00:00               # Time limit hrs:min:sec
#SBATCH --output=array_%A-%a.out      # Standard output log
#SBATCH --error=array_%A-%a.err       # Standard error log
#SBATCH --array=0-9                   # Array range

cd $SLURM_SUBMIT_DIR

module load foss/2019b # load any needed module files, e.g. foss/2019b

time ./a.out < input_$SLURM_ARRAY_TASK_ID

Singularity job

Sample job submission script (sub.sh) to run a program (e.g. sortmerna) using a singularity container:

#!/bin/bash
#SBATCH --job-name=j_sortmerna        # Job name
#SBATCH --partition=batch             # Partition (queue) name
#SBATCH --ntasks=1                    # Run a single task
#SBATCH --mem=1gb                     # Job memory request
#SBATCH --time=02:00:00               # Time limit hrs:min:sec
#SBATCH --output=sortmerna.%j.out     # Standard output log
#SBATCH --error=sortmerna.%j.err      # Standard error log
#SBATCH --cpus-per-task=4             # Number of CPU cores per task
#SBATCH --mail-type=END,FAIL          # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=username@uga.edu  # Where to send mail	

cd $SLURM_SUBMIT_DIR

singularity exec /apps/singularity-images/sortmerna-3.0.3.simg sortmerna \
--threads 4 --ref db.fasta,db.idx --reads file.fa --aligned base_name_output

For more information about software installed as singularity containers on the cluster, please see Software_on_sap2test#Singularity_Containers

GPU/CUDA

Sample script to run Amber on a GPU node using one node, 2 CPU cores, and 1 GPU card:

#!/bin/bash
#SBATCH --job-name=amber              # Job name
#SBATCH --partition=gpu_p             # Partition (queue) name
#SBATCH --gres=gpu:1                  # Requests one GPU device 
#SBATCH --ntasks=1                    # Run a single task	
#SBATCH --cpus-per-task=2             # Number of CPU cores per task
#SBATCH --mem=40gb                    # Job memory request
#SBATCH --time=10:00:00               # Time limit hrs:min:sec
#SBATCH --output=amber.%j.out         # Standard output log
#SBATCH --error=amber.%j.err          # Standard error log

#SBATCH --mail-type=END,FAIL          # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=username@uga.edu  # Where to send mail	

cd $SLURM_SUBMIT_DIR

ml Amber/18-fosscuda-2018b-AmberTools-18-patchlevel-10-8

mpiexec $AMBERHOME/bin/pmemd.cuda -O -i ./prod.in -o prod_c4-23.out  -p ./dimerFBP_GOL.prmtop -c ./restart.rst -r prod.rst -x prod.mdcrd

You can use the option #SBATCH --gres=gpu:K40:1 or #SBATCH --gres=gpu:P100:1 to specify using a K40 or a P100 GPU device, respectively. The compute mode of the GPU will be set to Default.

How to submit a job to the batch queue

With the resource requirements specified in the job submission script (sub.sh), submit your job with

sbatch <scriptname>

For example

sbatch sub.sh

Once the job is submitted, the Job ID of the job (e.g. 12345) will be printed on the screen.
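If the job ID is needed later in a script (for example to monitor or cancel the job), it can be extracted from the confirmation line that sbatch prints, which has the form "Submitted batch job 12345". The sketch below is one way to do this; the function name job_id_from_sbatch is hypothetical.

```shell
#!/bin/bash
# job_id_from_sbatch
# Read sbatch's confirmation line on stdin and print the job ID,
# i.e. the fourth word of "Submitted batch job <jobid>".
# Hypothetical helper for illustration.
job_id_from_sbatch() {
    awk '/^Submitted batch job/ { print $4 }'
}

# Possible use on the cluster (not run here):
#   jobid=$(sbatch sub.sh | job_id_from_sbatch)
#   echo "Submitted job $jobid"
```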

Discovering if a partition (queue) is busy

The nodes allocated to each partition (queue) and their state can be viewed with the command

sinfo

Sample output of the sinfo command:

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST 
batch*       up   08:00:00      1 drain* ra4-2 
batch*       up   08:00:00      3  down* d4-7,ra3-19,ra4-12 
batch*       up   08:00:00      1    mix b1-2 
batch*       up   08:00:00      1  alloc b1-3 
batch*       up   08:00:00     53   idle b1-[4-24],c1-3,c5-19,d4-[5-6,8-12],ra3-[1-18,20-24]
gpu_p        up   08:00:00      1    mix c4-23 
highmem_p    up   08:00:00      6   idle d4-[11-12],ra4-[21-24] 
inter_p      up   08:00:00      2   idle ra4-[16-17]

where some common values of STATE are:

STATE=idle indicates that those nodes are completely free.
STATE=mix indicates that some cores on those nodes are in use (and some are free).
STATE=alloc indicates that all cores on those nodes are in use.
STATE=drain indicates that nodes are draining and not accepting new jobs.
STATE=down indicates that nodes are not running or accepting new jobs.
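To get a quick count of free nodes in one partition, the sinfo output can be filtered with awk. The sketch below assumes the default column layout shown in the sample output above (partition in the first column, node count in the fourth, state in the fifth); the function name idle_nodes is hypothetical.

```shell
#!/bin/bash
# idle_nodes PARTITION
# Read sinfo output on stdin and print the number of idle nodes in PARTITION.
# Hypothetical helper for illustration; assumes the default sinfo columns.
idle_nodes() {
    awk -v p="$1" '{
        name = $1
        sub(/\*$/, "", name)          # the default partition is marked with "*"
        if (name == p && $5 == "idle") total += $4
    } END { print total + 0 }'
}

# Possible use on the cluster (not run here):
#   sinfo | idle_nodes batch
```

With the sample output above, this would report 53 idle nodes for the batch partition.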

The sinfo command can be used with many options. We have configured one option that shows some quantities that are commonly of interest, including the node features defined for each node. This command is

sinfo-gacrc

You can also specify the number of characters displayed in the NODELIST column (e.g. 40) and in the AVAIL_FEATURES column (e.g. 50), with

sinfo-gacrc 40 50

Sample output of the sinfo-gacrc command:

PARTITION       NODELIST           STATE      CPUS  MEMORY   AVAIL_FEATURES        GRES       
batch*          ra4-2              drained*   32    126000   AMD,Opteron,QDR      (null)         
batch*          ra3-19             down*      32    126000   AMD,Opteron,QDR      (null)     
batch*          ra4-12             down*      32    126000   AMD,Opteron,QDR      (null)     
batch*          b1-3               mixed      64    126976   AMD,EPYC,EDR         (null)     
batch*          b1-2               allocated  64    126976   AMD,EPYC,EDR         (null)     
batch*          b1-[4-24]          idle       64    126976   AMD,EPYC,EDR         (null)     
batch*          c1-3               idle       28    59127    Intel,Broadwell,EDR  (null)     
batch*          c5-19              idle       32    187868   Intel,Skylake,EDR    (null)     
batch*          d4-[5-6]           idle       32    126976   AMD,EPYC,EDR         (null)     
batch*          d4-[8-12]          idle       32    126976+  AMD,EPYC,EDR         (null)     
batch*          ra3-[1-18,20-24]   idle       32    126000   AMD,Opteron,QDR      (null)        
gpu_p           c4-23              idle       32    187868   Intel,Skylake,EDR    gpu:P100:1 
highmem_p       d4-[11-12]         idle       32    514048   AMD,EPYC,EDR         (null)     
highmem_p       ra4-[21-24]        idle       32    126000   AMD,Opteron,QDR      (null)     
inter_p         ra4-[16-17]        idle       32    126000   AMD,Opteron,QDR      (null)

How to open an interactive session

An interactive session on a compute node can be started with the command

qlogin

This command will start an interactive session with one core on one of the interactive nodes, and allocate 2GB of memory for a maximum walltime of 12h.

The qlogin command is an alias for

srun --pty  -p inter_p  --mem=2G --nodes=1 --ntasks-per-node=1 --time=12:00:00 --job-name=qlogin --export=TERM /bin/bash
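If the defaults of qlogin are not suitable, a similar srun command can be written out by hand with different values. The sketch below only prints such a command line, modeled on the alias above, so it can be inspected before use. The function name interactive_cmd and the job name "interactive" are hypothetical, and whether the interactive partition accepts a given memory, core, or time request depends on the cluster configuration.

```shell
#!/bin/bash
# interactive_cmd MEM CORES TIME
# Print (not run) an srun command line modeled on the qlogin alias.
# Hypothetical helper for illustration.
interactive_cmd() {
    echo "srun --pty -p inter_p --mem=$1 --nodes=1 --ntasks-per-node=1 --cpus-per-task=$2 --time=$3 --job-name=interactive --export=TERM /bin/bash"
}

interactive_cmd 8G 4 2:00:00
```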

How to run an interactive job with Graphical User Interface capabilities

If you want to run an application as an interactive job and have its graphical user interface displayed on the terminal of your local machine, you need to enable X-forwarding when you ssh into the login node. For information on how to do this, please see questions 10 and 11 in the Frequently Asked Questions page.

Then start an interactive session, but add the option --x11 to the srun command.

An interactive session on a compute node, with X forwarding enabled, can be started with the command

xqlogin

This command will start an interactive session, with X forwarding enabled, with one core on one of the interactive nodes, and allocate 2GB of memory for a maximum walltime of 12h.

The xqlogin command is an alias for

srun --pty  -p inter_p  --mem=2G --nodes=1 --ntasks-per-node=1 --time=12:00:00 --x11 --job-name=xqlogin --export=TERM,DISPLAY /bin/bash

How to check on running or pending jobs

To list all running and pending jobs (by all users), use the command

squeue

or

squeue -l

For detailed information on how to monitor your jobs, please see Monitoring Jobs on Sap2test.

How to cancel (delete) a running or pending job

To cancel one of your running or pending jobs, use the command

scancel <jobid>

For example, to cancel a job with Job ID 12345 use

scancel 12345

To cancel all of your jobs, use the command

scancel -u MyID

To cancel all of your pending jobs, use the command

scancel -t PENDING -u MyID

To cancel one or more jobs by job name, use the command

scancel --name <myJobName>

To cancel an element (index) of an array job, use the command

scancel <jobid>_<index>

For example, to cancel array job element 4 of an array job whose Job ID is 12345 use

scancel 12345_4

How to check resource utilization of a running or finished job

The following command can be used to show resource utilization by a running job or a job that has already completed:

sacct

This command can be used with many options. We have configured one option that shows some quantities that are commonly of interest, including the amount of memory used and the cputime used by the jobs:

sacct-gacrc

For detailed information on how to monitor your jobs, please see Monitoring Jobs on Sap2test.
