Array Jobs

From Research Computing Center Wiki
Revision as of 17:00, 17 June 2021

Introduction

An array job is a collection of jobs (called array job "elements") initiated from a single submission script. Array jobs work well for problems that are embarrassingly parallel, meaning the problem can easily be split into concurrently running tasks that do not depend on one another. Imagine you have 10 input files that you want to perform the same action(s) against. Rather than looping through the inputs one at a time or writing 10 almost identical submission scripts, you can write and submit a single array job submission script.

Example Submission Scripts

Numbered Input Files

Writing an array job submission script is hardly different from writing any other type of Slurm submission script. The two key things to remember are the Slurm array header (#SBATCH --array) and the SLURM_ARRAY_TASK_ID environment variable. Below is an array job submission script in which 5 input files are passed as arguments to myScript.R, one per array job element, assuming the input files are named myinput-1, myinput-2, myinput-3, and so on.

#!/bin/bash

#SBATCH --job-name=array-test
#SBATCH --partition=batch
#SBATCH --ntasks=1
#SBATCH --mem=20gb
#SBATCH --time=1:00:00
#SBATCH --array=1-5

ml R/4.0.0-foss-2019b

Rscript myScript.R myinput-${SLURM_ARRAY_TASK_ID}

Submitting the above script would create five array job elements as shown below:

bc06026@b1-24 arraytest$ sbatch sub.sh
Submitted batch job 3341751
bc06026@b1-24 arraytest$ squeue --me
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
         3341751_1     batch array-te  bc06026  R       0:08      1 c4-9
         3341751_2     batch array-te  bc06026  R       0:08      1 c4-21
         3341751_3     batch array-te  bc06026  R       0:08      1 c4-21
         3341751_4     batch array-te  bc06026  R       0:08      1 c4-21
         3341751_5     batch array-te  bc06026  R       0:08      1 c4-11

As you can see in the squeue --me output, by submitting this one submission script, we have 5 jobs running concurrently. Each one of these jobs is allocated the resources requested in the submission script and is running the commands:

ml R/4.0.0-foss-2019b

Rscript myScript.R myinput-${SLURM_ARRAY_TASK_ID}

with ${SLURM_ARRAY_TASK_ID} being replaced by one of the numbers in the range defined in the --array Slurm header. This ensures each array job element is working on one unique input file.
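Before submitting, it can help to preview which input file each array job element would receive. The following is a minimal sketch that can be run on a login node without Slurm; preview_inputs is a hypothetical helper written for this example, and it sets SLURM_ARRAY_TASK_ID by hand, whereas in a real array job Slurm sets that variable for you.

```shell
#!/bin/bash

# preview_inputs ID...: print the command each array job element would run.
# Hypothetical helper for a dry run; it is not part of Slurm.
preview_inputs() {
    # Keep the variable local so the calling shell's environment is untouched.
    local SLURM_ARRAY_TASK_ID
    for SLURM_ARRAY_TASK_ID in "$@"; do
        # Same substitution the submission script performs, printed instead of executed.
        echo "Rscript myScript.R myinput-${SLURM_ARRAY_TASK_ID}"
    done
}

# Preview the five elements created by #SBATCH --array=1-5.
preview_inputs 1 2 3 4 5
```

Each printed line corresponds to one array job element, so a missing or misnamed input file is easy to spot before any resources are requested.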


The #SBATCH --array range does not have to be a contiguous range of numbers. For example, in the following submission script we define the array element indexes to be 1, 3, and 5:

#!/bin/bash

#SBATCH --job-name=array-test
#SBATCH --ntasks=1
#SBATCH --partition=batch
#SBATCH --mem=20gb
#SBATCH --time=1:00:00
#SBATCH --array=1,3,5

ml R/4.0.0-foss-2019b

Rscript myScript.R myinput-${SLURM_ARRAY_TASK_ID}

Submitting the above script gives us the following three jobs as seen from squeue --me:

bc06026@b1-24 arraytest$ sbatch sub.sh
Submitted batch job 3379775
bc06026@b1-24 arraytest$ squeue --me
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
         3379775_1     batch array-te  bc06026  R       0:09      1 c4-9
         3379775_3     batch array-te  bc06026  R       0:09      1 c4-9
         3379775_5     batch array-te  bc06026  R       0:09      1 c4-9

The same set of indexes can also be defined with step syntax, such as #SBATCH --array=1-5:2. Defining the index range that way starts at 1 and counts up by 2 until it reaches 5, giving the indexes 1, 3, and 5.
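To check which indexes a stepped range covers before submitting, the range can be expanded locally with seq. This is only a sketch: expand_range is a hypothetical helper, and it assumes a simple first-last:step range. Note that seq takes its arguments in the order first, step, last, which differs from the order used in the Slurm range.

```shell
#!/bin/bash

# expand_range FIRST LAST STEP: list the indexes that a range written as
# FIRST-LAST:STEP would cover. Hypothetical helper, not part of Slurm.
expand_range() {
    # seq expects FIRST STEP LAST, so the last two arguments are swapped.
    seq "$1" "$3" "$2"
}

# Indexes covered by a range of 1-5:2, printed one per line.
expand_range 1 5 2
```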

Non-Numbered Input Files

Sometimes it will make more sense for the names of your input files not to follow a numbered naming scheme. In this scenario, input files can be mapped to a SLURM_ARRAY_TASK_ID by creating a separate file in your working directory that contains just a list of your input files, one file name per line. Then awk can be used to map each file name's line number to SLURM_ARRAY_TASK_ID. For example, say we have three files that we want to distribute among three array job elements. We can create a file called input.lst, listing each input file, one per line:

testdata.txt
aninputfile.txt
moredata.txt

Then in our submission script, we can reference each file name like this:

#!/bin/bash

#SBATCH --job-name=array-test
#SBATCH --ntasks=1
#SBATCH --partition=batch
#SBATCH --mem=20gb
#SBATCH --time=1:00:00
#SBATCH --array=1-3

ml R/4.0.0-foss-2019b

file=$(awk "NR==${SLURM_ARRAY_TASK_ID}" input.lst)

Rscript myScript.R "$file"

In the above submission script, ${SLURM_ARRAY_TASK_ID} is replaced, for each array job element, by one of the integers in the range defined by #SBATCH --array. The awk built-in variable NR holds the number of the line currently being read, so the condition NR==${SLURM_ARRAY_TASK_ID} prints only the line of input.lst whose line number matches the array index. This ensures each array job element gets a single unique input file. Please note that with this method your array index range must not start at 0, because awk starts counting line numbers from 1.
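If a zero-based array range is needed anyway (for example 0-2), one possible workaround is to add 1 to the index before comparing it with NR. The sketch below is an assumption-laden illustration rather than the recommended method: nth_line is a hypothetical helper, the list is written to a temporary file purely for the demonstration, and the index defaults to 0 when SLURM_ARRAY_TASK_ID is not set so that it can be tried outside of a job.

```shell
#!/bin/bash

# nth_line FILE INDEX: print line INDEX+1 of FILE, so that index 0 maps to line 1.
# Hypothetical helper for a zero-based --array range such as 0-2.
nth_line() {
    awk -v n="$(( $2 + 1 ))" 'NR == n' "$1"
}

# Demo list holding the three file names used in the example above.
list=$(mktemp)
printf '%s\n' testdata.txt aninputfile.txt moredata.txt > "$list"

# Slurm sets SLURM_ARRAY_TASK_ID in a real job; fall back to 0 for a local test.
task_id=${SLURM_ARRAY_TASK_ID:-0}
file=$(nth_line "$list" "$task_id")

# An empty result means the index points past the end of the list.
if [ -z "$file" ]; then
    echo "No input file for array index ${task_id}" >&2
else
    echo "$file"
fi
```

Checking for an empty result is worthwhile in the one-based version too, since an array range longer than input.lst would otherwise call the script with no input file.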

A quick way to create a file containing a list of your input files is to redirect the output of ls. For example, say you had more than 100 .csv files in your current directory that you wanted to spread across an array job. You could create your input list file with the command ls *.csv > input.lst.
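Once the list exists, its line count gives the upper bound of the array range. The sketch below builds the list with printf instead of ls (a stylistic choice that avoids parsing ls output) and prints the sbatch command rather than running it. The scratch directory and the sample .csv names are made up for the demonstration, and sub.sh is the submission script name used in the examples above. It relies on sbatch accepting --array on the command line, where it takes precedence over the #SBATCH --array line in the script.

```shell
#!/bin/bash

# Work in a scratch directory so the demo does not touch real data.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch a.csv b.csv "c d.csv"   # made-up sample files, one with a space in its name

# Write one file name per line. Names containing newlines would still
# break a line-based list like this one.
printf '%s\n' *.csv > input.lst

# Count the lines to get the upper bound of the array range.
n=$(wc -l < input.lst)
n=$((n))   # strip any padding some wc implementations add

# Print the submission command instead of running it.
echo "sbatch --array=1-${n} sub.sh"
```

Deriving the range from the list keeps the two in step, so adding or removing input files only requires regenerating input.lst.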

Further Reading

For more information on Slurm array jobs, please see Slurm's job array documentation (https://slurm.schedmd.com/job_array.html).