Monitoring Jobs on Sapelo2
__TOC__

= Pending or Running Jobs =

== <code>squeue</code> and <code>sq</code> ==
The easiest way to monitor pending or running jobs is with the Slurm <code>squeue</code> command. Like most Slurm commands, you can control which columns appear in its output (see <code>man squeue</code> for more information). To save you the trouble of typing a long format string, we've created the <code>sq</code> command: <code>squeue</code> pre-formatted with more useful columns, plus some additional options for convenience.
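For reference, you can approximate what <code>sq</code> does yourself with <code>squeue</code>'s <code>-o</code>/<code>--format</code> option. The format string below is only an illustrative sketch built from standard <code>squeue</code> format fields; the exact string the <code>sq</code> wrapper uses may differ:

<pre class="gcomment">
# Illustrative only -- the real format string used by sq may differ
squeue -o "%.10i %.20j %.10P %.10u %.6D %.5C %.11m %.10Q %.12M %.12l %.10T %R"
</pre>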
The key thing to remember about <code>squeue</code>/<code>sq</code> is that without any options, it shows ALL currently running and pending jobs on the cluster. To show only your own running and pending jobs, use the <code>--me</code> option.
The default <code>squeue</code> columns are as follows:

<pre class="gcomment">
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
</pre>
Using <code>sq</code> runs the <code>squeue</code> command but provides the following columns:

<pre class="gcomment">
JOBID NAME PARTITION USER NODES CPUS MIN_MEMORY PRIORITY TIME TIME_LIMIT STATE NODELIST(REASON)
</pre>
As you can see, you're able to get much more useful information with <code>sq</code> than with just the default <code>squeue</code> formatting.

'''Output Columns Explained'''
* '''JOBID''': The unique ID of the job (for an array job, it will be of the form "<base_job_id>_<index>").
* '''NAME''': The name of the job. If not specified in one's submission script, it defaults to the name of the submission script (e.g. "sub.sh").
* '''PARTITION''': The partition to which the job was sent (e.g. batch, highmem_p, gpu_p, etc.).
* '''USER''': The user who submitted the job.
* '''NODES''': The number of nodes allocated to the job.
* '''CPUS''': The number of CPU cores allocated to the job.
* '''MIN_MEMORY''': The total amount of memory allocated to the job.
* '''PRIORITY''': The job's priority per Slurm's [https://slurm.schedmd.com/priority_multifactor.html Multifactor Priority Plugin].
* '''TIME''': How much (wall-clock) time has elapsed since the job started, in the format DAYS-HOURS:MINUTES:SECONDS.
* '''TIME_LIMIT''': The maximum time given for the job to run, in the format DAYS-HOURS:MINUTES:SECONDS.
* '''STATE''': The job's state (e.g. Running, Pending, etc.).
* '''NODELIST(REASON)''': The name of the node(s) on which the job is running or, if it is pending, the reason it has not started yet.
<code>sq</code> also has a <code>-h</code>/<code>--help</code> option:

<pre class="gcomment">
bc06026@ss-sub3 ~$ sq --help
Usage: sq [OPTIONS]
Descriptions: sq - preformatted wrapper for squeue. See man squeue for more information.
  -j          Displays squeue output for a given job
  --me        Displays squeue output for the user executing this command
  -p          Displays squeue output for a given partition
  -u          Displays squeue output for a given user
  -T          Displays submit and start time columns
  -h, --help  Displays this help output
</pre>
<big><big>'''Examples'''</big></big>

* See all pending and running jobs: <code>sq</code>
* See all of your pending and running jobs: <code>sq --me</code>
* See all pending and running jobs in the highmem_p partition: <code>sq -p highmem_p</code>
* See all of your pending and running jobs in the batch partition: <code>sq --me -p batch</code>
* See all of your pending and running jobs, including submit time and start time columns: <code>sq --me -T</code> (note: this requires a wide monitor or a small font to display without the columns wrapping)
<big>'''Example <code>sq</code> output:'''</big>

<pre class="gcomment">
bc06026@ss-sub3 ~$ sq
  JOBID          NAME  PARTITION     USER  NODES  CPUS  MIN_MEMORY  PRIORITY     TIME  TIME_LIMIT    STATE  NODELIST(REASON)
4581410  Bowtie2-test      batch  zp21982      1     1         12G      6003  2:10:56    10:00:00  RUNNING  c5-4
4584815      test-job  highmem_p  rt12352      1    12        300G      5473  1:51:03     2:00:00  RUNNING  d3-9
4578428       PR6_Cd3      batch  un12354      1     1         40G      5449  4:57:15   1-2:00:00  RUNNING  c4-16
4583491      interact    inter_p  ai38821      1     4          2G      5428  1:57:38    12:00:00  RUNNING  d5-21
4580374         BLAST      batch  gh98762      1     1         10G      5397  2:54:41    12:00:00  RUNNING  b1-9
...
</pre>
<big>'''Example <code>sq</code> output for an array job:'''</big>

<pre class="gcomment">
bc06026@b1-24 workdir$ sq --me
    JOBID           NAME  PARTITION     USER  NODES  CPUS  MIN_MEMORY  PRIORITY  TIME  TIME_LIMIT    STATE  NODELIST(REASON)
4711132_4  array-example      batch  bc06026      1     1          1G      5993  0:06     1:00:00  RUNNING  c5-19
4711132_3  array-example      batch  bc06026      1     1          1G      5993  0:06     1:00:00  RUNNING  c5-19
4711132_2  array-example      batch  bc06026      1     1          1G      5993  0:06     1:00:00  RUNNING  c5-19
4711132_1  array-example      batch  bc06026      1     1          1G      5993  0:06     1:00:00  RUNNING  c4-15
4711132_0  array-example      batch  bc06026      1     1          1G      5993  0:06     1:00:00  RUNNING  c4-15
</pre>
----

[[#top|Back to Top]]
== <code>scontrol show job</code> ==
Another option for viewing information about a pending or running job is <code>scontrol show job ''JOBID''</code>, replacing ''JOBID'' with your job's ID. This command prints one or more key/value pairs of information about the job per line (as opposed to the row/column output of <code>squeue</code> and <code>sq</code>). It returns output while the job is pending or running, and for only a few moments after the job has completed. Running it against a job that finished more than a few moments ago returns "slurm_load_jobs error: Invalid job id specified", which just means the job is too old for <code>scontrol show job</code> to display information about it.
<big>'''Example <code>scontrol show job</code> output:'''</big>
<pre class="gcomment"> | <pre class="gcomment"> | ||
JobId= | bc06026@b1-24 workdir$ scontrol show job 4707896 | ||
UserId= | JobId=4707896 JobName=testjob | ||
Priority= | UserId=bc06026(3356) GroupId=gacrc-appadmin(21003) MCS_label=N/A | ||
Priority=5993 Nice=0 Account=gacrc-instruction QOS=normal | |||
JobState=RUNNING Reason=None Dependency=(null) | JobState=RUNNING Reason=None Dependency=(null) | ||
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0 | Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0 | ||
RunTime=00: | RunTime=00:00:23 TimeLimit=01:00:00 TimeMin=N/A | ||
SubmitTime= | SubmitTime=2021-09-17T09:53:30 EligibleTime=2021-09-17T09:53:30 | ||
StartTime= | AccrueTime=2021-09-17T09:53:30 | ||
StartTime=2021-09-17T09:53:30 EndTime=2021-09-17T10:53:30 Deadline=N/A | |||
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-09-17T09:53:30 | |||
Partition=batch AllocNode:Sid= | Partition=batch AllocNode:Sid=b1-24:36515 | ||
ReqNodeList=(null) ExcNodeList=(null) | ReqNodeList=(null) ExcNodeList=(null) | ||
NodeList= | NodeList=c4-15 | ||
BatchHost= | BatchHost=c4-15 | ||
NumNodes= | NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:* | ||
TRES=cpu= | TRES=cpu=1,mem=4G,node=1,billing=1 | ||
Socks/Node=* NtasksPerN:B:S:C= | Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=* | ||
MinCPUsNode= | MinCPUsNode=1 MinMemoryNode=4G MinTmpDiskNode=0 | ||
Features= | Features=[Gamma|Beta|Delta|Alpha] DelayBoot=00:00:00 | ||
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null) | OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null) | ||
Command=/scratch/ | Command=/scratch/bc06026/workdir/sub.sh | ||
WorkDir=/scratch/ | WorkDir=/scratch/bc06026/workdir | ||
StdErr=/scratch/ | StdErr=/scratch/bc06026/workdir/testjob_4707896.err | ||
StdIn=/dev/null | StdIn=/dev/null | ||
StdOut=/scratch/ | StdOut=/scratch/bc06026/workdir/testjob_4707896.out | ||
Power= | Power= | ||
MailUser=bc06026@uga.edu MailType=ALL | |||
NtasksPerTRES:0 | |||
</pre> | </pre> | ||
As you can see, a bit more information is presented here than with <code>squeue</code> and <code>sq</code>, such as the path to the job's working directory, the Slurm job output file path(s), email notification settings, etc.
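Because this output is plain key/value text, you can pull out individual fields with standard command-line tools. For example, one quick (illustrative) way to find where a job is writing its output and error files:

<pre class="gcomment">
# Show only the StdOut/StdErr file paths for job 4707896
scontrol show job 4707896 | grep -E 'StdOut|StdErr'
</pre>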
----

[[#top|Back to Top]]
= Finished Jobs =

== <code>sacct</code> and <code>sacct-gacrc</code> ==
The easiest way to monitor finished jobs is with the Slurm <code>sacct</code> command. Like most Slurm commands, you can control which columns appear in its output (see <code>man sacct</code> for more information). To save you the trouble of typing a long format string, we've created the <code>sacct-gacrc</code> command: <code>sacct</code> pre-formatted with more useful columns, plus some additional options for convenience.
A big difference between <code>squeue</code>/<code>sq</code> and <code>sacct</code>/<code>sacct-gacrc</code> is that by default, <code>sacct</code>/<code>sacct-gacrc</code> without any options only shows YOUR jobs. Another important note is that by default they display Slurm job ''steps''. Unless you're dividing your job into steps with <code>srun</code>, you will probably want one line per job (hide job steps, only show the job allocation); to do this, use the <code>-X</code> option. For more information on Slurm job allocation, please see the Slurm [https://slurm.schedmd.com/job_launch.html documentation].
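To illustrate the difference the <code>-X</code> option makes, here is a sketch using a hypothetical job ID. Without <code>-X</code>, a batch job typically appears as several rows (the allocation itself plus steps such as <code>JOBID.batch</code> and <code>JOBID.extern</code>); with <code>-X</code>, it appears as one:

<pre class="gcomment">
# One row per job step (default): the allocation plus steps like
# 4707896.batch and 4707896.extern each get their own line
sacct -j 4707896

# One row per job allocation: job steps are hidden
sacct -X -j 4707896
</pre>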
<pre class=" | The default <code>sacct</code> columns are as follows: | ||
<pre class="gcomment"> | |||
JobID JobName Partition Account AllocCPUS State ExitCode | |||
</pre> | </pre> | ||
Using <code>sacct-gacrc</code> runs the <code>sacct</code> command but provides the following columns:

<pre class="gcomment">
JobID JobName User Partition NNode NCPUS ReqMem CPUTime Elapsed Timelimit State ExitCode NodeList
</pre>
As you can see, you're able to get much more useful information with <code>sacct-gacrc</code> than with just the default <code>sacct</code> formatting.

'''Output Columns Explained'''

* '''JobID''': The unique ID of the job.
* '''JobName''': The name of the job. If not specified in one's submission script, it defaults to the name of the submission script (e.g. "sub.sh").
* '''User''': The user who submitted the job.
* '''Partition''': The partition to which the job was sent (e.g. batch, highmem_p, gpu_p, etc.).
* '''NNode''': The number of nodes allocated to the job.
* '''NCPUS''': The number of CPU cores allocated to the job.
* '''ReqMem''': The amount of memory allocated to the job.
* '''CPUTime''': The time used by the job, computed as elapsed time multiplied by CPU core count (e.g. a 4-core job that ran for 2 hours shows a CPUTime of 08:00:00).
* '''Elapsed''': How much (wall-clock) time has elapsed since the job started, in the format DAYS-HOURS:MINUTES:SECONDS.
* '''Timelimit''': The maximum time given for the job to run, in the format DAYS-HOURS:MINUTES:SECONDS.
* '''State''': The job's state (e.g. Running, Pending, etc.).
* '''ExitCode''': The job's exit code.
* '''NodeList''': The name of the node(s) on which the job is running or ran.
<code>sacct-gacrc</code> also has a <code>-h</code>/<code>--help</code> option:

<pre class="gcomment">
bc06026@ss-sub3 ~$ sacct-gacrc --help
Usage: sacct-gacrc [OPTIONS]
Description: preformatted wrapper for sacct. See man sacct for more information.
  -E, --endtime      Display information about jobs up to a date, in the format of yyyy-mm-dd (default: now)
  -j, --jobs         Display information about a particular job or jobs (comma-separated list if more than one job)
  -r, --partition    Display information about jobs from a particular partition
  -S, --starttime    Display information about jobs starting from a date in the format of yyyy-mm-dd (default: midnight of today)
  -T                 Display the end time of a particular job or jobs
  -u, --user         Display information about a particular user's job(s) (default: current user)
  -X, --allocations  Only show one line per job (do not display job steps)
  --debug            Display the sacct command being executed
  -h, --help         Display this help output
</pre>
<big><big>'''Examples'''</big></big>

* See information about all of your jobs that started from midnight up to now: <code>sacct-gacrc</code>
* See information about a particular job: <code>sacct-gacrc -j JOBID</code> (replacing JOBID with a particular job ID)
* See information about all of your jobs in the highmem_p partition that started from midnight up to now: <code>sacct-gacrc -r highmem_p</code>
* See information about your jobs that ran from a particular date up to now: <code>sacct-gacrc -S YYYY-MM-DD</code> (replacing YYYY-MM-DD with a date, e.g. 2021-09-01)
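The <code>-S</code> and <code>-E</code> options can also be combined to bound a time window. For example, a sketch (with hypothetical dates) listing each of your jobs from the first week of September 2021 on one line:

<pre class="gcomment">
# One line per job of yours that ran between 2021-09-01 and 2021-09-07
sacct-gacrc -X -S 2021-09-01 -E 2021-09-07
</pre>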
<big>'''Example <code>sacct-gacrc</code> output:'''</big>

<pre class="gcomment">
bc06026@b1-24 ~$ sacct-gacrc -X -S 2021-09-14
       JobID    JobName      User  Partition   NodeList AllocNodes NTask NCPUS  ReqMem  MaxVMSize      State    CPUTime    Elapsed  Timelimit ExitCode
------------ ---------- --------- ---------- ---------- ---------- ----- ----- ------- ---------- ---------- ---------- ---------- --------
     4580375   interact   bc06026  highmem_p     ra4-22          1           1   200Gn                FAILED   00:00:07   00:00:07   12:00:00      1:0
     4580382   interact   bc06026  highmem_p      d1-22          1          28   200Gn             COMPLETED   00:03:16   00:00:07   12:00:00      0:0
     4584992   interact   bc06026    inter_p      c4-16          1           1     2Gn             COMPLETED   00:00:18   00:00:18   12:00:00      0:0
...
</pre>
Please note that <code>sacct</code> and <code>sacct-gacrc</code> also show jobs in a RUNNING state.
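If you want to filter by state, plain <code>sacct</code> has a <code>-s</code>/<code>--state</code> option (note this is a standard <code>sacct</code> option, not one of the <code>sacct-gacrc</code> options listed above). A sketch:

<pre class="gcomment">
# Show your currently running jobs, one line per job
sacct -X -s RUNNING
</pre>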
----

[[#top|Back to Top]]
== <code>seff</code> ==

<code>seff</code> is a command that can be used to check how efficiently a finished job used the CPU and memory resources it was given. This is a very useful command, as it gives insight into tuning your job's resource requests. It is important to note that <code>seff</code> is only useful after a job has finished. Using it with a job that is still running will not return an error message, but the CPU and memory usage will be shown as 0.00%, and a warning will be appended to the bottom of the output: "WARNING: Efficiency statistics may be misleading for RUNNING jobs."
<big>'''Example <code>seff</code> output:'''</big>

<pre class="gcomment">
bc06026@b1-24 workdir$ seff 4707896
Job ID: 4707896
Cluster: tc2
User/Group: bc06026/gacrc-appadmin
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 00:09:25
CPU Efficiency: 93.08% of 00:10:07 core-walltime
Job Wall-clock time: 00:10:07
Memory Utilized: 183.43 MB
Memory Efficiency: 4.48% of 4.00 GB
</pre>
The key information in the output above is in the last six rows. We see that the job was allocated 1 CPU core, and that core was busy for 9 minutes and 25 seconds of the job's total 10 minutes and 7 seconds of wall-clock (elapsed) time, i.e. 93.08% of it. Generally you want to utilize as much of the resources you request as possible; if the software/command(s) this job ran parallelized perfectly, it could perhaps run twice as fast given two cores. The memory utilized by the job above is very low. Four gigabytes of RAM is not much in the context of high performance computing, but suppose you had requested 100 GB of memory and found that your job only used ~5% of it; in that case you would definitely want to consider requesting less memory if you run that job again in the future. For more information on job resource tuning, please see [[ Best Practices on Sapelo2 ]] and [[ Job Resource Tuning ]]. Please note that if your job ends abruptly because it ran out of memory, the <code>seff</code> memory utilization values may not reflect that, as a sudden spike in memory usage can kill the job before the spike is recorded.
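If you want to spot-check the efficiency of several recent jobs at once, one illustrative approach is to feed job IDs from <code>sacct</code> into <code>seff</code> in a loop (the date below is hypothetical):

<pre class="gcomment">
# Run seff on each job of yours that started on or after 2021-09-14.
# -n suppresses the header and -X gives one JobID per job allocation.
for jid in $(sacct -X -n -S 2021-09-14 -o JobID); do
    seff "$jid"
    echo "----"
done
</pre>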
----

[[#top|Back to Top]]