Monitoring Jobs on Sapelo2
Pending or Running Jobs

squeue and sq

The easiest way to monitor pending or running jobs is with the Slurm squeue command. Like most Slurm commands, you are able to control the columns displayed in the output of this command (see man squeue for more information). To save you the trouble of typing a long format string and to make things more convenient, we've created the sq command, which is squeue but pre-formatted and with some additional options for convenience.

The key thing to remember about squeue/sq is that without any options, it shows ALL currently running and pending jobs on the cluster. In order to show only your currently running and pending jobs, you will want to use the --me option.


The default squeue columns are as follows:

JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

Using sq runs the squeue command but provides the following columns:

JOBID      NAME            PARTITION        USER       NODES  CPUS   MIN_MEMORY   PRIORITY   TIME            TIME_LIMIT      STATE      NODELIST(REASON)

As you can see, you're able to get much more useful information with sq than with just the default squeue formatting.
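
For comparison, here is a minimal sketch of what typing an equivalent squeue command by hand might look like, using squeue's --Format option (this is an illustrative assumption; the exact format string that sq uses internally may differ):

# assumed approximation of what `sq --me` saves you from typing
squeue --me --Format="JobID,Name,Partition,UserName,NumNodes,NumCPUs,MinMemory,PriorityLong,TimeUsed,TimeLimit,State,ReasonList"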


Output Columns Explained

  • JOBID: The unique ID of the job (for an array job, it will be of the form "<base_job_id>_<index>").
  • NAME: The name of the job. If not specified in one's submission script, it will default to the name of the submission script (e.g. "sub.sh").
  • PARTITION: The partition to which the job was sent (e.g. batch, highmem_p, gpu_p, etc.).
  • USER: The user who submitted the job.
  • NODES: The number of nodes allocated to the job.
  • CPUS: The number of CPU cores allocated to the job.
  • MIN_MEMORY: The total amount of memory allocated to the job.
  • PRIORITY: The job's priority per Slurm's Multifactor Priority Plugin.
  • TIME: How much (wall-clock) time has elapsed since the job started, in the format DAYS-HOURS:MINUTES:SECONDS.
  • TIME_LIMIT: The maximum time given for the job to run, in the format DAYS-HOURS:MINUTES:SECONDS.
  • STATE: The job's state (e.g. Running, Pending, etc.).
  • NODELIST(REASON): The name of the node(s) on which the job is running or, if the job is pending, the reason it has not started yet.


sq also has a -h/--help option:

bc06026@ss-sub3 ~$ sq --help

Usage: sq [OPTIONS]

Descriptions: sq - preformatted wrapper for squeue.  See man squeue for more information.

    -j                          Displays squeue output for a given job
    --me                        Displays squeue output for the user executing this command
    -p                          Displays squeue output for a given partition
    -u                          Displays squeue output for a given user
    -T                          Displays submit and start time columns
    -h, --help                  Displays this help output


Examples

  • See all pending and running jobs: sq
  • See all of your pending and running jobs: sq --me
  • See all pending and running jobs in the highmem_p partition: sq -p highmem_p
  • See all of your pending and running jobs in the batch partition: sq --me -p batch
  • See all of your pending and running jobs, including submit time and start time columns: sq --me -T (Note: this requires a wide monitor or a small font to display without the columns wrapping around.)
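
Beyond the options above, sq can be combined with standard shell tools. For example, here is a minimal sketch of a self-refreshing view (watch is a standard Linux utility, not part of sq itself):

# re-run `sq --me` every 30 seconds; press Ctrl-C to stop
watch -n 30 "sq --me"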


Example sq output:

bc06026@ss-sub3 ~$ sq
JOBID	  NAME	        PARTITION	USER	  NODES	  CPUS	  MIN_MEMORY	PRIORITY     TIME	  TIME_LIMIT	STATE	  NODELIST(REASON)
4581410	  Bowtie2-test	batch	        zp21982	  1	  1	  12G	        6003	     2:10:56	  10:00:00	RUNNING	  c5-4
4584815	  test-job	highmem_p	rt12352	  1	  12	  300G	        5473	     1:51:03	  2:00:00	RUNNING	  d3-9
4578428	  PR6_Cd3	batch	        un12354	  1	  1	  40G	        5449	     4:57:15	  1-2:00:00	RUNNING	  c4-16
4583491	  interact	inter_p	        ai38821	  1	  4	  2G	        5428	     1:57:38	  12:00:00	RUNNING	  d5-21
4580374	  BLAST	        batch	        gh98762	  1	  1	  10G	        5397	     2:54:41	  12:00:00	RUNNING	  b1-9

...

Example sq output for an array job:

bc06026@b1-24 workdir$ sq --me
JOBID         NAME            PARTITION        USER       NODES  CPUS   MIN_MEMORY   PRIORITY   TIME            TIME_LIMIT      STATE      NODELIST(REASON)
4711132_4     array-example   batch            bc06026    1      1      1G           5993       0:06            1:00:00         RUNNING    c5-19
4711132_3     array-example   batch            bc06026    1      1      1G           5993       0:06            1:00:00         RUNNING    c5-19
4711132_2     array-example   batch            bc06026    1      1      1G           5993       0:06            1:00:00         RUNNING    c5-19
4711132_1     array-example   batch            bc06026    1      1      1G           5993       0:06            1:00:00         RUNNING    c4-15
4711132_0     array-example   batch            bc06026    1      1      1G           5993       0:06            1:00:00         RUNNING    c4-15



Back to Top


scontrol show job

Another option for viewing information about a pending or running job is scontrol show job JOBID, replacing JOBID with your job's ID. Instead of the row/column output of squeue and sq, this command prints one or more key/value pairs of information per line. It returns output while the job is pending or running, and for only a few moments after the job has completed. Running it on a job that finished more than a few moments ago returns "slurm_load_jobs error: Invalid job id specified", which just means that the job is too old for scontrol show job to display information about it.
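
Because the output is plain key/value text, individual fields are easy to extract with standard tools. A minimal sketch, using the example job ID from the output below:

# show only the state, run time, and node-related lines for job 4707896
scontrol show job 4707896 | grep -E 'JobState|RunTime|NodeList'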


Example scontrol show job output:

bc06026@b1-24 workdir$ scontrol show job 4707896
JobId=4707896 JobName=testjob
   UserId=bc06026(3356) GroupId=gacrc-appadmin(21003) MCS_label=N/A
   Priority=5993 Nice=0 Account=gacrc-instruction QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:23 TimeLimit=01:00:00 TimeMin=N/A
   SubmitTime=2021-09-17T09:53:30 EligibleTime=2021-09-17T09:53:30
   AccrueTime=2021-09-17T09:53:30
   StartTime=2021-09-17T09:53:30 EndTime=2021-09-17T10:53:30 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-09-17T09:53:30
   Partition=batch AllocNode:Sid=b1-24:36515
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=c4-15
   BatchHost=c4-15
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=4G,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=4G MinTmpDiskNode=0
   Features=[Gamma|Beta|Delta|Alpha] DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/scratch/bc06026/workdir/sub.sh
   WorkDir=/scratch/bc06026/workdir
   StdErr=/scratch/bc06026/workdir/testjob_4707896.err
   StdIn=/dev/null
   StdOut=/scratch/bc06026/workdir/testjob_4707896.out
   Power=
   MailUser=bc06026@uga.edu MailType=ALL
   NtasksPerTRES:0

As you can see, a little more information is presented here than with squeue and sq, such as the path to the job's working directory, the Slurm job output file path(s), and email notification settings.



Back to Top


Finished Jobs

sacct and sacct-gacrc

The easiest way to monitor finished jobs is with the Slurm sacct command. Like most Slurm commands, you are able to control the columns displayed in the output of this command (see man sacct for more information). To save you the trouble of typing a long format string and to make things more convenient, we've created the sacct-gacrc command, which is sacct but pre-formatted and with some additional options for convenience.

A big difference between squeue/sq and sacct/sacct-gacrc is that, by default, sacct/sacct-gacrc without any options only shows YOUR jobs. Another important note about sacct/sacct-gacrc is that by default they display Slurm job steps. Unless you're dividing your job into steps with srun, you will probably want sacct/sacct-gacrc to display one line per job (hide job steps, only show the job allocation). To do this, use the -X option. For more information on Slurm job allocation, please see the Slurm documentation.
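
For example, a minimal sketch contrasting the two views (4707896 is the job ID used elsewhere on this page):

# one line per job step (e.g. the batch and extern steps)
sacct -j 4707896

# one line for the whole job allocation
sacct -X -j 4707896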


The default sacct columns are as follows:

JobID    JobName  Partition    Account  AllocCPUS      State ExitCode

Using sacct-gacrc runs the sacct command but provides the following columns:

JobID         JobName      User  Partition NNode NCPUS   ReqMem    CPUTime    Elapsed  Timelimit      State ExitCode   NodeList

As you can see, you're able to get much more useful information with sacct-gacrc than with just the default sacct formatting.


Output Columns Explained

  • JobID: The unique ID of the job.
  • JobName: The name of the job. If not specified in one's submission script, it will default to the name of the submission script (e.g. "sub.sh").
  • User: The user who submitted the job.
  • Partition: The partition to which the job was sent (e.g. batch, highmem_p, gpu_p, etc.).
  • NNode: The number of nodes allocated to the job.
  • NCPUS: The number of CPU cores allocated to the job.
  • ReqMem: The amount of memory allocated to the job.
  • CPUTime: The CPU time used by the job (elapsed time * CPU core count).
  • Elapsed: How much (wall-clock) time has elapsed since the job started, in the format DAYS-HOURS:MINUTES:SECONDS.
  • Timelimit: The maximum time given for the job to run, in the format DAYS-HOURS:MINUTES:SECONDS.
  • State: The job's state (e.g. Running, Pending, etc.).
  • ExitCode: The job's exit code.
  • NodeList: The name of the node(s) on which the job is running or ran.
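
If you are curious what the wrapper roughly saves you from typing, a hand-typed sacct command requesting similar columns might look like the sketch below (an illustrative assumption; the exact format string sacct-gacrc uses may differ):

# assumed approximation of sacct-gacrc's columns (add -X to hide job steps)
sacct --format=JobID,JobName,User,Partition,NNodes,NCPUS,ReqMem,CPUTime,Elapsed,Timelimit,State,ExitCode,NodeList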


sacct-gacrc also has a -h/--help option:

bc06026@ss-sub3 ~$ sacct-gacrc --help

Usage: sacct-gacrc [OPTIONS]

Description: preformatted wrapper for sacct.  See man sacct for more information. 

    -E, --endtime               Display information about jobs up to a date, in the format of yyyy-mm-dd (default: now)
    -j, --jobs                  Display information about a particular job or jobs (comma-separated list if more than one job)
    -r, --partition             Display information about jobs from a particular partition
    -S, --starttime             Display information about jobs starting from a date in the format of yyyy-mm-dd (default: Midnight of today)
    -T                          Display the end time of a particular job or jobs
    -u, --user                  Display information about a particular user's job(s) (default: current user)
    -X, --allocations           Only show one line per job (do not display job steps)
    --debug                     Display the sacct command being executed
    -h, --help                  Display this help output


Examples

  • See information about all of your jobs that started from midnight up to now: sacct-gacrc
  • See information about a particular job: sacct-gacrc -j JOBID (replacing JOBID with a particular job ID)
  • See information about all of your jobs that started from midnight up to now in the highmem_p: sacct-gacrc -r highmem_p
  • See information about your jobs that started from a particular date up to now: sacct-gacrc -S YYYY-MM-DD (replacing YYYY-MM-DD with a date, e.g. 2021-09-01)
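
The options above can also be combined. A usage sketch built only from the documented options (the dates and partition are placeholders):

# one line per job that you ran in the batch partition between two dates
sacct-gacrc -X -r batch -S 2021-09-01 -E 2021-09-14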


Example sacct-gacrc output:

bc06026@b1-24 ~$ sacct-gacrc -X -S 2021-09-14
       JobID    JobName      User  Partition   NodeList AllocNodes NTask NCPUS  ReqMem  MaxVMSize      State    CPUTime    Elapsed  Timelimit ExitCode 
------------ ---------- --------- ---------- ---------- ---------- ----- ----- ------- ---------- ---------- ---------- ---------- ---------- -------- 
4580375        interact   bc06026  highmem_p     ra4-22          1           1   200Gn            FAILED       00:00:07   00:00:07   12:00:00      1:0 
4580382        interact   bc06026  highmem_p      d1-22          1          28   200Gn            COMPLETED    00:03:16   00:00:07   12:00:00      0:0 
4584992        interact   bc06026    inter_p      c4-16          1           1     2Gn            COMPLETED    00:00:18   00:00:18   12:00:00      0:0 

...



Back to Top


seff

seff is a command that can be used to check how efficiently a finished job used the CPU and memory resources it was given. This is a very useful command, as it gives insight into tuning your job's resource requests. It is important to note that seff is only useful after a job has finished. Using it on a job that is still running will not return an error message, but the CPU and memory usage will be shown as 0.00%, and a warning will be appended to the bottom of the output: "WARNING: Efficiency statistics may be misleading for RUNNING jobs."
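
seff takes a single job ID, so checking every task of an array job means looping over the indices. A minimal sketch reusing the array job 4711132 shown earlier, assuming seff accepts the JOBID_INDEX form displayed by sq (the index range is illustrative):

# report efficiency for array tasks 4711132_0 through 4711132_4
for i in 0 1 2 3 4; do
    seff "4711132_${i}"
done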

Example seff output:

bc06026@b1-24 workdir$ seff 4707896
Job ID: 4707896
Cluster: tc2
User/Group: bc06026/gacrc-appadmin
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 00:09:25
CPU Efficiency: 93.08% of 00:10:07 core-walltime
Job Wall-clock time: 00:10:07
Memory Utilized: 183.43 MB
Memory Efficiency: 4.48% of 4.00 GB

The key information shown in the output above is in the last six rows. We see that the job was allocated 1 CPU core, and that core was doing something for 9 minutes and 25 seconds out of the job's total 10 minutes and 7 seconds of wall-clock (elapsed) time (93.08% of the job's wall-clock time). Generally you want to aim for as much utilization of the resources you request as possible, so perhaps this job could run twice as fast given two cores, if the software/command(s) it ran parallelized perfectly. The memory utilized by the job in the output above is very low. Four gigabytes of RAM is not much at all in the context of high performance computing, but say for example you requested 100 GB of memory and found that your job was only using ~5% of it. In that case you would definitely want to consider lowering the amount of memory you request for that particular job if it is run again in the future. For more information on job resource tuning, please see Best Practices on Sapelo2 and Job Resource Tuning. Please note that if your job ends abruptly because it ran out of memory, the seff memory utilization values may not reflect that, as a sudden spike in memory usage could cause the job to be killed before the spike in consumed memory is recorded.
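
To illustrate acting on these numbers, here is a hypothetical submission-script header (these #SBATCH lines are illustrative, not the actual script for job 4707896): since the job used only 183.43 MB of its 4 GB, the memory request could be lowered on the next run while keeping some headroom:

#!/bin/bash
#SBATCH --job-name=testjob
#SBATCH --partition=batch
#SBATCH --ntasks=1
#SBATCH --time=01:00:00
# lowered from --mem=4G: seff showed ~183 MB actually used, so 1 GB leaves ample headroom
#SBATCH --mem=1G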



Back to Top