Tmp: Difference between revisions

From Research Computing Center Wiki
Revision as of 09:11, 17 September 2021

Pending or Running Jobs

squeue and sq

The easiest way to monitor pending or running jobs is with the Slurm squeue command. As with most Slurm commands, you can control the columns displayed in its output (see man squeue for more information). To save you that trouble, we've created the sq command, which is squeue with pre-formatted output and some additional options for convenience.

The key thing to remember about squeue/sq is that without any options, it shows ALL currently running and pending jobs on the cluster. In order to show only your currently running and pending jobs, you will want to use the --me option.


The default squeue columns are as follows:

JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

Using sq runs the squeue command but provides the following columns:

JOBID      NAME            PARTITION        USER       NODES  CPUS   MIN_MEMORY   PRIORITY   TIME            TIME_LIMIT      STATE      NODELIST(REASON)

As you can see, you're able to get much more useful information with sq than with just the default squeue formatting.


Output Columns Explained

  • JOBID: The unique ID of the job.
  • NAME: The name of the job. If not specified in your submission script, it defaults to the name of the submission script (e.g. "sub.sh").
  • PARTITION: The partition to which the job was sent (e.g. batch, highmem_p, gpu_p).
  • USER: The user who submitted the job.
  • NODES: The number of nodes allocated to the job.
  • CPUS: The number of CPU cores allocated to the job.
  • MIN_MEMORY: The minimum amount of memory requested for the job.
  • PRIORITY: The job's priority per Slurm's Multifactor Priority Plugin.
  • TIME: How much (wall) time has elapsed since the job started, in the format DAYS-HOURS:MINUTES:SECONDS.
  • TIME_LIMIT: The maximum time given for the job to run, in the format DAYS-HOURS:MINUTES:SECONDS.
  • STATE: The job's state (e.g. Running, Pending).
  • NODELIST(REASON): The name of the node(s) on which the job is running or, if the job is pending, the reason it has not started yet.
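
If you are curious how a column set like this can be requested from squeue directly, the sketch below builds a --format string from standard squeue field specifiers. It is only an approximation of what sq produces, not the wrapper's actual source; the function name sq_like_format and the column widths are hypothetical.

```shell
# Build a squeue --format string that approximates the sq columns.
# The field specifiers are standard squeue ones; the widths are arbitrary
# and the real sq wrapper may use different settings.
sq_like_format() {
    # %i JOBID, %j NAME, %P PARTITION, %u USER, %D NODES, %C CPUS,
    # %m MIN_MEMORY, %Q PRIORITY, %M TIME, %l TIME_LIMIT, %T STATE,
    # %R NODELIST(REASON)
    printf '%s' '%.10i %.15j %.12P %.10u %.5D %.5C %.10m %.10Q %.12M %.12l %.10T %R'
}

# On a system with Slurm installed, this could be used as:
#   squeue --me --format="$(sq_like_format)"
```

In day-to-day use, sq remains the simpler choice; a custom format string is mainly useful when you want a different set of columns.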


sq also has a -h/--help option:

bc06026@ss-sub3 ~$ sq --help

Usage: sq [OPTIONS]

Descriptions: sq - preformatted wrapper for squeue.  See man squeue for more information.

    -j                          Displays squeue output for a given job
    --me                        Displays squeue output for the user executing this command
    -p                          Displays squeue output for a given partition
    -u                          Displays squeue output for a given user
    -T                          Displays submit and start time columns
    -h, --help                  Displays this help output


Examples

  • See all pending and running jobs: sq
  • See all of your pending and running jobs: sq --me
  • See all pending and running jobs in the highmem_p partition: sq -p highmem_p
  • See all of your pending and running jobs in the batch partition: sq --me -p batch
  • See all of your pending and running jobs, including submit time and start time columns: sq --me -T (note that this requires a wide terminal or a small font to display without the columns wrapping)


Example sq output:

bc06026@ss-sub3 ~$ sq
JOBID	  NAME	        PARTITION	USER	  NODES	  CPUS	  MIN_MEMORY	PRIORITY     TIME	  TIME_LIMIT	STATE	  NODELIST(REASON)
4581410	  Bowtie2-test	batch	        zp21982	  1	  1	  12G	        6003	     2:10:56	  10:00:00	RUNNING	  c5-4
4584815	  test-job	highmem_p	rt12352	  1	  12	  300G	        5473	     1:51:03	  2:00:00	RUNNING	  d3-9
4578428	  PR6_Cd3	batch	        un12354	  1	  1	  40G	        5449	     4:57:15	  1-2:00:00	RUNNING	  c4-16
4583491	  interact	inter_p	        ai38821	  1	  4	  2G	        5428	     1:57:38	  12:00:00	RUNNING	  d5-21
4580374	  BLAST	        batch	        gh98762	  1	  1	  10G	        5397	     2:54:41	  12:00:00	RUNNING	  b1-9

...
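
Because sq prints whitespace-separated columns, its output can be summarized with standard text tools. The sketch below totals the CPU cores of running jobs per partition, using the five rows from the example above as sample input. The function names sq_sample and cpus_by_partition are hypothetical, and the column positions assume job names contain no spaces.

```shell
# Sample rows copied from the example sq output above.
sq_sample() {
    cat <<'EOF'
JOBID NAME PARTITION USER NODES CPUS MIN_MEMORY PRIORITY TIME TIME_LIMIT STATE NODELIST(REASON)
4581410 Bowtie2-test batch zp21982 1 1 12G 6003 2:10:56 10:00:00 RUNNING c5-4
4584815 test-job highmem_p rt12352 1 12 300G 5473 1:51:03 2:00:00 RUNNING d3-9
4578428 PR6_Cd3 batch un12354 1 1 40G 5449 4:57:15 1-2:00:00 RUNNING c4-16
4583491 interact inter_p ai38821 1 4 2G 5428 1:57:38 12:00:00 RUNNING d5-21
4580374 BLAST batch gh98762 1 1 10G 5397 2:54:41 12:00:00 RUNNING b1-9
EOF
}

# Sum the CPUS column (6) per PARTITION (3) for jobs whose STATE (11)
# is RUNNING, skipping the header line. Output is sorted by partition.
cpus_by_partition() {
    awk 'NR > 1 && $11 == "RUNNING" { cpus[$3] += $6 }
         END { for (p in cpus) print p, cpus[p] }' | sort
}

sq_sample | cpus_by_partition
# batch 3
# highmem_p 12
# inter_p 4
```

On the cluster, the same function could read live data with sq | cpus_by_partition.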



scontrol show job

Another option for viewing information about a pending or running job is scontrol show job JOBID, replacing JOBID with your job's ID. Unlike the row/column output of squeue and sq, this command prints one or more key/value pairs of information about the job per line. It returns output while the job is pending or running, and only for a few moments after the job has completed. Running it on a job that finished more than a few moments ago returns the error "slurm_load_jobs error: Invalid job id specified", which simply means that the job is too old for scontrol show job to display information about it.
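
Since scontrol show job stops working shortly after a job ends, a small helper can try it first and fall back to sacct when it fails. This is a minimal sketch: the function name job_info is hypothetical, and it assumes scontrol exits with a non-zero status when it reports the "Invalid job id specified" error.

```shell
# Show details for a job: try scontrol first, and fall back to sacct
# if Slurm no longer knows about the job.
job_info() {
    jobid="$1"
    if scontrol show job "$jobid" 2>/dev/null; then
        return 0
    fi
    # scontrol failed (for example with "Invalid job id specified"),
    # so query the accounting records instead, one line per job.
    sacct -X -j "$jobid"
}

# Usage on the cluster, replacing JOBID with a job ID:
#   job_info JOBID
```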


Example scontrol show job output:

bc06026@b1-24 workdir$ scontrol show job 4707896
JobId=4707896 JobName=testjob
   UserId=bc06026(3356) GroupId=gacrc-appadmin(21003) MCS_label=N/A
   Priority=5993 Nice=0 Account=gacrc-instruction QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:23 TimeLimit=01:00:00 TimeMin=N/A
   SubmitTime=2021-09-17T09:53:30 EligibleTime=2021-09-17T09:53:30
   AccrueTime=2021-09-17T09:53:30
   StartTime=2021-09-17T09:53:30 EndTime=2021-09-17T10:53:30 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-09-17T09:53:30
   Partition=batch AllocNode:Sid=b1-24:36515
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=c4-15
   BatchHost=c4-15
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=4G,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=4G MinTmpDiskNode=0
   Features=[Gamma|Beta|Delta|Alpha] DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/scratch/bc06026/workdir/sub.sh
   WorkDir=/scratch/bc06026/workdir
   StdErr=/scratch/bc06026/workdir/testjob_4707896.err
   StdIn=/dev/null
   StdOut=/scratch/bc06026/workdir/testjob_4707896.out
   Power=
   MailUser=bc06026@uga.edu MailType=ALL
   NtasksPerTRES:0

As you can see, this output includes more information than squeue and sq provide, such as the path to the job's working directory, the Slurm job output file path(s), and email information.
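
The key/value layout makes it straightforward to pull out a single field. The sketch below extracts one value by key from a few lines of the example output above. The function names job_sample and job_field are hypothetical, and the approach assumes values contain no spaces, which holds for the fields shown here but not necessarily for every job.

```shell
# A few lines copied from the example scontrol show job output above.
job_sample() {
    cat <<'EOF'
JobId=4707896 JobName=testjob
   JobState=RUNNING Reason=None Dependency=(null)
   TRES=cpu=1,mem=4G,node=1,billing=1
   WorkDir=/scratch/bc06026/workdir
   StdOut=/scratch/bc06026/workdir/testjob_4707896.out
EOF
}

# Print the value of one key from key=value text on stdin.
# Each space-separated pair is put on its own line, then the first
# pair whose key matches is printed without the "KEY=" prefix.
job_field() {
    tr -s ' ' '\n' |
        awk -F= -v key="$1" '!found && $1 == key {
            print substr($0, length(key) + 2)
            found = 1
        }'
}

job_sample | job_field WorkDir   # prints /scratch/bc06026/workdir
```

On the cluster, the same function could be fed live output, for example scontrol show job JOBID | job_field StdOut.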





Previously Run Jobs

sacct and sacct-gacrc

The easiest way to view information about previously run jobs is with the Slurm sacct command. As with most Slurm commands, you can control the columns displayed in its output (see man sacct for more information). To save you that trouble, we've created the sacct-gacrc command, which is sacct with pre-formatted output and some additional options for convenience.

A big difference between squeue/sq and sacct/sacct-gacrc is that, without any options, sacct/sacct-gacrc shows only YOUR jobs. Another important note is that by default sacct/sacct-gacrc displays Slurm job steps. Unless you're dividing your job into steps with srun, you will probably want one line per job (hiding job steps and showing only the job allocation). To do this, use the -X option. For more information on Slurm job allocation, please see the Slurm documentation.
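
To make the difference between a job allocation and its steps concrete: in sacct output, a step's JobID is the job's ID followed by a dot and a step name or number (for example .batch or .extern), while the allocation line has no dot. The sketch below uses that convention to tell the two apart. The function names are hypothetical, and in practice the -X option is the simpler way to hide steps.

```shell
# Succeed if a sacct JobID names a job step (e.g. "4580375.batch")
# rather than the job allocation itself (e.g. "4580375").
is_job_step() {
    case "$1" in
        *.*) return 0 ;;
        *)   return 1 ;;
    esac
}

# Keep only lines whose first column is an allocation (no dot),
# roughly what -X does for sacct's own output.
allocations_only() {
    awk '$1 !~ /\./'
}

printf '4580375\n4580375.batch\n4580375.extern\n' | allocations_only
# 4580375
```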


The default sacct columns are as follows:

JobID    JobName  Partition    Account  AllocCPUS      State ExitCode

Using sacct-gacrc runs the sacct command but provides the following columns:

JobID         JobName      User  Partition NNode NCPUS   ReqMem    CPUTime    Elapsed  Timelimit      State ExitCode   NodeList

As you can see, you're able to get much more useful information with sacct-gacrc than with just the default sacct formatting.


Output Columns Explained

  • JobID: The unique ID of the job.
  • JobName: The name of the job. If not specified in your submission script, it defaults to the name of the submission script (e.g. "sub.sh").
  • User: The user who submitted the job.
  • Partition: The partition to which the job was sent (e.g. batch, highmem_p, gpu_p).
  • NNode: The number of nodes allocated to the job.
  • NCPUS: The number of CPU cores allocated to the job.
  • ReqMem: The amount of memory requested for the job.
  • CPUTime: The CPU time used by the job, calculated as the elapsed time multiplied by the number of CPU cores.
  • Elapsed: How much (wall) time has elapsed since the job started, in the format DAYS-HOURS:MINUTES:SECONDS.
  • Timelimit: The maximum time given for the job to run, in the format DAYS-HOURS:MINUTES:SECONDS.
  • State: The job's state (e.g. COMPLETED, FAILED, RUNNING).
  • ExitCode: The job's exit code.
  • NodeList: The name of the node(s) on which the job is running or ran.


sacct-gacrc also has a -h/--help option:

bc06026@ss-sub3 ~$ sacct-gacrc --help

Usage: sacct-gacrc [OPTIONS]

Description: preformatted wrapper for sacct.  See man sacct for more information. 

    -E, --endtime               Display information about jobs up to a date, in the format of yyyy-mm-dd (default: now)
    -j, --jobs                  Display information about a particular job or jobs (comma-separated list if more than one job)
    -r, --partition             Display information about jobs from a particular partition
    -S, --starttime             Display information about jobs starting from a date in the format of yyyy-mm-dd (default: Midnight of today)
    -u, --user                  Display information about a particular user's job(s) (default: current user)
    -X, --allocations           Only show one line per job (do not display job steps)
    --debug                     Display the sacct command being executed
    -h, --help                  Display this help output


Examples

  • See information about all of your jobs that started from midnight up to now: sacct-gacrc
  • See information about a particular job: sacct-gacrc -j JOBID (replacing JOBID with a particular job ID)
  • See information about all of your jobs that started from midnight up to now in the highmem_p partition: sacct-gacrc -r highmem_p
  • See information about your jobs from a particular date up to now: sacct-gacrc -S YYYY-MM-DD (replacing YYYY-MM-DD with a date, e.g. 2021-09-01)
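
If you often look back a fixed number of days, the start date for -S can be computed instead of typed by hand. This is a minimal sketch that assumes GNU date (standard on Linux); the function name days_ago is hypothetical.

```shell
# Print the date N days ago in YYYY-MM-DD form (requires GNU date).
days_ago() {
    date -d "$1 days ago" +%Y-%m-%d
}

# On the cluster, list one line per job for the last 7 days:
#   sacct-gacrc -X -S "$(days_ago 7)"
```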


Example sacct-gacrc output:

bc06026@b1-24 ~$ sacct-gacrc -X -S 2021-09-14
       JobID    JobName      User  Partition   NodeList AllocNodes NTask NCPUS  ReqMem  MaxVMSize      State    CPUTime    Elapsed  Timelimit ExitCode 
------------ ---------- --------- ---------- ---------- ---------- ----- ----- ------- ---------- ---------- ---------- ---------- ---------- -------- 
4580375        interact   bc06026  highmem_p     ra4-22          1           1   200Gn            FAILED       00:00:07   00:00:07   12:00:00      1:0 
4580382        interact   bc06026  highmem_p      d1-22          1          28   200Gn            COMPLETED    00:03:16   00:00:07   12:00:00      0:0 
4584992        interact   bc06026    inter_p      c4-16          1           1     2Gn            COMPLETED    00:00:18   00:00:18   12:00:00      0:0 

...
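
The CPUTime column is the Elapsed time multiplied by NCPUS, which explains why job 4580382 above shows a CPUTime of 00:03:16 for only 7 seconds of wall time on 28 cores (7 x 28 = 196 seconds). The sketch below reproduces that arithmetic; the function names are hypothetical, and the parser assumes times in HH:MM:SS or DAYS-HH:MM:SS form.

```shell
# Convert [DAYS-]HH:MM:SS to a number of seconds.
to_seconds() {
    echo "$1" | awk -F'[-:]' '{
        if (NF == 4) print $1 * 86400 + $2 * 3600 + $3 * 60 + $4
        else         print $1 * 3600 + $2 * 60 + $3
    }'
}

# Convert seconds to HH:MM:SS (days are not split out).
to_hms() {
    awk -v s="$1" 'BEGIN {
        printf "%02d:%02d:%02d\n", int(s / 3600), int((s % 3600) / 60), s % 60
    }'
}

# CPUTime = Elapsed x NCPUS
cpu_time() {
    to_hms $(( $(to_seconds "$1") * $2 ))
}

cpu_time 00:00:07 28   # prints 00:03:16
```

A large gap between CPUTime and the time your program actually kept its cores busy can be a hint that a job requested more cores than it used.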



seff

insert info/examples here


