__TOC__


= Pending or Running Jobs =
==<code>squeue</code> and <code>sq</code>==
The easiest way to monitor pending or running jobs is with the Slurm <code>squeue</code> command.  Like most Slurm commands, you are able to control the columns displayed in the output of this command (see <code>man squeue</code> for more information).  To save you that trouble and to make things more convenient, we've created the <code>sq</code> command, which is <code>squeue</code> but pre-formatted and with some additional options for convenience. 
The key thing to remember about <code>squeue</code>/<code>sq</code> is that without any options, it shows ALL currently running and pending jobs on the cluster.  In order to show only your currently running and pending jobs, you will want to use the <code>--me</code> option.
The default <code>squeue</code> columns are as follows:
<pre class="gcomment">
JOBID PARTITION    NAME    USER ST      TIME  NODES NODELIST(REASON)
</pre>
Using <code>sq</code> runs the <code>squeue</code> command but provides the following columns:
<pre class="gcomment">
JOBID      NAME            PARTITION        USER      NODES  CPUS  MIN_MEMORY  PRIORITY  TIME            TIME_LIMIT      STATE      NODELIST(REASON)
</pre>
As you can see, you're able to get much more useful information with <code>sq</code> than with just the default <code>squeue</code> formatting. 
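If you prefer to build similar output yourself with plain <code>squeue</code>, a rough equivalent of the <code>sq</code> columns can be assembled with the <code>-o</code>/<code>--format</code> option.  This is only an illustrative sketch, not the exact format string <code>sq</code> uses (see <code>man squeue</code> for the full list of format specifiers):
<pre class="gcomment">
# Illustrative only; the column widths here are arbitrary
squeue --me -o "%.10i %.15j %.12P %.10u %.6D %.5C %.11m %.9Q %.12M %.12l %.10T %R"
</pre>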
'''Output Columns Explained'''
* '''JOBID''': The unique ID of the job (for an array job, it will be of the form "<base_job_id>_<index>").
* '''NAME''': The name of the job.  If not specified in one's submission script, it will default to the name of the submission script (e.g. "sub.sh").
* '''PARTITION''': The partition to which the job was sent (e.g. batch, highmem_p, gpu_p, etc...).
* '''USER''': The user who submitted the job.
* '''NODES''': The number of nodes allocated to the job.
* '''CPUS''': The number of CPU cores allocated to the job.
* '''MIN_MEMORY''': The amount of memory requested for the job.
* '''PRIORITY''': The job's priority per Slurm's [https://slurm.schedmd.com/priority_multifactor.html Multifactor Priority Plugin]
* '''TIME''': How much (wall) time has elapsed since the job started, in the format DAYS-HOURS:MINUTES:SECONDS
* '''TIME_LIMIT''': The maximum time given for the job to run, in the format DAYS-HOURS:MINUTES:SECONDS.
* '''STATE''': The job's state (e.g. Running, Pending, etc...)
* '''NODELIST(REASON)''': The name of the node(s) on which the job is running or the reason the job has not started yet, if it is pending.
<code>sq</code> also has a <code>-h</code>/<code>--help</code> option:
<pre class="gcomment">
bc06026@ss-sub3 ~$ sq --help
Usage: sq [OPTIONS]
Descriptions: sq - preformatted wrapper for squeue.  See man squeue for more information.
    -j                          Displays squeue output for a given job
    --me                        Displays squeue output for the user executing this command
    -p                          Displays squeue output for a given partition
    -u                          Displays squeue output for a given user
    -T                          Displays submit and start time columns
    -h, --help                  Displays this help output
</pre>
<big><big>'''Examples'''</big></big>
* See all pending and running jobs: <code>sq</code>
* See all of your pending and running jobs: <code>sq --me</code>
* See all pending and running jobs in the highmem_p partition: <code>sq -p highmem_p</code>
* See all of your pending and running jobs in the batch partition: <code>sq --me -p batch</code>
* See all of your pending and running jobs, including submit time and start time columns: <code>sq --me -T</code> (Note: this will require a wide monitor or a small font to display without the columns wrapping around)
<big>'''Example <code>sq</code> output:'''</big>
<pre class="gcomment">
bc06026@ss-sub3 ~$ sq
JOBID     NAME           PARTITION  USER      NODES  CPUS  MIN_MEMORY  PRIORITY  TIME      TIME_LIMIT  STATE    NODELIST(REASON)
4581410   Bowtie2-test   batch      zp21982   1      1     12G         6003      2:10:56   10:00:00    RUNNING  c5-4
4584815   test-job       highmem_p  rt12352   1      12    300G        5473      1:51:03   2:00:00     RUNNING  d3-9
4578428   PR6_Cd3        batch      un12354   1      1     40G         5449      4:57:15   1-2:00:00   RUNNING  c4-16
4583491   interact       inter_p    ai38821   1      4     2G          5428      1:57:38   12:00:00    RUNNING  d5-21
4580374   BLAST          batch      gh98762   1      1     10G         5397      2:54:41   12:00:00    RUNNING  b1-9
...
</pre>
<big>'''Example <code>sq</code> output for an array job:'''</big>
<pre class="gcomment">
bc06026@b1-24 workdir$ sq --me
JOBID        NAME            PARTITION        USER      NODES  CPUS  MIN_MEMORY  PRIORITY  TIME            TIME_LIMIT      STATE      NODELIST(REASON)
4711132_4    array-example  batch            bc06026    1      1      1G          5993      0:06            1:00:00        RUNNING    c5-19
4711132_3    array-example  batch            bc06026    1      1      1G          5993      0:06            1:00:00        RUNNING    c5-19
4711132_2    array-example  batch            bc06026    1      1      1G          5993      0:06            1:00:00        RUNNING    c5-19
4711132_1    array-example  batch            bc06026    1      1      1G          5993      0:06            1:00:00        RUNNING    c4-15
4711132_0    array-example  batch            bc06026    1      1      1G          5993      0:06            1:00:00        RUNNING    c4-15
</pre>
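For reference, an array job like the one shown above could be produced by a submission script along these lines.  This is only a minimal, hypothetical sketch (the job name, resource requests, and command are placeholders, not the actual script used above):
<pre class="gcomment">
#!/bin/bash
#SBATCH --job-name=array-example   # name shown in the NAME column
#SBATCH --partition=batch          # partition (queue) to submit to
#SBATCH --ntasks=1                 # one task per array element
#SBATCH --mem=1G                   # memory per array element
#SBATCH --time=1:00:00             # time limit per array element
#SBATCH --array=0-4                # creates five tasks, with indices 0 through 4

# Each array task sees its own index in the SLURM_ARRAY_TASK_ID environment variable
echo "Running array task ${SLURM_ARRAY_TASK_ID}"
</pre>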
----
[[#top|Back to Top]]
== <code>scontrol show job</code> ==
Another option for viewing information about a pending or running job is <code>scontrol show job ''JOBID''</code>, replacing ''JOBID'' with your job's ID.  This command displays one or more key/value pairs of information about the job per line, one line after another (as opposed to the row/column output of <code>squeue</code> and <code>sq</code>).  It only returns output while the job is pending or running, and for a few moments after the job has completed.  Using it with a job that finished more than a few moments ago will return the output "slurm_load_jobs error: Invalid job id specified".  This just means that the job is too old for <code>scontrol show job</code> to display information about it.
<big>'''Example <code>scontrol show job</code> output:'''</big>
<pre class="gcomment">
bc06026@b1-24 workdir$ scontrol show job 4707896
JobId=4707896 JobName=testjob
  UserId=bc06026(3356) GroupId=gacrc-appadmin(21003) MCS_label=N/A
  Priority=5993 Nice=0 Account=gacrc-instruction QOS=normal
  JobState=RUNNING Reason=None Dependency=(null)
  Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
  RunTime=00:00:23 TimeLimit=01:00:00 TimeMin=N/A
  SubmitTime=2021-09-17T09:53:30 EligibleTime=2021-09-17T09:53:30
  AccrueTime=2021-09-17T09:53:30
  StartTime=2021-09-17T09:53:30 EndTime=2021-09-17T10:53:30 Deadline=N/A
  SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-09-17T09:53:30
  Partition=batch AllocNode:Sid=b1-24:36515
  ReqNodeList=(null) ExcNodeList=(null)
  NodeList=c4-15
  BatchHost=c4-15
  NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
  TRES=cpu=1,mem=4G,node=1,billing=1
  Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
  MinCPUsNode=1 MinMemoryNode=4G MinTmpDiskNode=0
  Features=[Gamma|Beta|Delta|Alpha] DelayBoot=00:00:00
  OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
  Command=/scratch/bc06026/workdir/sub.sh
  WorkDir=/scratch/bc06026/workdir
  StdErr=/scratch/bc06026/workdir/testjob_4707896.err
  StdIn=/dev/null
  StdOut=/scratch/bc06026/workdir/testjob_4707896.out
  Power=
  MailUser=bc06026@uga.edu MailType=ALL
  NtasksPerTRES:0
</pre>
As you can see, a bit more information is presented here than with <code>squeue</code> and <code>sq</code>, such as the path to the job's working directory, the Slurm job output file path(s), email notification settings, etc...
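Because each line of this output is made up of key/value pairs, it is easy to pull out individual fields with standard shell tools.  For example, using the job above (any field name that appears in the output will work):
<pre class="gcomment">
# Show only the working directory and the job's output/error file paths
scontrol show job 4707896 | grep -E 'WorkDir|StdOut|StdErr'
</pre>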
----
[[#top|Back to Top]]
= Finished Jobs =
== <code>sacct</code> and <code>sacct-gacrc</code> ==
The easiest way to monitor finished jobs is with the Slurm <code>sacct</code> command.  Like most Slurm commands, you are able to control the columns displayed in the output of this command (see <code>man sacct</code> for more information).  To save you that trouble and to make things more convenient, we've created the <code>sacct-gacrc</code> command, which is <code>sacct</code> but pre-formatted and with some additional options for convenience. 
A big difference between <code>squeue</code>/<code>sq</code> and <code>sacct</code>/<code>sacct-gacrc</code> is that by default, <code>sacct</code>/<code>sacct-gacrc</code> without any options only shows YOUR jobs.  Another important note about <code>sacct</code>/<code>sacct-gacrc</code> is that by default it will display Slurm job ''steps''.  Unless you're dividing your job into steps with <code>srun</code>, you will probably want <code>sacct</code>/<code>sacct-gacrc</code> to display one line per job (hide job steps and only show the job allocation).  To do this, use the <code>-X</code> option, as shown below.  For more information on Slurm job allocation, please see the Slurm [https://slurm.schedmd.com/job_launch.html documentation].
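To illustrate the difference, using the example job ID from the <code>scontrol show job</code> section above:
<pre class="gcomment">
# One line per job step (the job itself plus steps such as 4707896.batch and 4707896.extern)
sacct -j 4707896

# One line for the whole job allocation only
sacct -X -j 4707896
</pre>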
The default <code>sacct</code> columns are as follows:
<pre class="gcomment">
JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
</pre>
Using <code>sacct-gacrc</code> runs the <code>sacct</code> command but provides the following columns:
<pre class="gcomment">
JobID        JobName      User  Partition NNode NCPUS  ReqMem    CPUTime    Elapsed  Timelimit      State ExitCode  NodeList
</pre>
As you can see, you're able to get much more useful information with <code>sacct-gacrc</code> than with just the default <code>sacct</code> formatting. 
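As with <code>sq</code>, you can approximate this output using plain <code>sacct</code> and its <code>--format</code> option.  A rough sketch (not the exact command that <code>sacct-gacrc</code> runs; see <code>man sacct</code> for all available fields):
<pre class="gcomment">
# Illustrative only; a field's width can be adjusted with e.g. JobName%30
sacct -X --format=JobID,JobName,User,Partition,NNodes,NCPUS,ReqMem,CPUTime,Elapsed,Timelimit,State,ExitCode,NodeList
</pre>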
'''Output Columns Explained'''
* '''JobID''': The unique ID of the job.
* '''JobName''': The name of the job.  If not specified in one's submission script, it will default to the name of the submission script (e.g. "sub.sh").
* '''User''': The user who submitted the job.
* '''Partition''': The partition to which the job was sent (e.g. batch, highmem_p, gpu_p, etc...).
* '''NNode''': The number of nodes allocated to the job.
* '''NCPUS''': The number of CPU cores allocated to the job.
* '''ReqMem''': The amount of memory requested for the job.
* '''Elapsed''': How much (wall) time the job ran (or has run so far), in the format DAYS-HOURS:MINUTES:SECONDS.
* '''Timelimit''': The maximum time given for the job to run, in the format DAYS-HOURS:MINUTES:SECONDS.
* '''State''': The job's state (e.g. COMPLETED, FAILED, CANCELLED, etc...).
* '''ExitCode''': The job's exit code.
* '''NodeList''': The name of the node(s) on which the job is running or ran.
<code>sacct-gacrc</code> also has a <code>-h</code>/<code>--help</code> option:
<pre class="gcomment">
bc06026@ss-sub3 ~$ sacct-gacrc --help
Usage: sacct-gacrc [OPTIONS]
Description: preformatted wrapper for sacct.  See man sacct for more information.
    -E, --endtime              Display information about jobs up to a date, in the format of yyyy-mm-dd (default: now)
    -j, --jobs                  Display information about a particular job or jobs (comma-separated list if more than one job)
    -r, --partition            Display information about jobs from a particular partition
    -S, --starttime            Display information about jobs starting from a date in the format of yyyy-mm-dd (default: Midnight of today)
    -u, --user                  Display information about a particular user's job(s) (default: current user)
    -X, --allocations          Only show one line per job (do not display job steps)
    --debug                    Display the sacct command being executed
    -h, --help                  Display this help output
</pre>
<big><big>'''Examples'''</big></big>
* See information about all of your jobs that started from midnight up to now: <code>sacct-gacrc</code>
* See information about a particular job: <code>sacct-gacrc -j JOBID</code> (replacing JOBID with a particular job ID)
* See information about all of your jobs that started from midnight up to now in the highmem_p partition: <code>sacct-gacrc -r highmem_p</code>
* See information about your jobs from a particular date up to now: <code>sacct-gacrc -S YYYY-MM-DD</code> (replacing YYYY-MM-DD with a date, e.g. 2021-09-01)
<big>'''Example <code>sacct-gacrc</code> output:'''</big>
<pre class="gcomment">
bc06026@b1-24 ~$ sacct-gacrc -X -S 2021-09-14
       JobID    JobName      User  Partition   NodeList AllocNodes NTask NCPUS  ReqMem  MaxVMSize      State    CPUTime    Elapsed  Timelimit ExitCode
------------ ---------- --------- ---------- ---------- ---------- ----- ----- ------- ---------- ---------- ---------- ---------- ---------- --------
     4580375   interact   bc06026  highmem_p     ra4-22          1           1   200Gn                FAILED   00:00:07   00:00:07   12:00:00      1:0
     4580382   interact   bc06026  highmem_p      d1-22          1          28   200Gn             COMPLETED   00:03:16   00:00:07   12:00:00      0:0
     4584992   interact   bc06026    inter_p      c4-16          1           1     2Gn             COMPLETED   00:00:18   00:00:18   12:00:00      0:0
...
</pre>
----
[[#top|Back to Top]]
== <code>seff</code> ==
<code>seff</code> is a command that can be used to check how efficiently a finished job used the CPU and memory resources it was given.  This is a very useful command, as it gives insight into optimizing your job's resource requests.  It is important to note that <code>seff</code> is only useful after a job has finished.  Using it with a job that is still running will not return an error message, but the CPU and memory usage will be shown as 0.00%, and a warning message will be appended to the bottom of the output that says "WARNING: Efficiency statistics may be misleading for RUNNING jobs.".
<big>'''Example <code>seff</code> output:'''</big>
<pre class="gcomment">
bc06026@b1-24 workdir$ seff 4707896
Job ID: 4707896
Cluster: tc2
User/Group: bc06026/gacrc-appadmin
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 00:09:25
CPU Efficiency: 93.08% of 00:10:07 core-walltime
Job Wall-clock time: 00:10:07
Memory Utilized: 183.43 MB
Memory Efficiency: 4.48% of 4.00 GB
</pre>
The key information shown in the output above is in the last six lines.  We see that the job was allocated 1 CPU core, and that core was doing something for 9 minutes and 25 seconds out of the job's total 10 minutes and 7 seconds of wall-clock (elapsed) time, i.e. 93.08% of the job's core-walltime.  Generally you want to aim for as much utilization of the resources you request as possible, so perhaps this job could run twice as fast given two cores, if the software/command(s) it used parallelized perfectly.  The memory utilized for the job in the output above is very low.  Four gigabytes of RAM is not much at all in the context of high performance computing, but say for example you requested 100 GB of memory and found that your job was only using ~5% of that memory.  In that case you would definitely want to consider lowering the amount of memory you request for that particular job if it is run again in the future.  For more information on job resource tuning, please see [[ Best Practices on Sapelo2 ]] and [[ Job Resource Tuning ]].  Please note that if your job ends abruptly because it ran out of memory, the <code>seff</code> memory utilization values may not reflect that, as a sudden spike in memory usage can kill the job before the spike is recorded in the accounting data that <code>seff</code> reports.
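For example, if <code>seff</code> showed that a job only ever used a few hundred megabytes of the 4 GB it requested (as above), you might lower the memory request the next time you submit it.  Below is a minimal, hypothetical set of <code>#SBATCH</code> directives illustrating the idea (the values are placeholders, not a recommendation for any particular application):
<pre class="gcomment">
#!/bin/bash
#SBATCH --job-name=testjob
#SBATCH --partition=batch
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
##SBATCH --mem=4G          # original request; seff showed only ~4.5% of it was used
#SBATCH --mem=1G           # reduced request, still leaving comfortable headroom
#SBATCH --time=1:00:00
</pre>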
----
[[#top|Back to Top]]
