Monitoring Jobs on the teaching cluster

How to list all user jobs in the queues

To list all running and pending jobs (by all users), use the command

squeue

Sample squeue output:

            JOBID   PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               274       batch slurmenv  ab12345 PD       0:00      1 (Dependency)
               278 interactive     bash   shtsai  R       0:04      1 rb1-11
               277       batch  mpitest   shtsai  R       2:05      2 rb1-[9-10]
               276       batch slurmenv   shtsai  R       0:08      1 rb1-6
               273       batch slurmenv  ab12345  R       0:44      1 rb1-7

Output explained:

The column entitled JOBID gives the job id of each job.

The column entitled PARTITION gives the partition or queue where the job is running. A job in the interactive partition is an interactive session (e.g. one that was launched with the interact command).

The column entitled NAME gives the name of the job (specified in the job submission script with the --job-name or -J option).

The column entitled USER gives the username (MyID) of the user running the job.

The column entitled ST gives the status of a job, which could be

  • R : job is running
  • PD : job is pending, waiting for resources to become available or for a job dependency to be satisfied
  • CG : job is completing (it completed, was deleted, or crashed) and is no longer running

The column entitled TIME gives the walltime of the job (this is not the CPU time of the job).

The column entitled NODES specifies how many nodes are being used by the job.

The column entitled NODELIST(REASON) lists the hostnames of the nodes used by running jobs. For pending jobs, this column lists the reason the job is pending. For example, Dependency means the job is waiting for a job dependency to be satisfied.

The command

squeue -l

adds an extra column in the output for the TIME_LIMIT of the jobs.
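
You can also narrow the listing by combining standard squeue options; for example, the following sketch (assuming you only want to see pending jobs in the batch partition) uses the -t (--states) and -p (--partition) options:

squeue -t PENDING -p batch

Adjust the state and partition names to your needs.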

How to list only my jobs in the queues

To list all your running and pending jobs, use the command

squeue -u MyID

where MyID needs to be replaced by your own cluster username (UGA MyID).

Sample output (as run by user shtsai):

            JOBID   PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               278 interactive     bash   shtsai  R       0:04      1 rb1-11
               277       batch  mpitest   shtsai  R       2:05      2 rb1-[9-10]
               276       batch slurmenv   shtsai  R       0:08      1 rb1-6
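
If you want to keep an eye on your jobs while they run, one option is to wrap squeue in the standard Linux watch command, which reruns it at a fixed interval; for example, the following sketch refreshes the listing every 30 seconds (MyID is again a placeholder for your own username):

watch -n 30 squeue -u MyID

Press Ctrl-C to stop watching.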

How to determine on which node(s) jobs are running

To see which nodes were assigned to each of your jobs, use the command

squeue

The column entitled NODELIST lists the hostnames of the nodes allocated to the jobs.
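
If you prefer a more compact view of just the job-to-node mapping, squeue's -o (--format) option lets you choose which columns to print; for example, the following sketch uses the standard format specifiers %i (job id), %P (partition), %T (state), and %N (node list), with MyID a placeholder for your username:

squeue -u MyID -o "%.10i %.9P %.8T %N"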


How to see detailed information about a given job

To see a long listing with detailed information for a running job (STATE = R), including the initial working directory, the number of cores and memory requested, and the job submission and start times, use the command

scontrol show job JOBID

where JOBID should be replaced by the JOBID for the job you wish to check (the JOBID is given in the first column of the squeue output).

Sample output of a long job listing obtained with scontrol show job 11279:

JobId=11279 JobName=mpitest
   UserId=ab12345(10012) GroupId=abclab(1001) MCS_label=N/A
   Priority=1 Nice=0 Account=gacrc QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:02:24 TimeLimit=02:00:00 TimeMin=N/A
   SubmitTime=2018-08-14T12:19:23 EligibleTime=2018-08-14T12:19:23
   StartTime=2018-08-14T12:19:23 EndTime=2018-08-14T14:19:23 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2018-08-14T12:19:23
   Partition=batch AllocNode:Sid=c2-4:872
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=rb1-[7-8]
   BatchHost=rb1-7
   NumNodes=2 NumCPUs=24 NumTasks=24 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=24,mem=14400M,node=2,billing=24
   Socks/Node=* NtasksPerN:B:S:C=12:0:*:* CoreSpec=*
   MinCPUsNode=12 MinMemoryCPU=600M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/ab12345/tests/slurm/mpi/sub.sh
   WorkDir=/home/ab12345/tests/slurm/mpi
   StdErr=/home/ab12345/tests/slurm/mpi/mpitest.o11279
   StdIn=/dev/null
   StdOut=/home/ab12345/tests/slurm/mpi/mpitest.o11279
   Power=
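
If you only need a few of the fields shown above, you can filter the long listing with standard shell tools; for example, the following sketch uses grep to pick out the lines containing the job state, node list, and allocated resources (the field names are the ones that appear in the sample output):

scontrol show job JOBID | grep -E 'JobState|NodeList|TRES'

where JOBID should again be replaced by the job id you wish to check.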

This long listing is not available for jobs that are no longer in a running state.

Note that the long listing above does not include the resources utilized by the job.

To check how much memory was used, how much CPU time was used, the working directory, and other accounting information for your own jobs, whether they are still running or have already completed, use the command

sacct_zh

Sample output of an sacct_zh command:

       JobID    JobName      User  Partition        NodeList AllocNodes   NTasks      NCPUS     ReqMem  MaxVMSize      State    CPUTime    Elapsed  Timelimit ExitCode              WorkDir 
------------ ---------- --------- ---------- --------------- ---------- -------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- -------- -------------------- 
275            slurmenv    shtsai      batch           rb1-5          1                   2       10Gn             COMPLETED   00:00:00   00:00:00   02:00:00      0:0 /home/shtsai/tests/+ 
275.batch         batch                                rb1-5          1        1          2       10Gn    197240K  COMPLETED   00:00:00   00:00:00                 0:0                      
275.extern       extern                                rb1-5          1        1          2       10Gn    169800K  COMPLETED   00:00:00   00:00:00                 0:0                      
276            slurmenv    shtsai      batch           rb1-6          1                   1       10Gn            CANCELLED+   00:03:19   00:03:19   02:00:00      0:0 /home/shtsai/tests/+ 
276.batch         batch                                rb1-6          1        1          1       10Gn    221140K  CANCELLED   00:03:20   00:03:20                0:15                      
276.extern       extern                                rb1-6          1        1          1       10Gn    169800K  COMPLETED   00:03:19   00:03:19                 0:0                      
277             mpitest    shtsai      batch      rb1-[9-10]          2                  24      600Mc             COMPLETED   04:01:12   00:10:03   02:00:00      0:0 /home/shtsai/tests/+ 
277.batch         batch                                rb1-9          1        1         12      600Mc    221268K  COMPLETED   02:00:36   00:10:03                 0:0                      
277.extern       extern                           rb1-[9-10]          2        2         24      600Mc    169800K  COMPLETED   04:01:12   00:10:03                 0:0                      
277.0             orted                                rb1-10          1        1          1      600Mc    265640K  COMPLETED   00:00:01   00:00:01                 0:0                      
278                bash    shtsai     interq            rb1-4          1                   1        2Gn               RUNNING   00:13:37   00:13:37   12:00:00      0:0 /home/shtsai/tests/+ 
278.extern       extern                                 rb1-4          1        1          1        2Gn               RUNNING   00:13:37   00:13:37                 0:0                      
278.0              bash                                 rb1-4          1        1          1        2Gn               RUNNING   00:13:37   00:13:37                 0:0       

Note that each job will have several entries in this output, and these correspond to the different steps of the job. In general, non-MPI jobs that are still running will have two entries listed (jobid and jobid.extern), non-MPI jobs that completed or that were cancelled will have 3 entries (jobid, jobid.batch, jobid.extern), and MPI jobs will have an extra entry (jobid.0).
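
If you prefer a single line per job, without the individual step entries, the standard sacct command has a -X (--allocations) option that reports only the job allocation itself; for example (a minimal sketch; whether the sacct_zh wrapper accepts the same option is not covered here):

sacct -X -j JOBID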

To see all fields reported by the accounting command for a given job, use the command

sacct -l -j JOBID

where JOBID should be replaced by the JOBID for the job you wish to check (the JOBID is given in the first column of the squeue output).
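
Conversely, if the full -l listing is too wide, you can ask sacct for just the fields you care about with its --format option; for example, the following sketch uses standard sacct field names (MaxRSS is the maximum resident memory recorded for each completed step):

sacct -j JOBID --format=JobID,JobName,Partition,State,Elapsed,MaxRSS,ExitCode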




Monitoring Array Jobs

To view an array job, use the command

squeue -l

Sample output of the squeue -l command for an array job that has JOBID=81 and that has 10 elements:

Fri Aug  11 16:38:03 2023
             JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON)
              81_0     batch arrayjob   shtsai  RUNNING       0:03  10:00:00      1 rb1-7
              81_1     batch arrayjob   shtsai  RUNNING       0:03  10:00:00      1 rb1-7
              81_2     batch arrayjob   shtsai  RUNNING       0:03  10:00:00      1 rb1-7
              81_3     batch arrayjob   shtsai  RUNNING       0:03  10:00:00      1 rb1-7
              81_4     batch arrayjob   shtsai  RUNNING       0:03  10:00:00      1 rb1-7
              81_5     batch arrayjob   shtsai  RUNNING       0:03  10:00:00      1 rb1-7
              81_6     batch arrayjob   shtsai  RUNNING       0:03  10:00:00      1 rb1-7
              81_7     batch arrayjob   shtsai  RUNNING       0:03  10:00:00      1 rb1-7
              81_8     batch arrayjob   shtsai  RUNNING       0:03  10:00:00      1 rb1-7
              81_9     batch arrayjob   shtsai  RUNNING       0:03  10:00:00      1 rb1-7
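
Individual array tasks are identified by the JOBID_TASKID form shown in the first column (81_0, 81_1, and so on). To list only your own array tasks, one element per line, squeue's -r (--array) option can be combined with the -u option described earlier; for example (a sketch with MyID a placeholder for your username):

squeue -r -u MyID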

