Monitoring Jobs on the teaching cluster
Revision as of 20:49, 5 September 2023
How to list all user jobs in the queues
To list all running and pending jobs (by all users), use the command
squeue
Sample squeue output:
JOBID PARTITION   NAME     USER    ST TIME NODES NODELIST(REASON)
274   batch       slurmenv ab12345 PD 0:00 1     (Dependency)
278   interactive bash     shtsai  R  0:04 1     rb1-4
277   batch       mpitest  shtsai  R  2:05 2     rb1-[11-12]
276   batch       slurmenv shtsai  R  0:08 1     rb1-6
273   batch       slurmenv ab12345 R  0:44 1     rb1-7
Output explained:
The column entitled JOBID gives the job id of each job.
The column entitled PARTITION gives the partition or queue where the job is running. A job in the interactive partition is an interactive session (e.g. one that was launched with the interact command).
The column entitled NAME gives the name of the job (specified in the job submission script with the --job-name or -J option).
The column entitled USER gives the username (MyID) of the user running the job.
The column entitled ST gives the status of a job, which could be
- R : job is running
- PD : job is pending, waiting for resources to become available or for a job dependency to be satisfied
- CG : job is completing (it finished, was deleted, or crashed) and is being cleaned up; it is no longer running.
The column entitled TIME gives the walltime of the job (this is not the CPU time of the job).
The column entitled NODES specifies how many nodes are being used by the job.
The column entitled NODELIST(REASON) lists the hostnames of the nodes used by running jobs. For pending jobs, this column lists the reason the job is pending. For example, Dependency means the job is waiting for a job dependency to be satisfied.
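The column layout described above can be read programmatically. The following is a minimal Python sketch, not part of the cluster tooling, that splits the sample squeue text into one record per job. The names SAMPLE, STATE_NAMES and parse_squeue are made up for this illustration, and the sketch assumes that job names contain no spaces.

```python
# Sample text copied from the squeue output shown on this page.
SAMPLE = """\
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
274 batch slurmenv ab12345 PD 0:00 1 (Dependency)
278 interactive bash shtsai R 0:04 1 rb1-4
277 batch mpitest shtsai R 2:05 2 rb1-[11-12]
276 batch slurmenv shtsai R 0:08 1 rb1-6
273 batch slurmenv ab12345 R 0:44 1 rb1-7
"""

# Meanings of the ST codes described on this page.
STATE_NAMES = {"R": "running", "PD": "pending", "CG": "completing"}


def parse_squeue(text):
    """Turn default-format squeue output into a list of dicts keyed by column name."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    jobs = []
    for line in lines[1:]:
        # Split at most len(header) - 1 times so that the last column,
        # NODELIST(REASON), keeps any spaces it may contain.
        fields = line.split(None, len(header) - 1)
        jobs.append(dict(zip(header, fields)))
    return jobs


jobs = parse_squeue(SAMPLE)
for job in jobs:
    state = STATE_NAMES.get(job["ST"], job["ST"])
    print(job["JOBID"], job["USER"], state, job["NODELIST(REASON)"])
```

Each record is a plain dictionary, so a pending job can be recognised by job["ST"] == "PD" and its reason read from the NODELIST(REASON) key.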
The command
squeue -l
adds an extra column in the output for the TIME_LIMIT of the jobs.
How to list only my jobs in the queues
To list all your running and pending jobs, use the command
squeue -u MyID
where MyID needs to be replaced by your own cluster username (UGA MyID).
Sample output (as run by user shtsai):
JOBID PARTITION   NAME     USER   ST TIME NODES NODELIST(REASON)
278   interactive bash     shtsai R  0:04 1     rb1-4
277   batch       mpitest  shtsai R  2:05 2     rb1-[11-12]
276   batch       slurmenv shtsai R  0:08 1     rb1-6
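If you call squeue from a script, the username can be filled in automatically instead of being typed. This is a hedged Python sketch: the function names squeue_command and list_my_jobs are hypothetical, and it assumes that your login name on the cluster is the same as your MyID. On a machine without Slurm, list_my_jobs simply returns None.

```python
import getpass
import shutil
import subprocess


def squeue_command(user):
    """Build the argument list for 'squeue -u USER'."""
    return ["squeue", "-u", user]


def list_my_jobs(user=None):
    """Return squeue's text output for one user, or None if squeue is not installed."""
    if user is None:
        # Assumption: the login name equals the cluster username (MyID).
        user = getpass.getuser()
    if shutil.which("squeue") is None:
        return None  # not on a machine with Slurm installed
    result = subprocess.run(
        squeue_command(user), capture_output=True, text=True, check=True
    )
    return result.stdout


# Show the command that would be run for the user in the sample above.
print(" ".join(squeue_command("shtsai")))
```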
How to determine on which node(s) jobs are running
To see which nodes were assigned to each of your jobs, use the command
squeue
The column entitled NODELIST(REASON) lists the hostnames of the nodes allocated to the jobs.
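Multi-node jobs are shown in a compressed form such as rb1-[11-12], which stands for the hosts rb1-11 and rb1-12. Slurm itself can expand such an expression with scontrol show hostnames. The Python sketch below illustrates the idea for simple cases only; expand_nodelist is a hypothetical helper, and it assumes at most one bracket group per name, containing comma-separated numbers or ranges.

```python
import re


def expand_nodelist(nodelist):
    """Expand a simple node list such as 'rb1-[11-12]' into individual hostnames.

    Simplified sketch: one bracket group per name, with comma-separated
    numbers or ranges inside it. It is not a full hostlist parser.
    """
    hosts = []
    # Split on commas that are outside of square brackets.
    for part in re.split(r",(?![^\[]*\])", nodelist):
        match = re.fullmatch(r"(.*?)\[([^\]]+)\](.*)", part)
        if match is None:
            hosts.append(part)  # plain hostname, e.g. 'rb1-4'
            continue
        prefix, body, suffix = match.groups()
        for item in body.split(","):
            if "-" in item:
                start, end = item.split("-")
                width = len(start)  # keep zero padding, e.g. 01-03
                for n in range(int(start), int(end) + 1):
                    hosts.append(f"{prefix}{n:0{width}d}{suffix}")
            else:
                hosts.append(f"{prefix}{item}{suffix}")
    return hosts


print(expand_nodelist("rb1-[11-12]"))
```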
How to see detailed information about a given job
To see a long listing with detailed information for a running job (ST = R), including the initial working directory, the number of cores and memory requested, and the job submission and start times, use the command
scontrol show job JOBID
where JOBID should be replaced by the JOBID for the job you wish to check (the JOBID is given in the first column of the squeue output).
Sample output of the long job listing produced by scontrol show job 11279:
JobId=11279 JobName=mpitest
   UserId=ab12345(10012) GroupId=abclab(1001) MCS_label=N/A
   Priority=1 Nice=0 Account=gacrc QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:02:24 TimeLimit=02:00:00 TimeMin=N/A
   SubmitTime=2018-08-14T12:19:23 EligibleTime=2018-08-14T12:19:23
   StartTime=2018-08-14T12:19:23 EndTime=2018-08-14T14:19:23 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2018-08-14T12:19:23
   Partition=batch AllocNode:Sid=c2-4:872
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=rb1-[11-12]
   BatchHost=rb1-11
   NumNodes=2 NumCPUs=24 NumTasks=24 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=24,mem=14400M,node=2,billing=24
   Socks/Node=* NtasksPerN:B:S:C=12:0:*:* CoreSpec=*
   MinCPUsNode=12 MinMemoryCPU=600M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/ab12345/tests/slurm/mpi/sub.sh
   WorkDir=/home/ab12345/tests/slurm/mpi
   StdErr=/home/ab12345/tests/slurm/mpi/mpitest.o11279
   StdIn=/dev/null
   StdOut=/home/shtsai/tests/slurm/mpi/mpitest.o11279
   Power=
The long listing is no longer available once a job has left the running state.
Note that the long listing above does not include the resources utilized by the job.
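The long listing is a sequence of Key=Value pairs, so individual fields are easy to pick out in a script. The Python sketch below parses an excerpt of the sample listing; parse_scontrol_job and SAMPLE are names invented for this illustration, and the sketch assumes that no value contains a space.

```python
# Excerpt of the 'scontrol show job 11279' sample from this page.
SAMPLE = """\
JobId=11279 JobName=mpitest
   JobState=RUNNING Reason=None Dependency=(null)
   RunTime=00:02:24 TimeLimit=02:00:00 TimeMin=N/A
   NodeList=rb1-[11-12]
   NumNodes=2 NumCPUs=24 NumTasks=24 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=24,mem=14400M,node=2,billing=24
   WorkDir=/home/ab12345/tests/slurm/mpi
   Power=
"""


def parse_scontrol_job(text):
    """Parse 'scontrol show job' text into a dict of field name -> value."""
    fields = {}
    for token in text.split():
        # Split on the first '=' only, because values such as TRES contain '='.
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


job = parse_scontrol_job(SAMPLE)
print(job["JobState"], job["NodeList"], job["NumCPUs"], job["TRES"])
```

Fields with an empty value, such as Power= in the sample, are stored as empty strings.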
To check how much memory and CPU time your own jobs used, their working directory, and other details, for jobs that are still running or have already completed, use the command
sacct_zh
Sample output of an sacct_zh command:
JobID        JobName    User      Partition  NodeList        AllocNodes NTasks   NCPUS      ReqMem     MaxVMSize  State      CPUTime    Elapsed    Timelimit  ExitCode WorkDir
------------ ---------- --------- ---------- --------------- ---------- -------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- -------- --------------------
275          slurmenv   shtsai    batch      rb1-5           1                   2          10Gn                  COMPLETED  00:00:00   00:00:00   02:00:00   0:0      /home/shtsai/tests/+
275.batch    batch                           rb1-5           1          1        2          10Gn       197240K    COMPLETED  00:00:00   00:00:00              0:0
275.extern   extern                          rb1-5           1          1        2          10Gn       169800K    COMPLETED  00:00:00   00:00:00              0:0
276          slurmenv   shtsai    batch      rb1-6           1                   1          10Gn                  CANCELLED+ 00:03:19   00:03:19   02:00:00   0:0      /home/shtsai/tests/+
276.batch    batch                           rb1-6           1          1        1          10Gn       221140K    CANCELLED  00:03:20   00:03:20              0:15
276.extern   extern                          rb1-6           1          1        1          10Gn       169800K    COMPLETED  00:03:19   00:03:19              0:0
277          mpitest    shtsai    batch      rb1-[11-12]     2                   24         600Mc                 COMPLETED  04:01:12   00:10:03   02:00:00   0:0      /home/shtsai/tests/+
277.batch    batch                           rb1-11          1          1        12         600Mc      221268K    COMPLETED  02:00:36   00:10:03              0:0
277.extern   extern                          rb1-[11-12]     2          2        24         600Mc      169800K    COMPLETED  04:01:12   00:10:03              0:0
277.0        orted                           rb1-12          1          1        1          600Mc      265640K    COMPLETED  00:00:01   00:00:01              0:0
278          bash       shtsai    interq     rb1-4           1                   1          2Gn                   RUNNING    00:13:37   00:13:37   12:00:00   0:0      /home/shtsai/tests/+
278.extern   extern                          rb1-4           1          1        1          2Gn                   RUNNING    00:13:37   00:13:37              0:0
278.0        bash                            rb1-4           1          1        1          2Gn                   RUNNING    00:13:37   00:13:37              0:0
Note that each job will have several entries in this output, and these correspond to the different steps of the job. In general, non-MPI jobs that are still running will have two entries listed (jobid and jobid.extern), non-MPI jobs that completed or that were cancelled will have 3 entries (jobid, jobid.batch, jobid.extern), and MPI jobs will have an extra entry (jobid.0).
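The relationship between a job and its steps is visible in the JobID column: the part before the dot is the job, the part after it names the step. The following Python sketch groups the JobID values from the sacct_zh sample on this page; group_steps and JOB_IDS are names chosen for this illustration only.

```python
# JobID column copied from the sacct_zh sample on this page.
JOB_IDS = [
    "275", "275.batch", "275.extern",
    "276", "276.batch", "276.extern",
    "277", "277.batch", "277.extern", "277.0",
    "278", "278.extern", "278.0",
]


def group_steps(job_ids):
    """Map each job id to the list of its step names (the part after the dot)."""
    steps = {}
    for entry in job_ids:
        job_id, _, step = entry.partition(".")
        # Register the job even when the entry is the job line itself.
        step_list = steps.setdefault(job_id, [])
        if step:
            step_list.append(step)
    return steps


for job_id, step_names in group_steps(JOB_IDS).items():
    print(job_id, step_names)
```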
To see all fields reported by the accounting command for a given job, use the command
sacct -l -j JOBID
where JOBID should be replaced by the JOBID for the job you wish to check (the JOBID is given in the first column of the squeue output).
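When reading the accounting output, it helps to know how the time columns relate. In the sacct_zh sample on this page, the CPUTime value of each line equals Elapsed multiplied by NCPUS, so it reflects the core time allocated to the job or step. The Python sketch below checks this for three lines of the sample; to_seconds, to_hms and ROWS are illustrative names, and the helpers only handle the HH:MM:SS form seen in the sample.

```python
def to_seconds(hms):
    """Convert an HH:MM:SS string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total):
    """Convert a number of seconds back to an HH:MM:SS string."""
    hours, rest = divmod(total, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


# (JobID, NCPUS, Elapsed, CPUTime) copied from the sacct_zh sample.
ROWS = [
    ("276", 1, "00:03:19", "00:03:19"),
    ("277", 24, "00:10:03", "04:01:12"),
    ("277.batch", 12, "00:10:03", "02:00:36"),
]

for job_id, ncpus, elapsed, cputime in ROWS:
    expected = to_hms(to_seconds(elapsed) * ncpus)
    print(job_id, expected, expected == cputime)
```

For job 277, for example, 10 minutes 3 seconds on 24 cores gives 14472 core-seconds, which is the 04:01:12 shown in the CPUTime column.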
Monitoring Array Jobs
To view an array job, use the command
squeue -l
Sample output of the squeue -l command for an array job that has JOBID=81 and that has 10 elements:
Fri Aug 11 16:38:03 2023
JOBID PARTITION NAME     USER   STATE   TIME TIME_LIMI NODES NODELIST(REASON)
81_0  batch     arrayjob shtsai RUNNING 0:03 10:00:00  1     rb1-7
81_1  batch     arrayjob shtsai RUNNING 0:03 10:00:00  1     rb1-7
81_2  batch     arrayjob shtsai RUNNING 0:03 10:00:00  1     rb1-7
81_3  batch     arrayjob shtsai RUNNING 0:03 10:00:00  1     rb1-7
81_4  batch     arrayjob shtsai RUNNING 0:03 10:00:00  1     rb1-7
81_5  batch     arrayjob shtsai RUNNING 0:03 10:00:00  1     rb1-7
81_6  batch     arrayjob shtsai RUNNING 0:03 10:00:00  1     rb1-7
81_7  batch     arrayjob shtsai RUNNING 0:03 10:00:00  1     rb1-7
81_8  batch     arrayjob shtsai RUNNING 0:03 10:00:00  1     rb1-7
81_9  batch     arrayjob shtsai RUNNING 0:03 10:00:00  1     rb1-7
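Each element of an array job appears with a JOBID of the form ARRAYJOBID_INDEX, for example 81_3. The Python sketch below splits such identifiers and collects the element indices per array job. The helper names split_array_id and group_by_array are hypothetical, and identifiers in any other form are returned unparsed.

```python
from collections import defaultdict

# JOBID column of the squeue -l sample for array job 81 (elements 0 to 9).
ARRAY_IDS = [f"81_{i}" for i in range(10)]


def split_array_id(job_id):
    """Split 'ARRAYJOBID_INDEX' into (array job id, element index)."""
    base, sep, index = job_id.partition("_")
    if not sep:
        return job_id, None  # ordinary job, not an array element
    if index.isdigit():
        return base, int(index)  # single element such as 81_3
    return base, index  # any other form is returned as text


def group_by_array(job_ids):
    """Collect the element indices seen for each array job id."""
    groups = defaultdict(list)
    for job_id in job_ids:
        base, index = split_array_id(job_id)
        if index is not None:
            groups[base].append(index)
    return dict(groups)


print(group_by_array(ARRAY_IDS))
```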