Best Practices on Sapelo2: Difference between revisions
Revision as of 13:37, 28 October 2020
Job Resource Tuning
Fine-tuning the appropriate amount of resources requested for a job can take some trial and error. There are many variables that must be considered. Here we will look at what should be considered when requesting computing resources for our jobs, and then outline general guidelines to help ensure we’re requesting the optimal amount of resources when submitting a job to an HPC cluster. We will also discuss data points we can observe while a job is running and once it’s finished to better fine-tune our jobs’ resources.
When submitting a job to an HPC cluster, there are four main types of resources that we request. To make sure we’re requesting the optimal amount, first we need to make sure we have a proper understanding of each type of resource.
- Node – computer within a cluster.
- Core – unit of a processor that performs calculations.
- Memory – the amount of random access memory (RAM) needed for a job.
- Walltime limit – the maximum amount of real (wall-clock) time a job is allowed to run.
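As a rough sketch, the four resource types map onto Slurm submission headers as shown below. The job name, partition name, and all of the amounts are placeholder assumptions, not recommendations; the options themselves (--nodes, --cpus-per-task, --mem, --time) are standard sbatch options.

```shell
#!/bin/bash
#SBATCH --job-name=example        # hypothetical job name
#SBATCH --partition=batch         # partition name is an assumption
#SBATCH --nodes=1                 # nodes: computers within the cluster
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4         # cores
#SBATCH --mem=8G                  # memory (RAM) per node
#SBATCH --time=02:00:00           # walltime limit (HH:MM:SS)

# Slurm sets these variables inside a job. The defaults only apply when
# the script is run outside Slurm, e.g. for a quick syntax check.
nodes="${SLURM_JOB_NUM_NODES:-1}"
cores="${SLURM_CPUS_PER_TASK:-1}"
summary="nodes=$nodes cores_per_task=$cores"
echo "$summary"
```

Because the #SBATCH lines are ordinary comments to the shell, the script can be run directly to check it for syntax errors before it is submitted with sbatch.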
Factors to be Considered
Job Success or Failure
In some cases, the amount of resources requested can have a direct effect on whether or not a job succeeds or fails. When considering this, we mainly need to look at memory and the walltime limit. If you request a lower amount of memory than your data set actually needs, your job will likely fail. Similarly, with the walltime limit, any process in your job will be stopped and the allocated resources will be released once the requested walltime limit has been exceeded. The good news is that if you realize your job is going to need more time than it has left, you may submit a ticket to us here, and we can extend the walltime of an in-progress job.
The number of nodes requested has a direct impact on the success or failure of a job only when software is called in a way that takes advantage of multiple nodes and those nodes are not available. For example, a program may expect to pass data from one node to another when only one node was requested for the job.
Requesting a suboptimal number of cores will often affect the success or failure of your job only indirectly: if you don’t request enough cores, your job may run for longer than your requested walltime. There are some cases, however, in which requesting too few cores could cause your job to fail. For example, if you have an R package that forms a “cluster” of four cores, it may need an extra core to orchestrate the workflow and manage those four cores.
Job Speed
For the speed of your job, every type of resource except the walltime limit will have an effect. In most cases, requesting more than one node will not speed up your job. However, if you are using software that utilizes technology such as MPI or any inter-node communication, you could see some performance benefit from requesting multiple nodes.
Requesting more cores will typically have a noticeable effect on the speed of most jobs, to a point. Nowadays, most software is able to take advantage of multiple cores. That is to say, it can do or process more than one thing at a time. Not all software can do this, either because of the nature of the problem the software is solving, or because the software was not written to take advantage of multiple cores.
An example of a problem that cannot be parallelized would be a program that computes the Fibonacci sequence. Each term depends on the terms before it, so the additions must be performed one at a time (serially). On the other hand, dividing up the iterations of a loop is an example of something that could potentially be performed across multiple cores at the same time.
Keep in mind that a lot of software has its own command line parameter to specify the number of processes or threads that will be used. This will often be a -p or a -t, but consult the documentation for the software you're using. If your software has such a parameter, pass it the same value that you gave to --cpus-per-task, to make sure all of the cores you requested for your job are used.
Depending on your software, eventually you will reach a point of diminishing returns. This means that if you request more and more cores, you will eventually stop receiving any speed benefit from allocating more cores to your job.
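One way to keep the two values in step is to read the core count from the environment instead of typing it twice. The sketch below assumes the job was submitted with --cpus-per-task, since Slurm only sets SLURM_CPUS_PER_TASK in that case. The program name my_program, its -t option, and input.dat are hypothetical stand-ins for your own software.

```shell
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8

# thread_count: print the cores Slurm allocated per task, or 1 outside a job.
thread_count() {
    echo "${SLURM_CPUS_PER_TASK:-1}"
}

threads=$(thread_count)

# my_program and its -t option are hypothetical; check your software's
# documentation for the real name of its core or thread option.
echo "would run: my_program -t $threads input.dat"
```

With this pattern, changing the --cpus-per-task header is enough; the value passed to the software follows automatically.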
Requesting more memory can also have an effect on the overall speed of your job, also to a certain point. This can be a factor of the size of your data set, the number of objects created in your code, as well as how well the software cleans up objects that are no longer in use. If you request too little memory, meaning that more data could have potentially been loaded in memory for the CPU to access at any time, your job will take longer as more time is spent waiting for data in memory to become accessible. Like cores, eventually you will reach a point where you no longer gain any additional speed by requesting more memory.
Time to Job Start
One very common issue when submitting jobs to an HPC cluster is having to wait a long period of time for one’s job to start. Although the cluster has an extremely high amount of resources, they are shared resources and must be distributed equitably. There are very advanced queueing systems on HPC clusters that have the purpose of maximizing the efficiency of resources and getting everyone’s job started as soon as possible. All four of the main types of resources will have an effect on how soon our job can start. As a rule of thumb, the more resources you request for your job, the longer you will have to wait for it to start.
As mentioned earlier, most software will not take advantage of multiple nodes. Unnecessarily requesting more than one node will not cause your job to fail or slow it down, but it will take longer for the queueing software to get multiple nodes for your job to use. Likewise, with cores and memory, the main downside of requesting too much of these resources is having to wait longer for the queueing software to allocate those resources to your job.
While not as intuitive as with the other resources, requesting more walltime can also delay the start of your job. The reason is a technique the queueing software uses to allocate resources, called “backfilling.” Backfilling looks at the latest possible end time of each pending job and allocates reserved but idle resources to jobs that will finish before those resources are needed by another pending job. For example, say job A is currently running on a node with 32 cores, is using 16 of them, and will finish two hours from now. A user submits job B, which needs 32 cores, so the queueing software reserves the cores on this node for job B, to be allocated once job A finishes in two hours. Immediately after job B enters the queue to wait for its resources, another user submits job C, which needs only 16 cores and a walltime of one hour. Because the queueing software knows that job C would finish before job A finishes and job B can start, it is able to allocate the 16 unused but reserved cores to job C.
This is an example of how HPC queueing software attempts to “squeeze in” jobs to keep things moving as efficiently as possible. Many jobs are allocated resources in this way. If you submit a job that requests, say, weeks of walltime, it will be virtually impossible for the job to be backfilled; it will have to wait until resources can be reserved for its maximum possible walltime.
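The backfilling decision in the job A, B, and C example can be sketched as a simple check. This is a deliberately simplified illustration, not Slurm's actual scheduling algorithm; the function name can_backfill and its arguments are made up for this example, and times are in minutes.

```shell
# can_backfill JOB_CORES JOB_MINUTES FREE_CORES MINUTES_UNTIL_RESERVED
# Succeeds (exit status 0) when the job fits in the idle cores and its
# requested walltime ends before the reserved cores are needed.
can_backfill() {
    [ "$1" -le "$3" ] && [ "$2" -le "$4" ]
}

# Job A frees its cores in 120 minutes; 16 of the node's 32 cores are idle
# but reserved for job B.
free_cores=16
minutes_until_reserved=120

# Job C: 16 cores and one hour of walltime.
if can_backfill 16 60 "$free_cores" "$minutes_until_reserved"; then
    echo "job C: can be backfilled"
fi

# Same cores but a three-hour walltime request: it could still be running
# when job B is due to start, so it has to wait.
if ! can_backfill 16 180 "$free_cores" "$minutes_until_reserved"; then
    echo "3-hour job: must wait"
fi
```

The check uses the requested walltime, not the time the job would actually take, which is why an inflated walltime request can keep a short job from being backfilled.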
Fine-tuning your Job’s Resources
With Slurm, fine-tuning your job's resources is fairly simple with the seff command (the same information will be in Slurm emails if you use the Slurm email headers). Running this command with a job ID as its argument will show you the requested resource information as well as resource consumption and efficiency information. By referencing that information, you'll be able to adjust the requested resource amounts in your submission scripts to ensure you're using an optimal amount. Here are some general guidelines to keep in mind when fine-tuning a job's resources:
- Read the documentation of your software as it may have resource recommendations.
- Read the documentation and help output (often retrieved by calling software with no arguments, or sometimes with -h or --help) of your software to verify if the software you're using has any options for specifying the number of cores to be used. Typically when software has an option like this, it will default to only one core if not specified, regardless of how many cores have been allocated for your job. These options are often -p or -t, but will vary from software to software.
- Don't make any major inferences from one run of a job. Record several instances of the time your job took to run relative to resources requested to get an idea of how well your job is performing.
- As mentioned earlier, if you continue to add more cores or memory to your job, eventually you will hit a point of diminishing returns, meaning that requesting more of a particular resource will no longer provide much of a benefit, but will cause your job to take longer to start.
- If you don't specify a particular processor type with the --constraint Slurm header, the speed of a job could vary depending on whether it gets allocated to a newer or older processor type. For example, when submitting jobs to the batch cluster, one job may be dispatched to a node with an Intel Broadwell processor, while another may end up on a node with the newer Intel Skylake processor. This could have a noticeable effect on the total walltime of your job. You may also encounter software that is optimized for particular types of CPUs. For example, some software is optimized for Intel processors and thus would not perform as well if run on a node with an AMD processor. Users do have the ability to specify which types of nodes they want to run their job on beyond specifying only a partition, but please note that this can lead to a longer wait for your job to start.
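The headers and commands mentioned in this section might be combined as in the sketch below. The email address and job ID are placeholders, and the feature name Skylake is an assumption: the names accepted by --constraint are defined by each site. The helper report_job is a hypothetical wrapper that simply calls seff when it is installed.

```shell
#!/bin/bash
#SBATCH --mail-type=END,FAIL            # email when the job ends or fails
#SBATCH --mail-user=myid@example.edu    # placeholder address
#SBATCH --constraint=Skylake            # assumed feature name; site-defined

# report_job JOBID
# Prints requested vs. used resources for a finished job when seff exists.
report_job() {
    if command -v seff >/dev/null 2>&1; then
        seff "$1"
    else
        echo "seff not found; run this on the cluster"
    fi
}

# Usage on the cluster, with a placeholder job ID:
#   report_job 12345678
# List the feature names defined on each node:
#   sinfo -o "%N %f"
```

Recording the seff output for several runs of the same job, alongside the resources requested for each run, gives the comparison the guidelines above call for.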
Data Workflow Management
Home directories have a per-user quota and have snapshots. Snapshots are like backups in that they are read-only moment-in-time captures of files and directories which can be used to restore files that may have been accidentally deleted or overwritten. If files are created and deleted frequently, the snapshots will grow and might end up using a lot of space in the overall home file system.
The recommended data workflow is to have files in the home directory *change* as little as possible. These should be databases, applications that you use frequently but do not need to modify that often and other things that you, primarily, read from. Think of snapshots as the memory of the files that were stored there - no matter if you add, change or delete the files, the total sum of that activity will build up over time and may exceed your quota.
The recommended data workflow will have jobs write output files, including intermediate data, such as checkpoint files, and final results into the /scratch file system. Final results should then be transferred out of the /scratch file system, if these are not needed for other jobs that are being submitted soon. Any intermediate files and data not needed for other jobs to be submitted soon should be deleted immediately from the /scratch area.
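The end-of-job cleanup described above could look something like the following sketch. The function name finish_job, the final_* and checkpoint_* file naming, and the example paths are all hypothetical; adapt them to how your own job names its files.

```shell
# finish_job WORKDIR DEST
# Copies final results out of the job's working directory, then deletes
# intermediate checkpoint files. Stops at the first step that fails, so
# checkpoints are only removed after the results were copied.
finish_job() {
    mkdir -p "$2" &&
    cp "$1"/final_* "$2"/ &&
    rm -f "$1"/checkpoint_*
}

# Usage with placeholder paths:
#   finish_job /scratch/MyID/myjob /path/to/destination
```

The results are copied rather than moved, so the originals stay in the working directory until you have confirmed the transfer and delete them yourself.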
Files that are needed for many jobs continuously or at some time intervals, possibly by multiple users within a group, such as reference data and model data, can be stored in the group's /work directory.
Single-node jobs that do not need to write large output files, but that need to access files often (for example, to write small amounts of data to disk), can benefit from using a compute node's local hard drive, /lscratch. Jobs that use /lscratch should request the amount of /lscratch space they need.
The recommended data workflow is to have data not needed for current jobs, but that are still needed for future jobs on the cluster, be transferred into the project file system and deleted from the /scratch area.