Job Resource Tuning

From Research Computing Center Wiki

Finding the right amount of resources to request for a job can take some trial and error, because many variables must be considered. Here we look at what to consider when requesting computing resources for our jobs, and then outline general guidelines to help ensure we request an appropriate amount of resources when submitting a job to an HPC cluster. We will also discuss data points we can observe while a job is running and once it’s finished to better fine-tune our jobs’ resources.

When submitting a job to an HPC cluster, there are four main types of resources that we request. To make sure we’re requesting the optimal amount, first we need to make sure we have a proper understanding of each type of resource.

  • Node – an individual computer within a cluster.
  • Core – the unit of a processor that performs calculations.
  • Memory – the amount of random access memory (RAM) needed for a job.
  • Walltime limit – the maximum amount of real (wall-clock) time available to a job.
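As a rough illustration, the four resource types above can be written out as batch script directives. The sketch below assumes the cluster uses Slurm (suggested by the `--cpus-per-task` option mentioned later on this page); the helper name `sbatch_header` and the example values are hypothetical, so check your own cluster's documentation before using them.

```python
def sbatch_header(nodes, cpus_per_task, mem_gb, walltime):
    """Render the four main resource requests as Slurm #SBATCH lines.

    Assumes a Slurm cluster. walltime uses Slurm's HH:MM:SS form.
    """
    return "\n".join([
        "#!/bin/bash",
        f"#SBATCH --nodes={nodes}",                  # Node: computers in the cluster
        f"#SBATCH --cpus-per-task={cpus_per_task}",  # Core: cores for the task
        f"#SBATCH --mem={mem_gb}G",                  # Memory: RAM per node
        f"#SBATCH --time={walltime}",                # Walltime limit
    ])


# Hypothetical request: 1 node, 4 cores, 8 GB of RAM, 2 hours of walltime.
print(sbatch_header(nodes=1, cpus_per_task=4, mem_gb=8, walltime="02:00:00"))
```

The printed lines would go at the top of a batch script, followed by the commands the job should run.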

Job Success or Failure

In some cases, the amount of resources requested directly determines whether a job succeeds or fails. Here we mainly need to look at memory and the walltime limit. If you request less memory than your job actually needs, your job will likely fail. Similarly, once the requested walltime limit is reached, every process in your job will be stopped and the allocated resources will be released. The good news is that if you realize your job is going to need more time than it has left, you may submit a ticket to us, and we can extend the walltime of an in-progress job.

The number of nodes requested has a direct impact on the success or failure of a job in only one scenario: the software was called in a way that expects multiple nodes, and they were not available. For example, a program that expects to pass data from one node to another may fail if you requested only one node for your job.

Requesting a suboptimal number of cores will often affect the success or failure of your job only indirectly: with too few cores, your job may run for longer than your requested walltime. There are some cases, however, in which requesting too few cores could cause your job to fail outright. For example, if you have an R package that forms a “cluster” of 4 cores, it may need an extra core to orchestrate the workflow and manage those four cores.

Job Speed

Every type of resource except the walltime limit affects the speed of your job. In most cases, requesting more than one node will not speed up your job. However, if you are using software that relies on MPI or another form of inter-node communication, you could see some performance benefit from requesting multiple nodes.

Requesting more cores will typically have a noticeable effect on the speed of most jobs, to a point. Nowadays, most software is able to take advantage of multiple cores. That is to say, it can process more than one thing at a time. Not all software can do this, either because of the nature of the problem the software is solving, or because the software was not written to use multiple cores.

An example of a problem that cannot readily be parallelized is a program that computes the Fibonacci sequence term by term. Each term depends on the two terms before it, so the additions must be performed one after another (serially). On the other hand, dividing up the independent iterations of a loop is an example of work that could potentially be performed across multiple cores at the same time.

Keep in mind that a lot of software has its own command line parameter to specify the number of processes or threads to use. This will often be a -p or a -t, but consult the documentation for the software you're using. If your software has such a parameter, pass it the same value that you gave to --cpus-per-task, so that all of the cores you requested for your job are actually used.

Depending on your software, eventually you will reach a point of diminishing returns. This means that if you keep requesting more and more cores, you will eventually reach a point where allocating additional cores to your job no longer makes it any faster.
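Two of the points above can be sketched in Python. The first helper reads the core count that Slurm exposes in the `SLURM_CPUS_PER_TASK` environment variable (set when `--cpus-per-task` is given) and passes it to a tool's thread option. The tool name `my_tool`, the file `input.dat`, and the use of `-t` are placeholders, not a real program. The second helper uses Amdahl's law, a standard model that is not part of the original text, to show diminishing returns; the 90% parallel fraction is an assumed figure for illustration only.

```python
import os


def allocated_cores(default=1):
    """Return the cores Slurm allocated per task, or a default outside Slurm."""
    value = os.environ.get("SLURM_CPUS_PER_TASK")
    return int(value) if value else default


def amdahl_speedup(parallel_fraction, cores):
    """Estimate speedup under Amdahl's law.

    parallel_fraction is the share of the runtime that can use multiple
    cores; the remainder always runs serially.
    """
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Pass the allocated core count to the tool's own thread option.
# "my_tool", "-t" and "input.dat" are hypothetical placeholders.
cores = allocated_cores()
command = ["my_tool", "-t", str(cores), "input.dat"]
print(" ".join(command))

# Assumed workload: 90% of the runtime parallelizes. The estimated
# speedup flattens out as the core count grows.
for n in (1, 2, 4, 8, 16, 32, 64):
    print(f"{n:>2} cores: estimated speedup {amdahl_speedup(0.9, n):.2f}x")
```

Under this model the speedup can never exceed the reciprocal of the serial fraction, which is one way to reason about where extra cores stop paying off.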

Requesting more memory can also affect the overall speed of your job, again up to a certain point. How much it helps depends on the size of your data set, the number of objects created in your code, and how well the software cleans up objects that are no longer in use. If you request too little memory, less of your data can be held in memory at once, and your job will take longer because more time is spent waiting for data to be loaded before the CPU can access it. As with cores, eventually you will reach a point where you no longer gain any additional speed by requesting more memory.

Time to Job Start

One very common issue when submitting jobs to an HPC cluster is having to wait a long time for one’s job to start. Although the cluster has a very large pool of resources, they are shared resources and must be distributed equitably. HPC clusters use advanced queueing systems whose purpose is to maximize the efficient use of resources and get everyone’s job started as soon as possible. All four of the main types of resources affect how soon our job can start. As a rule of thumb, the more resources you request for your job, the longer you will have to wait for it to start.

As mentioned earlier, most software will not take advantage of multiple nodes. Unnecessarily requesting more than one node will not cause your job to fail or slow it down, but it will take longer for the queueing software to get multiple nodes for your job to use. Likewise, with cores and memory, the main downside of requesting too much of these resources is having to wait longer for the queueing software to allocate those resources to your job.

While less intuitive than with the other resources, requesting more walltime will also delay the start of your job. The reason is a technique that queueing software uses to allocate resources, called “backfilling.” This process looks at the latest possible end time of each pending job and allocates reserved resources to those jobs that will finish before the resources are needed for another pending job. For example, say job A is currently running on a node with 32 cores. It is using 16 of those cores and will finish two hours from now. A user submits job B, which needs 32 cores, so the queueing software reserves the cores on this node for job B, to be allocated once job A finishes in two hours. Immediately after job B enters the queue to wait for its resources, another user submits job C, which needs only 16 cores and a walltime of one hour. Because the queueing software knows that job C will finish before job A finishes and job B can start, it is able to allocate the 16 unused but reserved cores to job C.

This is an example of how HPC queueing software attempts to “squeeze in” jobs, so to speak, to keep things moving as efficiently as possible. Many jobs are allocated resources in this way. If you submit a job that requests, say, weeks of walltime, it will be virtually impossible for your job to be backfilled. It will instead have to wait until resources can be reserved for the full requested walltime of your job.
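The job A, B, and C example can be sketched as a small check in Python. This is a simplified model of the backfilling decision for a single node, not how any particular queueing system is implemented; the `Job` class, the `can_backfill` function, and the week-long variant of job C are all hypothetical names and values introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    cores: int
    walltime_hours: float  # requested walltime limit


def can_backfill(job, free_cores, hours_until_reservation):
    """Return True if the job fits in the idle cores and is guaranteed
    (by its walltime limit) to end before the reserved job must start."""
    fits_in_cores = job.cores <= free_cores
    finishes_in_time = job.walltime_hours <= hours_until_reservation
    return fits_in_cores and finishes_in_time


NODE_CORES = 32
JOB_A_CORES = 16              # job A is running on 16 of the 32 cores
HOURS_UNTIL_A_FINISHES = 2.0  # job B's 32-core reservation starts then

free_cores = NODE_CORES - JOB_A_CORES

job_c = Job("C", cores=16, walltime_hours=1.0)
# Hypothetical variant of job C that requests a week of walltime.
job_c_long = Job("C-long", cores=16, walltime_hours=7 * 24)

for job in (job_c, job_c_long):
    result = can_backfill(job, free_cores, HOURS_UNTIL_A_FINISHES)
    print(f"job {job.name}: backfill possible = {result}")
```

Note that the check uses the requested walltime, not how long the job actually runs. In this model, a job that would really finish in an hour but requests a week can never be squeezed into the two-hour gap.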