Best Practices on Sapelo2

From Research Computing Center Wiki
Revision as of 14:34, 3 June 2020


Job Resource Tuning

Fine-tuning the amount of resources requested for a job can take some trial and error, because many variables must be considered. Here we look at what to consider when requesting computing resources for our jobs, and then outline general guidelines to help ensure we’re requesting the optimal amount of resources when submitting a job to an HPC cluster. We also discuss data points we can observe while a job is running and once it has finished, to better fine-tune our jobs’ resources.

When submitting a job to an HPC cluster, there are four main types of resources that we request. To make sure we’re requesting the optimal amount, first we need to make sure we have a proper understanding of each type of resource.

  • Node – a computer within a cluster.
  • Core – a unit of a processor that performs calculations.
  • Memory – the amount of random access memory (RAM) needed for a job.
  • Walltime limit – the maximum amount of real (wall-clock) time available to a job.

Factors to be Considered

Job Success or Failure

In some cases, the amount of resources requested can directly determine whether a job succeeds or fails. When considering this, we mainly need to look at memory and the walltime limit. If you request less memory than your data set actually needs, your job will likely fail. Similarly, once the requested walltime limit has been exceeded, all processes in your job will be stopped and the allocated resources will be released. The good news is that if you realize your job is going to need more time than it has left, you may submit a ticket to us at https://uga.teamdynamix.com/TDClient/2060/Portal/Requests/ServiceDet?ID=25844, and we can extend the walltime of an in-progress job.

The only scenario in which the number of nodes requested could have a direct impact on the success or failure of a job is one in which software was called in a way that takes advantage of multiple nodes but did not have them available. For example, a program might expect to pass data from one node to another when you have requested only one node for your job.

Requesting a suboptimal number of cores will often affect the success or failure of your job only indirectly: if you don’t request enough cores, your job may run for longer than your requested walltime. There are some cases, however, in which requesting too few cores could cause your job to fail. For example, if you have an R package that forms a “cluster” of four cores, it may need an extra core to orchestrate the workflow and manage those four cores.

Job Speed

For the actual walltime of your job, every type of resource except the walltime limit will have an effect. In most cases, requesting more than one node will not speed up your job. However, if you are using software that utilizes technology such as MPI or any inter-node communication, you could see some performance benefit from requesting multiple nodes.

Requesting more cores will typically have a noticeable effect on the speed of most jobs, up to a point. Nowadays, most software is able to take advantage of multiple cores; that is to say, it can process more than one thing at a time. Some software cannot, either because of the nature of the problem it is solving, or because it was not written to take advantage of multiple cores. An example of a problem that cannot be parallelized is computing the Fibonacci sequence. Because each number is the sum of the two before it, there is no way to do the computation other than adding one pair of numbers at a time (serially). On the other hand, many processes in software can take advantage of multiple cores, for example by dividing up the iterations of a loop so that they are performed across multiple cores at the same time. Depending on your software, you will eventually reach a point of diminishing returns. This means that if you were to request more and more cores, you would eventually reach a point where allocating more cores to your job no longer gives any speed benefit.
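The idea of diminishing returns can be illustrated with Amdahl's law, a standard idealized model that this page does not itself use. The sketch below assumes a hypothetical program in which 90% of the work can run in parallel; the function name and the 90% figure are illustrative assumptions, and real software will behave less cleanly.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Idealized speedup on `cores` cores when `parallel_fraction` of the work is parallel."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Hypothetical program in which 90% of the work parallelizes.
for cores in (1, 2, 4, 8, 16, 32):
    print(f"{cores:>2} cores: {amdahl_speedup(0.9, cores):.2f}x")
```

Under this model the speedup can never exceed 1 / (serial fraction), which is 10x here, no matter how many cores are requested.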

Requesting more memory can also affect the overall speed of your job, again up to a certain point. How much depends on the size of your data set, the number of objects created in your code, and how well the software cleans up objects that are no longer in use. If you request too little memory, less data can be held in memory for the CPU to access at any time, and your job will take longer as more time is spent waiting for data to become accessible. As with cores, you will eventually reach a point where you no longer gain any additional speed by requesting more memory.

Time to Job Start

One very common issue when submitting jobs to an HPC cluster is having to wait a long period of time for one’s job to start. Although the cluster has an extremely high amount of resources, they are shared resources and must be distributed equitably. There are very advanced queueing systems on HPC clusters that have the purpose of maximizing the efficiency of resources and getting everyone’s job started as soon as possible. All four of the main types of resources will have an effect on how soon our job can start. As a rule of thumb, the more resources you request for your job, the longer you will have to wait for it to start.

As mentioned earlier, most software will not take advantage of multiple nodes. Unnecessarily requesting more than one node will not cause your job to fail or slow it down, but it will take longer for the queueing software to get multiple nodes for your job to use. Likewise, with cores and memory, the main downside of requesting too much of these resources is having to wait longer for the queueing software to allocate those resources to your job.

While less intuitive than with the other resources, requesting more walltime can also delay the start of your job. The reason is a technique called “backfilling” that queueing software uses to allocate resources. This process looks at the latest possible end time of each pending job and attempts to allocate reserved resources to those jobs that will finish before the resources are needed by another pending job. For example, say job A is currently running on a node with 32 cores, is using 16 of them, and will finish two hours from now. A user submits job B, which needs 32 cores, so the queueing software reserves the cores on this node for job B, to be allocated once job A finishes in two hours. Immediately after job B enters the queue to wait for its resources, another user submits job C, which needs only 16 cores and a walltime of one hour. Because the queueing software knows that job C will finish before job A finishes and job B can start, it is able to allocate the 16 unused but reserved cores to job C.
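The backfilling decision in this example can be sketched as a simple check. This is a deliberately simplified illustration, not how any particular scheduler is implemented: the Job class, the can_backfill function, and job D are hypothetical, and real queueing software also weighs priorities, memory, and other factors.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    cores: int
    walltime_hours: float  # requested walltime limit


def can_backfill(job: Job, idle_cores: int, hours_until_reservation: float) -> bool:
    """True if the job fits on the idle cores and must end before they are needed."""
    return job.cores <= idle_cores and job.walltime_hours <= hours_until_reservation


# 32-core node: job A uses 16 cores for two more hours; the rest are reserved for job B.
idle_cores = 32 - 16
hours_until_job_b = 2.0

job_c = Job("C", cores=16, walltime_hours=1.0)
job_d = Job("D", cores=16, walltime_hours=336.0)  # hypothetical two-week request

print(can_backfill(job_c, idle_cores, hours_until_job_b))  # True
print(can_backfill(job_d, idle_cores, hours_until_job_b))  # False
```

Job D asks for the same number of cores as job C, but its long walltime request means it cannot be fitted into the two-hour gap.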

This is an example of how HPC queueing software attempts to “squeeze in” jobs to keep things moving as efficiently as possible. Many jobs are allocated resources in this way. If you submit a job that requests, say, weeks of walltime, it will be virtually impossible for your job to be backfilled, and it will have to wait until resources can be reserved for its maximum possible walltime.

Fine-tuning your Job’s Resources

Memory and Walltime

The easiest resources to fine-tune for your job are memory and walltime. When your job is complete, you can check the amount of memory used by entering the command qstat -f jobid (replacing jobid with the ID of your job retrieved from qstat_me or qstat -u yourMyID). Look for the value listed after "resources_used.vmem" (displayed in KB). If this number is significantly lower than the amount of memory you requested, you can try lowering the requested amount of memory, which will allow your job to start sooner on average.
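As a rough sketch, the memory check could be scripted as follows. The sample text is an illustrative stand-in for qstat -f output, and its exact layout is an assumption; the function name and the 16 GB request are hypothetical.

```python
import re

# Illustrative excerpt; the exact layout of real `qstat -f` output is assumed here.
SAMPLE_QSTAT_OUTPUT = """\
    resources_used.cput = 00:19:30
    resources_used.vmem = 3145728kb
    resources_used.walltime = 00:10:12
"""


def used_vmem_gb(qstat_text: str) -> float:
    """Return resources_used.vmem converted from KB to GB (1 GB = 1024 * 1024 KB)."""
    match = re.search(r"resources_used\.vmem\s*=\s*(\d+)\s*kb", qstat_text, re.IGNORECASE)
    if match is None:
        raise ValueError("resources_used.vmem not found")
    return int(match.group(1)) / (1024 * 1024)


requested_gb = 16  # hypothetical memory request
used_gb = used_vmem_gb(SAMPLE_QSTAT_OUTPUT)
print(f"used {used_gb:.1f} GB of {requested_gb} GB requested ({used_gb / requested_gb:.0%})")
```

A job that consistently uses a small fraction of its request across several runs is a candidate for a lower memory request.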

Similarly with walltime, you can check the current walltime of your job with the same qstat -f command, checking the value of "resources_used.walltime" (displayed in hours, minutes, seconds). If this is significantly less than the walltime you requested for your job, you can try lowering that value as well, which should also allow your job to start sooner on average. Keep in mind that the amount of memory and walltime that your job uses will change depending on the size of your data set.
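One possible way to turn an observed walltime into a new request is to add a safety margin, sketched below. The 25% margin and the function names are arbitrary assumptions for illustration, not a recommendation from this page; choose a margin that reflects how much your run times vary.

```python
import math


def hms_to_seconds(hms: str) -> int:
    """Convert an HH:MM:SS string to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def seconds_to_hms(total: int) -> str:
    """Convert seconds back to an HH:MM:SS string."""
    hours, rest = divmod(total, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


def suggested_walltime(used_hms: str, margin: float = 0.25) -> str:
    """Used walltime plus a safety margin (the 25% default is an arbitrary assumption)."""
    return seconds_to_hms(math.ceil(hms_to_seconds(used_hms) * (1 + margin)))


# A job that reported resources_used.walltime of two hours.
print(suggested_walltime("02:00:00"))  # 02:30:00
```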

One other important thing to note about the walltime of your job is that your results may vary depending on the CPU architecture of the node. For example, when submitting jobs to the batch cluster, one job may be dispatched to a node with an Intel Broadwell processor, while another may end up on a node with the newer Intel Skylake processor. This could have a noticeable effect on the total walltime of your job. You may also encounter software that is optimized for particular types of CPUs. For example, some software is optimized for Intel processors and thus would not perform as well if run on a node with an AMD processor. Users can specify which types of nodes they want their job to run on, beyond specifying only a queue, but please note that this can lead to a longer period of time waiting for your job to start.

Cores

Fine-tuning the number of cores requested for your job is a little more difficult than fine-tuning memory and the walltime limit. The best way to approach this is to consider how well your software parallelizes (how well it takes advantage of multiple cores). We do this by comparing the CPU time (the total CPU time of all cores combined) with the walltime of a job. We can find the CPU time of a job while it is in progress by looking at the qstat_me output (Time Use column), as well as the value of "resources_used.cput" in the output of qstat -f jobid.

Consider a hypothetical job assigned 1 core that finishes with a total walltime of 20 minutes. The same job is run again, but this time it is assigned 2 cores and finishes with a total walltime of 10 minutes. We could say that this job parallelizes perfectly. This means that the CPU time divided by the walltime is equal to the number of cores: here, 20 minutes of CPU time divided by 10 minutes of walltime gives 2. The lower the quotient relative to the number of cores, the worse the job is parallelizing.
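The comparison can be written as a small calculation. In this sketch the function name and the four-core run are hypothetical; the inputs correspond to the "resources_used.cput" and "resources_used.walltime" values, converted to seconds.

```python
def parallel_efficiency(cpu_seconds: float, wall_seconds: float, cores: int) -> float:
    """(CPU time / walltime) / cores; 1.0 means every core was busy for the whole run."""
    return (cpu_seconds / wall_seconds) / cores


# The two-core run described above: 20 minutes of CPU time in 10 minutes of walltime.
print(parallel_efficiency(20 * 60, 10 * 60, cores=2))  # 1.0

# Hypothetical four-core run of the same job that still needs 8 minutes of walltime.
print(parallel_efficiency(20 * 60, 8 * 60, cores=4))  # 0.625
```

A value well below 1.0 suggests that some of the requested cores are sitting idle for part of the run.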

As you take note of how well your job parallelizes as more cores are made available, keep in mind the previously mentioned concept of diminishing returns. Pay close attention to this potential drop-off in effectiveness, as you could be losing more time waiting for your job to start than you are gaining from the additional cores.

General Tips

  • Read the documentation of your software as it may have resource recommendations.
  • Read the documentation and help output of your software (often retrieved by calling the software with no arguments, or sometimes with -h or --help) to check whether it has any options for specifying the number of cores to be used. Typically, when software has an option like this, it will default to only one core if not specified, regardless of how many cores have been allocated for your job. These options are often -p or -t, but will vary from software to software.
  • Don't make any major inferences from one run of a job. Record the time your job took to run, relative to the resources requested, over several runs to get an idea of how well your job is performing.
  • Some, but not all, of the qstat -f jobid output is included in the email you receive at the end of your job if you define your email address after #PBS -M and add #PBS -m ae in your submission script.
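The tip about recording several runs can be followed with a simple summary like the one below. The walltimes are made-up illustrative numbers, and the function name is hypothetical.

```python
from statistics import mean, stdev

# Hypothetical walltimes in minutes, keyed by the number of cores requested.
runs = {
    1: [20.4, 19.8, 20.1],
    2: [10.6, 10.1, 10.9],
    4: [6.2, 5.9, 6.4],
}


def summarize(walltimes_by_cores):
    """Return {cores: (mean, sample standard deviation)} for each core count."""
    return {cores: (mean(times), stdev(times)) for cores, times in walltimes_by_cores.items()}


for cores, (avg, spread) in summarize(runs).items():
    print(f"{cores} cores: {avg:.1f} min (+/- {spread:.1f})")
```

Comparing averages over several runs, rather than single runs, makes it easier to tell a real speedup from ordinary run-to-run variation.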


Data Workflow Management

Home directories have a per-user quota and have snapshots. Snapshots are like backups in that they are read-only, moment-in-time captures of files and directories that can be used to restore files that may have been accidentally deleted or overwritten. If files are created and deleted frequently, the snapshots will grow and might end up using a lot of space in the overall home file system.

The recommended data workflow is to have files in the home directory change as little as possible. These should be databases, applications that you use frequently but rarely need to modify, and other things that you primarily read from. Think of snapshots as the memory of the files that were stored there: whether you add, change, or delete files, the sum of that activity builds up over time and may exceed your quota.

The recommended data workflow will have jobs write output files, including intermediate data, such as checkpoint files, and final results into the /scratch file system. Final results should then be transferred out of the /scratch file system, if these are not needed for other jobs that are being submitted soon. Any intermediate files and data not needed for other jobs to be submitted soon should be deleted immediately from the /scratch area.

Files that are needed for many jobs continuously or at some time intervals, possibly by multiple users within a group, such as reference data and model data, can be stored in the group's /work directory.

Single-node jobs that do not need to write large output files, but that need to access files often (for example, to write small amounts of data to disk), can benefit from using a compute node's local hard drive, /lscratch. Jobs that use /lscratch should request the amount of /lscratch space they need.

The recommended data workflow is to transfer data that are not needed for current jobs, but are still needed for future jobs on the cluster, into the project file system and delete them from the /scratch area.