Best Practices on Sapelo2

[[Category:Sapelo2]]
__TOC__


==Using the Cluster==


* '''Never''' load any software modules, or run any scientific software or script directly on the login nodes (sometimes called the "submit" nodes, as in submitting jobs). Only perform computational research within a submission script that you submit to the cluster with <code>sbatch</code> or an [[Running_Jobs_on_Sapelo2#How_to_open_an_interactive_session | interactive job]]. You will know that you are on a login node if your command prompt says MyID@ss-sub-''number''.
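
For reference, a minimal submission script might look something like the sketch below; the module name, partition, and resource amounts are placeholders that you should adjust for your own work.

<pre>
#!/bin/bash
#SBATCH --job-name=example          # job name (placeholder)
#SBATCH --partition=batch           # partition (queue) to submit the job to
#SBATCH --ntasks=1                  # run a single task
#SBATCH --cpus-per-task=4           # number of CPU cores for the task
#SBATCH --mem=8G                    # total memory for the job
#SBATCH --time=02:00:00             # walltime limit (hh:mm:ss)

module load Example/1.2.3           # hypothetical module name; load the full name and version you need

example_command input.txt > output.txt   # replace with your actual command
</pre>

Submitting it from a login node with, e.g., <code>sbatch sub.sh</code> runs the computation on a compute node rather than on the login node itself.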


* When loading software modules, specify the entire module name, including the version. As newer versions of software are added, the default version that gets loaded when a version is not specified can change, so it is better to be specific.
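
For example (the module name below is only an illustration; use <code>module avail</code> or <code>module spider</code> to see the versions that are actually installed):

<pre>
# list available versions of a package
module spider Python

# load a module by its full name and version, rather than relying on the current default
module load Python/3.10.4-GCCcore-11.3.0
</pre>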


* If your job fails, take a look at the job output files for information on why. Please see [[Troubleshooting on Sapelo2]] for more tips on job troubleshooting.


* Make sure you are using the transfer nodes (xfer.gacrc.uga.edu) for moving data to and from the cluster.
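
For example, from your local machine you could copy data through the transfer nodes with <code>scp</code> or <code>rsync</code>; the file names and the /scratch path below are placeholders.

<pre>
# copy a single file from your local machine to your scratch directory
scp mydata.tar.gz MyID@xfer.gacrc.uga.edu:/scratch/MyID/

# copy a whole directory; -a preserves permissions and timestamps, -P shows progress and allows resuming
rsync -aP my_project/ MyID@xfer.gacrc.uga.edu:/scratch/MyID/my_project/
</pre>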


* Don't connect to the remote.uga.edu VPN when transferring data to and from the cluster, as it has a bandwidth cap.  The transfer nodes do not require being connected to the VPN.


==Data Management==


* The files in your '''/home''' directory should be files that ''change'' as little as possible. These should be databases, applications that you use frequently but do not need to modify that often, and other things that you primarily '''read from'''.


* It is recommended to have jobs write their output files, including intermediate data such as checkpoint files, and final results into the '''/scratch''' file system. Final results should then be transferred out of the /scratch file system if they are not needed for other jobs that will be submitted soon. Any intermediate files and data not needed for upcoming jobs should be deleted from the /scratch area immediately.


* Files that are needed by many jobs, either continuously or at certain time intervals, possibly by multiple users within a group, such as reference data and model data, can be stored in the group's '''/work''' directory.


* Single-node jobs that do not need to write large output files, but that need to access files often (for example, to write small amounts of data to disk frequently), can benefit from using a compute node's local hard drive, '''/lscratch'''. Jobs that use /lscratch should request the amount of space they need in /lscratch.


* Data that are not needed for current jobs, but are still needed for future jobs on the cluster, should be transferred into the '''/project''' file system and deleted from the /scratch area. Please note that the /project directory is only accessible from the transfer nodes.


* Create unique subdirectories in your /scratch directory for different research projects or software being used, rather than having many unrelated files all in one directory.


* Keep in mind the [[Policies#Policy_Statement_for_SCRATCH_File_System | 30-Day Purge Policy]], as data removed or cleaned up from /scratch cannot be recovered.


* Data in /scratch and /work are not backed up in any way; we recommend that users back up important files themselves, to avoid loss of data if the storage device experiences serious issues.


* The recommended way to transfer data between file systems on GACRC resources, or between those and users' local machines, is via the [[Globus]] service. Globus is convenient to use from a browser, and transfers can also be automated with its command line interface.
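
If you want to script transfers, the Globus command line interface can be used along these lines; the endpoint UUIDs and paths below are placeholders, and the actual GACRC endpoint is described on the [[Globus]] page.

<pre>
# one-time authentication for the Globus CLI
globus login

# transfer a directory between two endpoints (UUIDs and paths are placeholders)
globus transfer --recursive \
    SOURCE_ENDPOINT_UUID:/scratch/MyID/results \
    DEST_ENDPOINT_UUID:/path/on/destination
</pre>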


* Make sure you are using the transfer nodes (xfer.gacrc.uga.edu) for moving data to and from the cluster. Information on the transfer nodes can be found on the [[Transferring Files]] page.


* Don't connect to the remote.uga.edu VPN when transferring data to and from the cluster, as it has a bandwidth cap. The transfer nodes do not require a VPN connection, even when you are accessing them from off campus.


* If you need to back up many files onto the /project filesystem, especially relatively small files, please first create a tarball (.tar file) with those files and then transfer the tarball to /project, to reduce the number of files in /project. The tar command to create a tarball of files in /work or /scratch can be run in a batch job or in an interactive job. Please do not run commands like <code>tar</code> and <code>gzip</code> on the Sapelo2 login nodes.
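
A sketch of that workflow is shown below; the directory names, including the group directory under /project, are hypothetical.

<pre>
# inside an interactive or batch job: bundle a directory of small files into one tarball
cd /scratch/MyID
tar -cvf results_2024.tar results_dir/

# then, from a transfer node, copy the single tarball into the group's /project directory
cp /scratch/MyID/results_2024.tar /project/mylab/
</pre>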


* If you have a directory with many files, and you would like to create a compressed tar (e.g. tar.gz or tar.bz2) file, it is often more efficient to create an uncompressed .tar file first, and then use <code>pigz</code>, a parallel <code>gzip</code> command, to compress the tar file (instead of using e.g. <code>tar -zcvf </code>). You can also use <code>pigz</code> to compress individual files before combining them into a tarball. The <code>pigz</code> command is multithread-enabled and we recommend requesting multiple CPU cores for the interactive or batch job used to run this command.
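
A sketch of that two-step approach (the thread count is just an example; match it to the number of cores requested for the job):

<pre>
# step 1: create an uncompressed tarball
tar -cvf data.tar data_dir/

# step 2: compress it in parallel with pigz, here using 8 threads; this produces data.tar.gz
pigz -p 8 data.tar
</pre>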
 
==Fine-tuning your Job’s Resources==


With Slurm, fine-tuning your job's resources is fairly simple with the <code>seff</code> command (the same information will be included in Slurm emails if you use the Slurm email headers). Running this command with a job ID as its argument will show you the requested resources as well as resource consumption and efficiency information. By referencing that information, you can adjust the requested resource amounts in your submission scripts to ensure you're requesting an optimal amount. Note that this command's output should not be relied on while a job is still running. A brief example is shown below, followed by some general guidelines to keep in mind when fine-tuning a job's resources.
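
For example, for a finished job you could run the commands below (the job ID is made up); in the <code>seff</code> output, the CPU Efficiency and Memory Efficiency lines are the most useful for tuning future requests.

<pre>
# summary of requested vs. used resources for a completed job
seff 12345678

# sacct shows related accounting data, such as elapsed time and peak memory use
sacct -j 12345678 --format=JobID,Elapsed,MaxRSS,State
</pre>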
 
* Read the documentation of your software as it may have resource recommendations.
* Read the documentation and help output (often retrieved by calling software with no arguments, or sometimes with <code>-h</code> or <code>--help</code>) of your software to verify whether the software you're using has any options for specifying the number of cores to be used. Typically, when software has an option like this, it will default to only one core if the option is not specified, regardless of how many cores have been allocated for your job. These options are often <code>-p</code> or <code>-t</code>, but will vary from software to software; see the sketch after this list for one way to pass the allocated core count to a program.
* Don't make any major inferences from one run of a job.  Record several instances of the time your job took to run relative to resources requested to get an idea of how well your job is performing.
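
As a sketch of the core-count idea mentioned above (the <code>-t</code> option and the program name are hypothetical; check your software's own documentation for the real option):

<pre>
# fragment of a submission script
#SBATCH --cpus-per-task=8

# pass the allocated core count to the program instead of hard-coding it;
# SLURM_CPUS_PER_TASK is set by Slurm when --cpus-per-task is specified
mytool -t ${SLURM_CPUS_PER_TASK} -i input.data -o results/
</pre>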


* As mentioned in [[Job Resource Tuning]], if you continue to add more cores or memory to your job, eventually you will hit a point of diminishing returns, meaning that requesting more of a particular resource will no longer provide much of a benefit, but will cause your job to take longer to start.
 
 
* The size of your input data and the number of threads you use in your job can have a direct effect on the total amount of memory required for the job. Generally, as your input data gets larger and the number of threads used increases, more memory will be required.


* If you don't specify a particular processor type with the <code>--constraint</code> Slurm header, the speed of a job could vary depending on whether it gets allocated to a newer or older processor type. For example, when submitting jobs to the batch partition, one job may be dispatched to a node with an Intel Broadwell processor, while another may end up on a node with the newer Intel Skylake processor. This could have a noticeable effect on the total walltime of your job. You may also encounter software that is optimized for different types of CPUs. For example, some software is optimized for Intel processors and thus would not perform as well if run on a node with an AMD processor. Users do have the ability to specify which types of nodes they want their job to run on beyond specifying only a partition, but please note that this can lead to a longer period of time waiting for your job to start.
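
For example (the feature name below is illustrative; use a command such as <code>sinfo -o "%n %f"</code>, or check the Sapelo2 documentation, to see which node features are actually defined on the cluster):

<pre>
# fragment of a submission script: request only nodes that advertise a particular processor feature
#SBATCH --constraint=Skylake
#SBATCH --partition=batch
#SBATCH --ntasks=1
</pre>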
