Best Practices on Sapelo2

From Research Computing Center Wiki



Revision as of 13:45, 2 July 2021

Using the Cluster

  • Never load software modules or run scientific software or scripts directly on the login nodes (sometimes called the "submit" nodes, because jobs are submitted from them). Perform computational work only inside a submission script that you submit to the cluster with sbatch, or inside an interactive job. You are on a login node if your command prompt reads MyID@ss-sub-number.
  • When loading software modules, specify the entire module name, including the version. As newer versions of software are added, the default version that gets loaded when a version is not specified can change, so it is better to be specific.
  • If your job fails, take a look at the job output files for information on why. Please see Troubleshooting on Sapelo2 for more tips on job troubleshooting.
  • Make sure you are using the transfer nodes (xfer.gacrc.uga.edu) for moving data to and from the cluster.
  • Don't connect to the remote.uga.edu VPN when transferring data to and from the cluster, as it has a bandwidth cap. The transfer nodes do not require being connected to the VPN.
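
The login-node rule above can be backed by a small guard at the top of a script. This is a minimal sketch, assuming that login node hostnames begin with ss-sub (inferred from the MyID@ss-sub-number prompt mentioned above); the function name is_login_node is hypothetical and not part of any cluster tooling.

```shell
#!/bin/bash
# Hypothetical helper: succeeds when a hostname looks like a login node.
# Assumption: login node hostnames start with "ss-sub", inferred from the
# MyID@ss-sub-number prompt; adjust the pattern if your prompt differs.
is_login_node() {
    local host="${1:-${HOSTNAME:-}}"
    case "$host" in
        ss-sub*) return 0 ;;
        *)       return 1 ;;
    esac
}

# Guard for the top of an analysis script: warn before doing heavy work.
if is_login_node; then
    echo "Login node detected: submit this work with sbatch instead." >&2
fi
```

Called without an argument, the helper checks the current host; passing a hostname explicitly makes it easy to try the pattern out.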

Data Management

  • Files in your /home directory should *change* as little as possible: databases, applications that you use frequently but rarely modify, and other things that you primarily read from.
  • Have jobs write output files, including intermediate data such as checkpoint files, and final results to the /scratch file system. Transfer final results out of /scratch unless they are needed for other jobs that will be submitted soon, and immediately delete from /scratch any intermediate files and data that such jobs do not need.
  • Files that are needed for many jobs continuously or at some time intervals, possibly by multiple users within a group, such as reference data and model data, can be stored in the group's /work directory.
  • Single-node jobs that do not need to write large output files, but that need to access files often (for example, to write small amounts of data to disk), can benefit from using a compute node's local hard drive, /lscratch. Jobs that use /lscratch should request the amount of /lscratch space they need.
  • Transfer data that is not needed for current jobs, but is still needed for future jobs on the cluster, into the /project file system and delete it from the /scratch area. Please note that the /project directory is only accessible from the transfer nodes.
  • Create unique subdirectories in your /scratch directory for different research projects or software being used, rather than having many unrelated files all in one directory.
  • Keep in mind the 30-Day Purge Policy, as data removed or cleaned up from /scratch cannot be recovered.
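
To review what should be moved or deleted before a purge, it can help to list files that have not been touched recently. This is a minimal sketch, assuming that modification time is a reasonable proxy for what the purge looks at (the exact rule is defined by the 30-Day Purge Policy, not by this example); the function name list_stale_files, the 25-day default, and the example path are all hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print files under a directory that have not been
# modified for more than a given number of days.
# Assumption: modification time is a reasonable proxy for what the purge
# policy looks at; check the 30-Day Purge Policy page for the exact rule.
list_stale_files() {
    local dir="$1"
    local days="${2:-25}"   # default threshold chosen for illustration
    # find counts age in whole days, so +25 matches files at least 26 days old.
    find "$dir" -type f -mtime +"$days" -print
}

# Usage (the path is a placeholder):
#   list_stale_files /scratch/MyID/project1 25
```

Running this from a transfer node or inside a job, rather than on a login node, keeps it in line with the guidelines above.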

Fine-tuning your Job’s Resources

With Slurm, fine-tuning your job's resources is fairly simple with the seff command (the same information appears in Slurm emails if you use the Slurm email headers). Running seff with a job ID as its argument shows the requested resources as well as resource consumption and efficiency information. With that information, you can adjust the resource requests in your submission scripts so that you ask only for what the job needs. Note that seff output should not be relied on while a job is still running. Here are some general guidelines to keep in mind when fine-tuning a job's resources:

  • Read the documentation of your software as it may have resource recommendations.
  • Read the documentation and help output of your software (often shown by calling the software with no arguments, or sometimes with -h or --help) to check whether it has an option for specifying the number of cores to use. Software with such an option typically defaults to a single core if the option is not given, regardless of how many cores have been allocated to your job. These options are often -p or -t, but vary from software to software.
  • Don't make any major inferences from one run of a job. Record several instances of the time your job took to run relative to resources requested to get an idea of how well your job is performing.
  • As mentioned in Job Resource Tuning, if you continue to add more cores or memory to your job, you will eventually hit a point of diminishing returns: requesting more of a particular resource will no longer provide much benefit, but will cause your job to take longer to start.
  • If you don't specify a particular processor type with the --constraint Slurm header, the speed of a job can vary depending on whether it is allocated a newer or older processor type. For example, when submitting jobs to the batch partition, one job may be dispatched to a node with an Intel Broadwell processor, while another may end up on a node with the newer Intel Skylake processor. This can have a noticeable effect on the total walltime of your job. You may also encounter software that is optimized for particular types of CPUs. For example, some software is optimized for Intel processors and would not perform as well if run on a node with an AMD processor. Users can specify which types of nodes they want their job to run on, beyond specifying only a partition, but please note that this can lead to a longer wait for your job to start.
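
The adjustment step described in this section can be sketched as simple arithmetic: take the peak memory that seff reports for a finished job and add some headroom. This is a minimal sketch, not a site rule; the function name suggest_mem_mb and the 20% default headroom are assumptions chosen for illustration, and the peak value has to be read from the seff output by hand.

```shell
#!/bin/bash
# Hypothetical helper: suggest a memory request (in MB) from the peak memory
# a finished job used (in MB) plus a percentage of headroom.
# Assumption: 20% headroom is an illustrative default, not a site rule.
suggest_mem_mb() {
    local peak_mb="$1"
    local headroom_pct="${2:-20}"
    # Integer arithmetic, rounded up to the next whole MB.
    echo $(( (peak_mb * (100 + headroom_pct) + 99) / 100 ))
}

# Example: a job that peaked at 1500 MB.
echo "#SBATCH --mem=$(suggest_mem_mb 1500)M"   # prints: #SBATCH --mem=1800M
```

As noted above, base the peak value on several runs rather than a single one before lowering a request.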