Changes implemented during November 16-22, 2018 maintenance
Sapelo decommissioned
The original Sapelo cluster (sapelo1.gacrc.uga.edu) has been fully decommissioned, and the nodes that were originally part of Sapelo have been migrated into Sapelo2.
File transfer nodes (xfer) require Archpass Duo
The file transfer nodes (xfer) now require two-factor authentication with Archpass Duo. When connecting to an xfer node from off campus, please first connect to the UGA VPN and then connect to xfer.gacrc.uga.edu. Documentation on the FileZilla and WinSCP configuration changes needed to accommodate Archpass Duo is available at Transferring Files.
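As an illustration, a command-line transfer through an xfer node might look like the sketch below; the username (MyID) and file name are placeholders, and you will be prompted for your MyID password and then, typically, for an Archpass Duo verification.

    # From off campus, connect to the UGA VPN first.
    # Copy a local file to your home directory via a file transfer node;
    # "MyID" and "mydata.fastq" are placeholder names.
    scp mydata.fastq MyID@xfer.gacrc.uga.edu:~/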
New hardware added into Sapelo2
In addition to the nodes that were originally part of Sapelo, we have added the following new hardware into Sapelo2:
- 42 compute nodes with Intel Xeon Skylake processors (32 cores and 187GB of RAM per node)
- 4 compute nodes with Intel Xeon Skylake processors (32 cores and 187GB of RAM) and 1 NVIDIA P100 GPU card per node
- DDN SFA14KX Lustre appliance (1.26PB), serving /scratch and /work areas
- An EDR Infiniband network that provides inter-node communication among a subset of compute nodes and between these nodes and the storage devices serving the users' home directories and the scratch and work directories. The subset of nodes connected to the EDR IB fabric comprises the new Intel Skylake nodes and the Intel Broadwell nodes that were already part of Sapelo2.
Changes in the storage systems on Sapelo2
/lustre1 decommissioned
The Seagate (Xyratex) ClusterStor1500 Lustre appliance that served /lustre1 has been decommissioned. The new scratch file system is called /scratch and resides in the new DDN SFA14KX Lustre appliance. All files that were in /lustre1 at the start of the maintenance window (5PM Nov. 16th, 2018) were copied into the new /scratch file system.
Jobs that were already submitted from a /lustre1 directory (i.e. jobs that were pending during the maintenance) will not be affected by the decommissioning of /lustre1, as a link is provided that points /lustre1 to /scratch.
Any new jobs should access the scratch file system via /scratch, not /lustre1. The /lustre1 link will be removed once the jobs that were submitted before the maintenance window have completed.
Please do not submit any new jobs from /lustre1
If your job submission scripts or other programs have the /lustre1 path hardcoded in them, please update these files to replace /lustre1 with /scratch.
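For example, a hardcoded /lustre1 path in a job submission script can be updated in place with a one-line edit; the script name sub.sh below is just a placeholder.

    # Replace every occurrence of /lustre1 with /scratch in a job script
    # (sub.sh is a placeholder name); -i.bak keeps a backup copy of the original.
    sed -i.bak 's|/lustre1|/scratch|g' sub.sh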
Some applications experienced issues on Sapelo2 when using /lustre1, because file locking was not enabled on /lustre1. These issues should not occur when using /scratch, where file locking is enabled.
/scratch file system and file retention policy
Coming soon: "30-day purge" policy for files in /scratch
/work file system
Each group now has a /work directory (e.g. /work/abclab for the abclab group), with a per-group quota of 500GB and a limit of 100,000 files. This directory can be used to store input files that are frequently needed for repeated jobs, possibly by multiple users within a group, such as reference data and model data. For more information, please see Disk Storage.
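As a sketch of the intended workflow (the group name abclab, the application, and all file names below are hypothetical), a job can read static reference data from the group's /work area while running from, and writing its output to, a directory under /scratch:

    #PBS -S /bin/bash
    #PBS -q batch
    #PBS -l nodes=1:ppn=1
    #PBS -l walltime=2:00:00

    # Submit the job from a directory under /scratch and run it there.
    cd $PBS_O_WORKDIR

    # Hypothetical application and file names: reference data is read from the
    # group's /work area, and results stay in the /scratch working directory.
    ./myprogram --reference /work/abclab/refdata/genome.fa --out results.txt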
Please also see Best Practices on Sapelo2 for some suggestions on data management workflow.
Software Upgrade
The following software updates were performed:
- The compute nodes' operating system was upgraded from 64-bit Linux CentOS 7.1 to 64-bit Linux CentOS 7.5. Code compiled on the older OS version should continue to work (no need to recompile); a quick way to check the OS version on a node is shown after this list.
- The queueing system was updated, from Torque 6.1.1 to 6.1.3 and from Moab 9.1.2 to 9.1.3, to address some bugs in the older versions.
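If you want to confirm which OS release a node is running (for example, from an interactive session on that node), the standard CentOS release file can be checked:

    # Print the operating system release of the current node;
    # after the upgrade this should report a CentOS 7.5 release.
    cat /etc/redhat-release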
How parallel jobs using MPI are affected
Because the Intel nodes on Sapelo2 are now connected via a new EDR Infiniband network, some changes to MPI jobs will be necessary.
- Applications that use OpenMPI will continue to work on both AMD and Intel nodes. However, applications compiled with older versions of OpenMPI (e.g. v. 1.10.3 in the foss/2016b toolchain) will display a warning at runtime, similar to this:
[n249.ecompute:220879] 3 more processes have sent help message help-mpi-btl-openib.txt / no device params found
[n249.ecompute:220879] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
or a warning similar to this:
--------------------------------------------------------------------------
WARNING: No preset parameters were found for the device that Open MPI
detected:

  Local host:            n275
  Device name:           mlx5_0
  Device vendor ID:      0x02c9
  Device vendor part ID: 4119

Default device parameters will be used, which may result in lower
performance. You can edit any of the files specified by the
btl_openib_device_param_files MCA parameter to set values for your device.

NOTE: You can turn off this warning by setting the MCA parameter
btl_openib_warn_no_device_params_found to 0.
--------------------------------------------------------------------------
These warnings are harmless and the application should continue to work. A sketch of how to suppress them is included after this list.
- Applications that use newer versions of OpenMPI will not display this warning at runtime.
- Applications compiled with MVAPICH2 will continue to work on the AMD nodes, but they will not work on the Intel nodes (EDR IB network). If you use MVAPICH2 to compile your code and want to run your jobs on the Intel nodes that are part of the "batch" queue, please recompile your code using an MVAPICH2 module with an -EDR suffix (a recompilation sketch is included after this list). The following MVAPICH2 modules are available for the EDR IB fabric:
MVAPICH2/2.3-GCC-5.4.0-2.26-EDR
MVAPICH2/2.3-GCC-6.4.0-2.28-EDR
MVAPICH2/2.3-iccifort-2013_sp1.0.080-EDR
MVAPICH2/2.3-iccifort-2015.2.164-GCC-4.8.5-EDR
MVAPICH2/2.3-iccifort-2018.1.163-GCC-6.4.0-2.28-EDR
Note that code compiled with an MVAPICH2 module that has an -EDR suffix will not work on the AMD nodes (QDR IB network). Please continue to use the MVAPICH2 modules without an -EDR suffix if you want to run jobs on the AMD nodes.
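As a sketch of recompiling an MPI code for the EDR (Intel) nodes, assuming a C source file named mycode.c and picking one of the -EDR modules listed above that matches your compiler toolchain:

    # Load an MVAPICH2 module built for the EDR IB fabric and rebuild the code;
    # mycode.c and myprogram are placeholder names.
    module load MVAPICH2/2.3-GCC-6.4.0-2.28-EDR
    mpicc -O2 -o myprogram mycode.c
    # Submit the resulting binary to the "batch" queue to run on the Intel nodes.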
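Separately, if the harmless OpenMPI warnings shown earlier clutter your job output, they can be silenced by setting the MCA parameters named in the messages; this is a minimal sketch, and ./myprogram is a placeholder for your application.

    # Silence the "no device params found" warning from older OpenMPI versions
    # by setting the MCA parameter mentioned in the warning text.
    export OMPI_MCA_btl_openib_warn_no_device_params_found=0

    # The same parameter can also be passed on the mpirun command line:
    mpirun --mca btl_openib_warn_no_device_params_found 0 ./myprogram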