Changes implemented during November 16-22, 2018 maintenance

Sapelo decommissioned

The original Sapelo cluster (sapelo1.gacrc.uga.edu) is now fully decommissioned, and all of its nodes have been migrated into Sapelo2.

New hardware added into Sapelo2

In addition to the nodes that were originally part of Sapelo, we have added the following new hardware into Sapelo2:

  • 44 compute nodes with Intel Xeon Skylake processors (32 cores and 187GB of RAM per node)
  • 4 compute nodes with Intel Xeon Skylake processors (32 cores and 187GB of RAM) and 1 NVIDIA P100 GPU card per node
  • DDN SFA14KX Lustre appliance (1.26PB), serving /scratch and /work areas
  • An EDR InfiniBand network that provides inter-node communication among a subset of compute nodes, and between those nodes and the storage devices serving the users' home directories and the scratch and work directories. The nodes connected to the EDR IB fabric are the new Intel Skylake nodes and the Intel Broadwell nodes that were already on Sapelo2.

Changes in the storage systems on Sapelo2

/lustre1 decommissioned

The Seagate (Xyratex) ClusterStor1500 Lustre appliance that served /lustre1 has been decommissioned. The new scratch file system is called /scratch and resides in the new DDN SFA14KX Lustre appliance. All files that were in /lustre1 at the start of the maintenance window (5PM Nov. 16th, 2018) were copied into the new /scratch file system.

Jobs that were submitted from a /lustre1 directory before the maintenance (e.g., jobs that were still pending during the maintenance window) will not be affected by the decommissioning of /lustre1, because a link is provided that points /lustre1 to /scratch.

Any new jobs that use the scratch file system should reference /scratch, not /lustre1. The /lustre1 link will be removed once the jobs that were submitted before the maintenance have completed.
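
If you would like to confirm that the link is in place, resolving the path from a login node is enough. Below is a minimal Python sketch of such a check; it assumes /lustre1 is still present as a symbolic link, as described above.

    import os

    # Illustrative check only: resolve /lustre1 and confirm it points
    # into the new /scratch file system.
    target = os.path.realpath("/lustre1")
    print("/lustre1 resolves to:", target)

    if target == "/scratch" or target.startswith("/scratch/"):
        print("OK: /lustre1 is a link into /scratch")
    else:
        print("Warning: /lustre1 does not resolve to /scratch")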

Please do not submit any new jobs from /lustre1.

If your job submission scripts or other programs have the /lustre1 path hardcoded in them, please edit these files to replace /lustre1 with /scratch.
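
For users with many scripts, the substitution can be automated. The following Python sketch rewrites job scripts in place; the directory name and *.sh pattern are assumptions, so adjust them to match your own files and review the changes before resubmitting any jobs.

    import pathlib

    # Hypothetical helper: replace the decommissioned /lustre1 path
    # with /scratch in every *.sh job script under one directory.
    script_dir = pathlib.Path.home() / "jobscripts"  # placeholder path

    for script in script_dir.glob("*.sh"):
        text = script.read_text()
        if "/lustre1" in text:
            script.write_text(text.replace("/lustre1", "/scratch"))
            print("Updated", script)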

Some applications experienced issues on Sapelo2 when using /lustre1, because file locking was not enabled on /lustre1. These issues should not occur when using /scratch, which does have file locking enabled.
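
As a concrete illustration, applications that take POSIX advisory locks, as in the short Python sketch below, depend on this feature. The path is a placeholder; use a file in your own /scratch area.

    import fcntl

    # Minimal sketch of POSIX advisory locking, the feature that is
    # enabled on /scratch. The path below is hypothetical.
    with open("/scratch/abclab/lock.test", "w") as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # acquire an exclusive lock
        f.write("locked region\n")     # do work while holding the lock
        fcntl.flock(f, fcntl.LOCK_UN)  # release the lock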

/scratch file system and file retention policy

Coming soon: "30-day purge" policy for files in /scratch
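
Until the policy details are announced, users may want to identify old files ahead of time. The Python sketch below lists files not modified in the last 30 days; the directory is a placeholder, and the 30-day modification-time criterion is an assumption based on the policy's name.

    import os
    import time

    # List files under a /scratch directory that have not been
    # modified in the last 30 days ("/scratch/abclab" is a placeholder).
    cutoff = time.time() - 30 * 24 * 3600

    for dirpath, _, filenames in os.walk("/scratch/abclab"):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.path.getmtime(path) < cutoff:
                    print(path)
            except OSError:
                pass  # file vanished or is unreadable; skip it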

/work file system

Each group now has a /work directory (e.g., for the abclab group, this directory is called /work/abclab) with a per-group quota of 500GB and 100,000 files. This space can be used to store input files that are needed for repeated jobs, possibly by multiple users within a group, such as reference data and model data.
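
To estimate how close a group is to the quota, one can walk the directory and tally sizes and file counts, as in the Python sketch below. Here /work/abclab mirrors the example group name above, and the file server's own quota accounting may differ slightly from this simple walk.

    import os

    # Tally usage in a group's /work directory against the
    # 500GB / 100,000-file quota ("/work/abclab" is a placeholder).
    total_bytes = 0
    total_files = 0
    for dirpath, _, filenames in os.walk("/work/abclab"):
        for name in filenames:
            try:
                total_bytes += os.path.getsize(os.path.join(dirpath, name))
                total_files += 1
            except OSError:
                pass  # skip files we cannot stat

    print(f"{total_files} files, {total_bytes / 1e9:.1f} GB used "
          "(quota: 100,000 files, 500GB)")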

For more information about the storage systems, see https://wiki.gacrc.uga.edu/wiki/Disk_Storage