The Sapelo2 Project



Overview – “Sapelo” to “Sapelo2” Migration Project

Rationale

With the rapid rise in the number of users on the Sapelo cluster, we have noted a corresponding rise in disruptive issues that appear to be inherent to the design of the cluster management stack currently running Sapelo. We propose replacing the legacy deployment with a more flexible and robust set of components.

Goals

- Eliminate the problems experienced with Sapelo’s cluster management stack. While the vendor has addressed some issues, others are the result of design choices that the vendor is unable to address.

- Provide a user environment that matches the current user experience on Sapelo, enabling a transparent migration from one platform to the other.

- Considerably increase resiliency to faults.


Proposed Approach

Sapelo2 is essentially a transformation of Sapelo. The compute layer, storage devices, and network fabrics (InfiniBand and Ethernet) will remain physically in place. A new management layer will be built on hardware currently installed in the GACRC’s development environment. Login and queueing/scheduling services will be provided by physical servers rather than by lightweight virtual servers.

Once all management components are deployed, tested, and integrated, we will create a first version of the new cluster by integrating an initial set of compute nodes.

All relevant filesystems will be mounted per Sapelo standards. As a parallel effort by the GACRC support team, an initial set of scientific software, compilers, and libraries will be installed on a separate, robust filesystem that will be accessible to all compute nodes of Sapelo2.

These efforts will result in a cluster ready for pre-production testing and acceptance. Acceptance will be performed both by the GACRC support team and by selected users with a wide variety of requirements and experience.

Acceptance tests will cover the usual performance and stress suites, but will also examine usability and functionality, the accuracy of monitoring mechanisms, network connectivity and performance across the Ethernet and InfiniBand fabrics, appropriate access to filesystems, the queueing system, system management software, the program development environment, and various O/S functionality.

Bringing Sapelo2 into full production will consist of draining the current set of Sapelo compute nodes and logically migrating them into the Sapelo2 environment, until Sapelo no longer exists as a cluster and Sapelo2 is fully populated.

The expected end result is that one day users will be asked to log in not to sapelo1.gacrc.uga.edu but to a new location, where their entire user environment will be available, together with a significantly enhanced cluster to work with.


A few technical issues experienced on Sapelo that will be tackled by Sapelo2

• For quite a long period of time, ssh sessions to the Sapelo login node would freeze after being idle for around 30 minutes. This issue has since been resolved, but it negatively affected many users' experience on Sapelo.

• Users cannot access the scratch area (/lustre1) from the login node. They need to use qlogin to reach an interactive node and submit jobs from /lustre1, which many users find inconvenient (see the qlogin sketch at the end of this list).

• Interactive nodes are slow and/or freeze from time to time.

• Until recently, the Sapelo login node had only 1 core and 500MB of RAM, so even a simple text-editing process on the login node would sometimes get killed due to lack of RAM. Commands like showjobs sometimes did not work on the login node, also due to lack of memory.

• Text editors such as vi and nano do not work well on the Sapelo interactive nodes, and editing files on the login node was problematic due to its low RAM. As a result, users sometimes ended up editing their files on the xfer nodes, but then had to go back to the interactive node (i.e. a qlogin session) to submit jobs.

• Code compilation on the compute nodes, including in qlogin sessions, does not work with newer versions of gcc (e.g. gcc 4.7.4 and 5.3.0); these compilations simply hang on the compute nodes.

• Because Sapelo has only one batch queue, users are required to include many PBS header lines in their job submission scripts, which can be an inconvenience (a sample script is sketched at the end of this list). Multiple queues and advanced scheduling mechanisms will be implemented on Sapelo2.

• On Sapelo, users cannot ssh into compute nodes to check on their jobs, or to copy or delete files from /lscratch on those nodes when a job crashes.

• Some users were affected by an issue whereby home directories did not get mounted on the login or xfer nodes when they logged into these systems. This issue was common over a period of some nine months.

• When users accidentally fill up their home directory, they cannot delete any files.

• Users cannot use Control-Z to stop a process on an interactive node. If Control-Z is used, the interactive session starts giving "bash fork" errors.

• Several users experienced issues with small MPI jobs crashing due to a lack of PSM contexts. There has also been confusion about how to request an appropriate number of nodes/cores for hybrid MPI/threaded jobs and how this relates to PSM contexts. For small MPI jobs, users are also asked to request a multiple of 3 cores per job.

• Nodes sometimes crash randomly, or processes hang on the compute nodes.

• Some applications are extremely complex to install due to dependencies that are not satisfied by the O/S version currently running on Sapelo. Sapelo2 will run an updated O/S version that addresses many of these issues.
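
For illustration, the /lustre1 workaround mentioned above looks roughly like the following; the directory and script names are hypothetical placeholders, not actual paths on the cluster.

 # From the Sapelo login node, start an interactive session first,
 # since /lustre1 is not mounted on the login node:
 qlogin

 # On the interactive node, /lustre1 is available:
 cd /lustre1/username      # hypothetical scratch directory
 qsub sub.sh               # hypothetical job submission script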
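As an illustration of the point about PBS header lines, a job submission script on the current single-queue setup might look roughly like the sketch below. The queue name, job name, resource values, and e-mail address are assumptions used for the example, not site-specific recommendations.

 #PBS -S /bin/bash
 #PBS -q batch                  # the single batch queue (name assumed)
 #PBS -N myjob                  # hypothetical job name
 #PBS -l nodes=1:ppn=4          # node and core request
 #PBS -l walltime=48:00:00      # wall-clock limit
 #PBS -l mem=10gb               # memory request
 #PBS -M username@uga.edu       # hypothetical address for notifications
 #PBS -m abe                    # mail on abort, begin, and end

 cd $PBS_O_WORKDIR
 ./myprogram                    # hypothetical executable

With multiple queues and additional scheduling defaults on Sapelo2, a typical script is expected to need fewer of these directives.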