Rocky 8 Transition Guide

From Research Computing Center Wiki

Revision as of 12:57, 28 July 2023

Introduction

As part of our August 29-31, 2023 maintenance window, the GACRC will be upgrading the Sapelo2 cluster operating system from CentOS 7 to Rocky 8.

Why is a major Operating System (OS) update necessary?

  • Existing RHEL-7-based OS is End of Life - There are no more full version updates being released for the existing operating system and newer versions of some software applications are not supported by the current OS version.
  • Hardware Support for new nodes and processors - Because development of the existing OS has stopped, some of the latest generation of compute node hardware cannot run it, since that hardware needs newer drivers than this OS provides. The new hardware and architectures that we will be bringing online soon require this OS update.
  • Security - To retain compliance with current and future security requirements, we must keep using a supported version of the operating system.
  • Why Rocky 8? - Many HPC centers are adopting it, which means there is a good amount of community support. PB proposes: "The community around this RHEL-based distribution (its development and support) is primarily HPC-oriented, making it a good fit for HPC centers."

What does this mean to you and your workflows?

Overview

  • We are not changing anything from the data storage standpoint. All existing /home, /scratch, /work, and /project spaces will retain their existing data.
  • The compiler toolchains and many software packages will be updated to newer versions.
  • Because this is a major OS update, we need to recompile all the applications and ensure that they work with the new version of the OS.
  • We will have as comprehensive a software suite available on the new OS as possible, but some less widely used applications and older software versions will not be immediately available.
  • Because software modules will be reinstalled and updated, all pending jobs will be canceled during the maintenance window to prevent job failures caused by changed module names after the maintenance.

Storage

There will be no changes to the storage system during this maintenance window. All existing /home, /scratch, /work, /project, and /db spaces will be available after the maintenance and they will retain their existing data.


Queueing System

The Slurm queueing system will be updated from version 21.08.8 to version 23.02.2. Most compute nodes available on the CentOS 7 system will continue to be available after the transition to Rocky 8, and the Slurm partitions will remain the same.


Software

Warning

Because this is a major change in the operating system, most user software built on CentOS 7 will not work and will need to be rebuilt. Even if the programs run without being rebuilt, the change in the underlying libraries may impact code execution and results. Therefore, users should test and verify that their codes are producing the expected results on the new operating system.
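One way to carry out such a check is to keep an output file produced on CentOS 7 and compare it with the output of the same job on Rocky 8. The sketch below assumes plain-text result files; compare_runs and the file names in the usage comment are hypothetical. A byte-for-byte comparison is stricter than many numerical codes need, since small floating-point differences can be harmless, so treat a "different" result as a prompt to look closer rather than as proof of a problem.

```shell
# compare_runs is a hypothetical helper, not a cluster command.
# It reports whether two result files are byte-identical and, if they
# are not, prints the first lines of their diff and returns non-zero.
compare_runs() {
    local reference="$1" candidate="$2"
    if cmp -s "$reference" "$candidate"; then
        echo "identical"
    else
        echo "different"
        diff "$reference" "$candidate" | head -n 10
        return 1
    fi
}

# Example (hypothetical file names):
#   compare_runs results_centos7.txt results_rocky8.txt
```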

Compiler toolchains

The base compiler toolchains used to build software libraries and applications on the cluster will be updated, as newer versions are able to generate more optimized code for newer computer hardware and newer software versions.

Base compiler toolchains on CentOS 7 (the current Sapelo2):

  • GCCcore/8.3.0, GCC/8.3.0, gompi/2019b, foss/2019b
  • GCCcore/10.2.0, GCC/10.2.0, gompi/2020b, foss/2020b
  • CUDA versions 10.2 and 11.1
  • OpenMPI versions 3.1.4 and 4.0.5

Base compiler toolchains on Rocky 8:

  • GCCcore/11.2.0, GCC/11.2.0, gompi/2021b, foss/2021b
  • GCCcore/11.3.0, GCC/11.3.0, gompi/2022a, foss/2022a
  • CUDA versions 11.4, 11.7, and 12.0
  • OpenMPI versions 4.1.2 and 4.1.4

Centrally installed modules

Centrally installed software modules will continue to have the format Name/Version-Toolchain, but for most software packages the Version and Toolchain will be updated. Some module names have an optional Versionsuffix, which might change or be dropped on the new system. A few modules will keep the same name on the Rocky 8 system. Some examples:

Software     | Module name on CentOS 7                | Module name on Rocky 8     | Changes
ABySS        | ABySS/2.3.1-foss-2019b                 | ABySS/2.3.5-foss-2021b     | version, toolchain
BLAST+       | BLAST+/2.12.0-gompi-2020b              | BLAST+/2.13.0-gompi-2022a  | version, toolchain
BWA          | BWA/0.7.17-GCC-10.3.0                  | BWA/0.7.17-GCCcore-11.2.0  | toolchain
DeepAffinity | DeepAffinity/0.1                       | not available (yet)        |
SAMtools     | SAMtools/1.16.1-GCC-11.3.0             | SAMtools/1.16.1-GCC-11.3.0 | no changes
STAR         | STAR/2.7.10a-GCC-8.3.0                 | STAR/2.7.10b-GCC-11.3.0    | version, toolchain
Trinity      | Trinity/2.10.0-foss-2019b-Python-3.7.4 | Trinity/2.15.1-foss-2022a  | version, toolchain, versionsuffix

Conda environments

Some users have conda environments installed in their home directory or group shared directories. These environments should be reinstalled on the Rocky 8 system, using versions of Miniconda or Anaconda available there. Documentation on how to install conda environments on the cluster is available at https://wiki.gacrc.uga.edu/wiki/Installing_Applications_on_Sapelo2

Python packages

Python libraries and virtual environments need to be reinstalled as well, using the versions of Python, Miniconda, or Anaconda available on the Rocky 8 system.

R packages

We recommend that users reinstall any R packages that they have installed in their own directories, to make sure the packages are compatible with the new OS version and with the versions of R available on it.

Singularity containers

Singularity containers that you used on CentOS 7 should continue to work on the Rocky 8 system. The containers installed centrally in /apps/singularity-images will be available after the maintenance.


Potential issues

Error connecting to Sapelo2

Because Sapelo2 was reinstalled, you might encounter a "host key" or "host id" error when you connect to Sapelo2 for the first time after the maintenance.


Connecting from macOS or Linux

Users connecting from a macOS or Linux system might see an error like this:

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@       WARNING: POSSIBLE DNS SPOOFING DETECTED!          @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
The ECDSA host key for sapelo2 has changed,
and the key for the corresponding IP address 128.192.75.18
is unchanged. This could either mean that
DNS SPOOFING is happening or the IP address for the host
and its host key have changed at the same time.
Offending key for IP in /Users/jsmith/.ssh/known_hosts:76
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the ECDSA key sent by the remote host is
SHA256:E1ovq19vLNYNF1eFiOQ91tc1EPtbHcMhML2I45UrJrE.
Please contact your system administrator.
Add correct host key in /Users/jsmith/.ssh/known_hosts to get rid of this message.
Offending ECDSA key in /Users/jsmith/.ssh/known_hosts:25
ECDSA host key for sapelo2 has changed and you have requested strict checking.
Host key verification failed.

To fix this problem, open the known_hosts file on your local machine. Its full path is shown in the error message; in the example above it is /Users/jsmith/.ssh/known_hosts. The error message also gives the line numbers of the offending entries. Delete the line that contains sapelo2.gacrc.uga.edu and save the file.
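As an alternative to editing the file by hand, OpenSSH's ssh-keygen can remove the stored keys for a host with its -R option. The sketch below wraps that call in a function; forget_host is a hypothetical helper name, and it assumes your local machine uses OpenSSH with known_hosts in the default location unless you pass another path. ssh-keygen keeps the previous contents in a file with the .old suffix.

```shell
# forget_host is a hypothetical helper name; -R and -f are standard
# ssh-keygen options. -R removes all keys stored for the given host,
# -f selects the known_hosts file to edit.
forget_host() {
    local host="$1"
    local file="${2:-$HOME/.ssh/known_hosts}"
    ssh-keygen -R "$host" -f "$file"
}

# Example, to be run on your local machine (not on the cluster):
#   forget_host sapelo2.gacrc.uga.edu
#   forget_host 128.192.75.18   # the IP address named in the warning
```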

Once you have done this, you should be able to ssh into sapelo2.gacrc.uga.edu. You might still get a message like this:

[jsmith@laptop]$ ssh jsmith@sapelo2.gacrc.uga.edu
The authenticity of host 'sapelo2.gacrc.uga.edu' can't be established.
ECDSA key fingerprint is SHA256:ikdjggjeorjgnkresitnsgjsms
ECDSA key fingerprint is MD5:be:1xxxxxxxxxxxx
Are you sure you want to continue connecting (yes/no)? 

You can type yes and your connection should work.


Connecting from Windows

When connecting from Windows for the first time after the maintenance, users might encounter an error like POTENTIAL SECURITY BREACH or HOST IDENTIFICATION HAS CHANGED. Users can click Yes to continue the connection and have a new host key saved on their local machines.

Modules in your .bashrc no longer work or give errors on login

If you have edited your .bashrc file to include commands that load modules automatically when you log in, you may find that some CentOS 7 modules will not be found or may not work on Rocky 8. You will need to edit your .bashrc and comment out or remove any such lines. You can also replace the module load commands in your .bashrc file with new module names. If you can no longer log in because of something in your .bashrc, contact us and we can rename your .bashrc and copy in a default version for you.
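If you can still log in, the following sketch shows one way to find and comment out module commands in your .bashrc. The function names and the .centos7 backup suffix are hypothetical choices, and the pattern only catches lines that begin with module load or with the ml shorthand followed by a module name, so review the file afterwards.

```shell
# Hypothetical helpers. The pattern matches lines that begin with
# "module load" or with the "ml" shorthand followed by whitespace.
MODULE_LINE='^[[:space:]]*(module[[:space:]]+load|ml)[[:space:]]'

# Show the matching lines together with their line numbers.
list_module_lines() {
    grep -nE "$MODULE_LINE" "${1:-$HOME/.bashrc}"
}

# Comment out the matching lines, keeping the original as <file>.centos7.
comment_out_module_lines() {
    local file="${1:-$HOME/.bashrc}"
    cp -p "$file" "$file.centos7" || return 1
    sed -E -i.tmp \
        's/^([[:space:]]*)((module[[:space:]]+load|ml)[[:space:]])/\1# \2/' \
        "$file" && rm -f "$file.tmp"
}

# Example:
#   list_module_lines          # inspect first
#   comment_out_module_lines   # then comment the lines out
```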

If you’d like to start from scratch, a default .bashrc contains the following:

# .bashrc

# Source global definitions
if [ -f /etc/bashrc ]; then
    . /etc/bashrc
fi

# User specific aliases and functions below

My job gets module not found errors, but the same script used to work on Sapelo2

Many software modules have been updated with a new version and/or a new toolchain version. The modules your jobs loaded on the CentOS 7 system might not be available on Rocky 8. Please check the name of the modules on the updated cluster. You can search for a module using the ml spider NAME command, where NAME needs to be replaced by the software package name that you are searching for. You can also see a list of all installed software with the command ml avail.
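To find out which names to search for, it can help to list the modules a job script loads. The sketch below is a rough helper under stated assumptions: loaded_module_names and the script name sub.sh are hypothetical, and it only looks at lines that start with module load or ml, treating every following word as a module name (so a line such as ml spider NAME inside a script would be misread).

```shell
# loaded_module_names is a hypothetical helper, not a cluster command.
# It prints the software name (the part before the first "/") of each
# module that a job script loads with "module load" or "ml".
loaded_module_names() {
    awk '
    {
        sub(/#.*/, "")          # ignore comments and #SBATCH lines
        first = 0
        if ($1 == "module" && $2 == "load") first = 3
        else if ($1 == "ml") first = 2
        if (first == 0) next
        for (i = first; i <= NF; i++) {
            split($i, parts, "/")
            print parts[1]
        }
    }' "$1" | LC_ALL=C sort -u
}

# Example, on the cluster (sub.sh is a hypothetical job script):
#   for name in $(loaded_module_names sub.sh); do ml spider "$name"; done
```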

My job gets command not found errors, but I did load the module

If you are attempting to load a module that was available on CentOS 7 but is no longer available on Rocky 8, the module will not be loaded, and the commands provided by that module will not be available to the job. Please check the correct name of the modules on the Rocky 8 system. If the software is not available on the updated cluster, please feel free to submit a software installation request ticket and we will try to get it installed for you.