Rocky 9 Transition Guide

From Research Computing Center Wiki

Revision as of 13:01, 30 June 2025

Introduction

As part of our July 29-31, 2025 maintenance window, the GACRC will perform the following tasks:

  • Update the cluster Linux operating system from Rocky 8 to Rocky 9.
  • Migrate user home directories (/home) to a new storage device.
  • Update firmware of some computing and network hardware.

Why is a major Operating System (OS) update necessary?

  • Existing OS (Rocky 8) is End of Life - There are no more full version updates being released for the existing operating system and newer versions of some software applications are not supported by the current OS version.
  • Updated system libraries (glibc 2.34) - The upgrade to Rocky 9 includes a newer glibc version, which is required by many modern scientific and HPC applications. This ensures better compatibility and performance for current and future software.
  • Newer kernel (5.14) - The updated Linux kernel enables us to better support new CPUs, GPUs, and storage devices.
  • Security improvement - to retain compliance with current and future security requirements, we must keep using a supported version of the operating system.


What does this mean to you and your workflows?

Overview

  • While the storage device serving the home directories will be updated, all existing /home, /scratch, /work, and /project spaces will retain their existing data.
  • The compiler toolchains and many software packages will be updated to newer versions.
  • Because this is a major OS update, we need to recompile all the applications and ensure that they work with the new version of OS.
  • We will have as comprehensive a software suite available on the new OS as possible, but some less widely used applications and older version software will not be immediately available.
  • As software modules will be reinstalled and updated, all pending jobs will be canceled during the maintenance window, to prevent job failure due to changes in the module names post maintenance.

Storage

The home directories will be migrated to a new storage device. Prior to the maintenance, files from /home were copied to the new storage device, and periodic synchronization runs keep that copy up to date. At the start of the maintenance window, while no users are logged in and no jobs are running, we will perform one last synchronization of the files in /home from the old storage device to the new one.

All existing /home, /scratch, /work, /project, and /db spaces will be available after the maintenance and they will retain their existing data.


Queueing System

The Slurm queueing system will be updated from version 23.02.4 to version 24.11.4. Most compute nodes available on the Rocky 8 system will continue to be available after the transition to Rocky 9, and the Slurm partitions will remain the same.

Software

Warning

Because this is a major change in the operating system, most user software built on Rocky 8 will not work and will need to be rebuilt. Even if the programs run without being rebuilt, the change in the underlying libraries may impact code execution and results. Therefore, users should test and verify that their codes are producing the expected results on the new operating system.
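One way to carry out this check is to keep an output file produced on Rocky 8 and compare it with the output of the same run on Rocky 9. The sketch below assumes plain-text output files; the function name and file names are hypothetical. Small differences in floating-point output do not necessarily mean the results are wrong, so any reported difference still needs to be judged by the user.

```shell
# Minimal sketch: compare a result file saved on Rocky 8 with one produced on Rocky 9.
# compare_results and the file names in the usage line are hypothetical.
compare_results() {
    reference="$1"   # output saved before the maintenance
    candidate="$2"   # output from the same run on Rocky 9
    if cmp -s "$reference" "$candidate"; then
        echo "identical"
    else
        echo "different"
        # Show the first differing lines to help judge whether they matter.
        diff "$reference" "$candidate" | head -n 20
    fi
}

# Usage (hypothetical file names):
#   compare_results results_rocky8.txt results_rocky9.txt
```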

Compiler toolchains

The base compiler toolchains used to build software libraries and applications on the cluster will be updated, as newer versions are able to generate more optimized code for newer computer hardware and newer software versions.

Base compiler toolchains on Rocky 8 (the current Sapelo2):

  • GCCcore/11.2.0, GCC/11.2.0, gompi/2021b, foss/2021b
  • GCCcore/11.3.0, GCC/11.3.0, gompi/2022a, foss/2022a
  • CUDA versions 11.4, 11.7, 12.1
  • OpenMPI versions 4.1.1 and 4.1.4


Base compiler toolchains on Rocky 9:

  • GCCcore/12.3.0, GCC/12.3.0, gompi/2023a, foss/2023a
  • GCCcore/13.3.0, GCC/13.3.0, gompi/2024a, foss/2024a
  • CUDA versions 12.1, 12.4, 12.6, 12.8
  • OpenMPI versions 4.1.5 and 5.0.3


Centrally installed modules

Centrally installed software modules will continue to have the format Name/Version-Toolchain, but for most software packages the Version and Toolchain will be updated. Some module names have the format Name/Version-Toolchain-Versionsuffix with an optional Versionsuffix that might change or be dropped on the new system. There are also some modules whose names will remain the same on the Rocky 9 system. Some examples:

Software | Module name on Rocky 8 | Module name on Rocky 9 | Changes
AlphaFold | AlphaFold/2.3.1-foss-2022a-CUDA-11.7.0 | AlphaFold/2.3.2-foss-2023a-CUDA-12.1.1 | version, toolchain, CUDA version
BLAST+ | BLAST+/2.13.0-gompi-2022a | BLAST+/2.16.0-gompi-2024a | version, toolchain
BRAKER | BRAKER/3.0.8-foss-2022a | BRAKER/3.0.8-foss-2022a | toolchain
SAMtools | SAMtools/1.18-GCC-12.3.0 | SAMtools/1.18-GCC-12.3.0 | no changes
SAMtools | SAMtools/1.18-GCC-12.3.0 | SAMtools/1.21-GCC-13.3.0 | version, toolchain, though old version still available
SpaceRanger | SpaceRanger/2.1.0-GCC-11.3.0 | not available (yet) |
STAR | STAR/2.7.10b-GCC-11.3.0 | STAR/2.7.11b-GCC-13.3.0 | version, toolchain
UMI-tools | UMI-tools/1.1.2-foss-2022a-Python-3.10.4 | UMI-tools/1.1.4-foss-2023a | version, toolchain, versionsuffix

A list of the modules already installed on the Rocky 9 system is available at Software installed on Rocky 9.
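To see which part of a module name changed, it can help to split the name into its components. The sketch below is an illustration only: the function split_module is hypothetical, and it assumes that the version contains no hyphen and that the toolchain is always written as name-version (for example foss-2023a or GCC-13.3.0), which holds for the examples above but may not hold for every module.

```shell
# Minimal sketch: split Name/Version-Toolchain[-Versionsuffix] into its parts.
# Assumes the version has no hyphen and the toolchain is "name-version".
split_module() {
    full="$1"
    name="${full%%/*}"        # text before the slash
    rest="${full#*/}"         # Version-Toolchain[-Versionsuffix]
    version="${rest%%-*}"     # text up to the first hyphen
    tail="${rest#*-}"         # Toolchain[-Versionsuffix]
    tc_name="${tail%%-*}"     # e.g. foss, gompi, GCC
    after="${tail#*-}"        # toolchain version plus optional suffix
    tc_ver="${after%%-*}"
    suffix=""
    case "$after" in
        *-*) suffix="${after#*-}" ;;   # anything left is the versionsuffix
    esac
    printf '%s|%s|%s|%s\n' "$name" "$version" "$tc_name-$tc_ver" "$suffix"
}

split_module "AlphaFold/2.3.1-foss-2022a-CUDA-11.7.0"
split_module "STAR/2.7.11b-GCC-13.3.0"
```

The output fields are name, version, toolchain, and versionsuffix, separated by a vertical bar; the last field is empty when the module has no versionsuffix.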

Conda environments

Some users have conda environments installed in their home directory or group shared directories. These environments should be reinstalled on the Rocky 9 system, using versions of Miniforge available there. Because of changes in Anaconda licenses, the updated cluster does not provide a central installation of Anaconda or Miniconda. We suggest using Miniforge to create conda environments.

Mamba and Micromamba are available on the Rocky 9 system as well.

Documentation on how to install conda environments on the cluster is available at https://wiki.gacrc.uga.edu/wiki/Installing_Applications_on_Sapelo2

Python packages

Python libraries and virtual environments need to be reinstalled as well, using versions of Python available on the Rocky 9 system.

R packages

We recommend that users reinstall any R packages that they have installed in their own directories, to make sure they are compatible with the new OS version and with the versions of R available on the Rocky 9 system.

Singularity containers

Singularity containers that you used on Rocky 8 should continue to work on the Rocky 9 system. The containers installed centrally in /apps/singularity-images will be available after the maintenance.


Potential issues

Error connecting to Sapelo2

Because Sapelo2 was reinstalled, you might encounter a "host key" or "host id" error when you connect to Sapelo2 for the first time after the maintenance.


Connecting from MacOS or Linux

Users connecting from a MacOS or a Linux system might see an error like this:

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@       WARNING: POSSIBLE DNS SPOOFING DETECTED!          @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
The ECDSA host key for sapelo2 has changed,
and the key for the corresponding IP address 128.192.75.18
is unchanged. This could either mean that
DNS SPOOFING is happening or the IP address for the host
and its host key have changed at the same time.
Offending key for IP in /Users/jsmith/.ssh/known_hosts:76
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the ECDSA key sent by the remote host is
SHA256:E1ovq19vLNYNF1eFiOQ91tc1EPtbHcMhML2I45UrJrE.
Please contact your system administrator.
Add correct host key in /Users/jsmith/.ssh/known_hosts to get rid of this message.
Offending ECDSA key in /Users/jsmith/.ssh/known_hosts:25
ECDSA host key for sapelo2 has changed and you have requested strict checking.
Host key verification failed.

To fix this problem, you will need to remove the keys belonging to the host, sapelo2.gacrc.uga.edu. This can be done by manually deleting all lines corresponding to the host, sapelo2.gacrc.uga.edu, in the ~/.ssh/known_hosts file, or by executing the command:

ssh-keygen -R sapelo2.gacrc.uga.edu

Once you have done this, you should be able to ssh into sapelo2.gacrc.uga.edu. You might still get a message like this:

[jsmith@laptop]$ ssh jsmith@sapelo2.gacrc.uga.edu
The authenticity of host 'sapelo2.gacrc.uga.edu' can't be established.
ECDSA key fingerprint is SHA256:ikdjggjeorjgnkresitnsgjsms
ECDSA key fingerprint is MD5:be:1xxxxxxxxxxxx
Are you sure you want to continue connecting (yes/no)? 

You can type yes and your connection should work.
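The removal step described above can be wrapped in a small helper that first checks whether an entry exists. This is a sketch, not an official GACRC tool: the function name forget_host is hypothetical, while the ssh-keygen options -F (find a host), -R (remove a host), and -f (choose the known_hosts file) are standard OpenSSH options. Note that ssh-keygen -R saves a backup of the previous file with the suffix .old.

```shell
# Minimal sketch: remove a host's old keys from known_hosts only if present.
# forget_host is a hypothetical helper name.
forget_host() {
    host="$1"
    known_hosts="${2:-$HOME/.ssh/known_hosts}"
    # -F exits with status 0 when an entry for the host is found.
    if ssh-keygen -F "$host" -f "$known_hosts" >/dev/null 2>&1; then
        # -R deletes all keys for the host and keeps a .old backup.
        ssh-keygen -R "$host" -f "$known_hosts" >/dev/null 2>&1 && echo "removed"
    else
        echo "absent"
    fi
}

# Usage:
#   forget_host sapelo2.gacrc.uga.edu
```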


Connecting from Windows

When connecting from Windows for the first time after the maintenance, users might encounter an error like POTENTIAL SECURITY BREACH or HOST IDENTIFICATION HAS CHANGED. Users can click Yes to continue the connection and have a new host key saved on their local machines.

Modules in your .bashrc no longer work or give errors on login

If you have edited your .bashrc file to include commands to load modules automatically when you log in, you may find that some Rocky 8 modules will not be found or may not work on Rocky 9. You will need to edit your .bashrc and comment out or remove any such lines. You can also replace the module load commands in your .bashrc file with new module names. If you can no longer log in because of something in your .bashrc, contact us and we can rename your .bashrc and copy in a default version for you.

If you’d like to start from scratch, a default .bashrc contains the following:

# .bashrc

# Source global definitions
if [ -f /etc/bashrc ]; then
    . /etc/bashrc
fi

# User specific aliases and functions below
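Before editing an existing .bashrc, it may help to list the lines that load modules. The sketch below assumes those lines start with either module load or ml; the function name list_module_lines is hypothetical, and the pattern also matches other ml subcommands (for example ml purge), so the output should be reviewed by hand.

```shell
# Minimal sketch: print the numbered lines of a file that load modules.
# list_module_lines is a hypothetical helper; commented-out lines are skipped.
list_module_lines() {
    rc="${1:-$HOME/.bashrc}"
    grep -nE '^[[:space:]]*(module[[:space:]]+load|ml)[[:space:]]+[^[:space:]#]' "$rc"
}

# Usage:
#   list_module_lines ~/.bashrc
```

Each reported line can then be commented out, removed, or updated to a module name that exists on Rocky 9.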

Job gets module not found errors, the same script used to work on Sapelo2

Many software modules have been updated with a new version and/or a new toolchain version. The modules your jobs loaded on the Rocky 8 system might not be available on Rocky 9. Please check the name of the modules on the updated cluster. You can search for a module using the ml spider NAME command, where NAME needs to be replaced by the software package name that you are searching for. You can also see a list of all installed software with the command ml avail.
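To find out which modules a job script asks for, the module names can be pulled out of the script and then looked up one by one with ml spider. The sketch below is an assumption-laden illustration: the function name modules_in_script and the script name sub.sh are hypothetical, and it only recognizes lines that begin with ml or module load, so other ml subcommands on such lines would be reported as if they were module names.

```shell
# Minimal sketch: print every module named on "ml ..." or "module load ..." lines.
# modules_in_script is a hypothetical helper; trailing comments are ignored.
modules_in_script() {
    awk '$1 == "ml" || ($1 == "module" && $2 == "load") {
        start = ($1 == "ml") ? 2 : 3
        for (i = start; i <= NF; i++) {
            if ($i ~ /^#/) break      # stop at an inline comment
            print $i
        }
    }' "$1"
}

# Usage on the cluster (sub.sh is a hypothetical job script):
#   for m in $(modules_in_script sub.sh); do
#       ml spider "${m%%/*}"          # search by software name only
#   done
```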

Job gets command not found errors, but module load command included in job submission script

If you are attempting to load a module that was available on Rocky 8 but is no longer available on Rocky 9, the module will not be loaded, and the commands provided by that module will not be available for the job. Please check the correct name of the modules on the Rocky 9 system. If the software is not available on the updated cluster, please feel free to submit a software installation request ticket and we will try to get it installed for you.

Python scripts not working anymore

Please note that the updated Sapelo2 does not have /usr/bin/python or /usr/bin/python2 installed. The OS comes with a default /usr/bin/python3 (v. 3.9.21). Scripts that have the first line:

#!/usr/bin/python

or

#!/usr/bin/python2

will not work on Sapelo2 (with the Rocky 9 OS). We recommend that you change this line to

#!/usr/bin/env python

and load one of the Python modules before running the script. The following command on Sapelo2 will show all the Python modules installed centrally:

ml spider Python
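To locate scripts that still use one of the old first lines, a directory can be scanned as in the sketch below. The function name find_old_shebangs is hypothetical; it only reports files whose first line is exactly #!/usr/bin/python or #!/usr/bin/python2 (optionally followed by arguments) and does not modify anything.

```shell
# Minimal sketch: list files whose first line points at /usr/bin/python or /usr/bin/python2.
# find_old_shebangs is a hypothetical helper; it only reads files, it never edits them.
find_old_shebangs() {
    find "${1:-.}" -type f -exec sh -c '
        for f in "$@"; do
            # Read only the first line; skip empty files.
            IFS= read -r first_line < "$f" || [ -n "$first_line" ] || continue
            case "$first_line" in
                "#!/usr/bin/python"|"#!/usr/bin/python "*|"#!/usr/bin/python2"|"#!/usr/bin/python2 "*)
                    printf "%s\n" "$f" ;;
            esac
        done' sh {} +
}

# Usage:
#   find_old_shebangs ~/scripts
```

Files that use #!/usr/bin/python3 or #!/usr/bin/env python are not reported.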