Rocky 8 Transition Guide
Latest revision as of 14:44, 4 October 2023
Introduction
As part of our September 5-7, 2023 maintenance window, the GACRC will be upgrading the Sapelo2 cluster operating system from CentOS 7 to Rocky 8.
Why is a major Operating System (OS) update necessary?
- Existing RHEL-7-based OS is End of Life - There are no more full version updates being released for the existing operating system and newer versions of some software applications are not supported by the current OS version.
- Hardware Support for new nodes and processors - As development within the existing OS has stopped, some of the latest generation of compute node hardware cannot use it, needing driver types newer than what this OS has. New hardware and architecture that we will be bringing online soon requires this OS update.
- Security - To retain compliance with current and future security requirements, we must keep using a supported version of the operating system.
- Why Rocky 8? - The community around the development and support of this RHEL-based distribution is primarily HPC-oriented, making it a good fit for HPC centers.
What does this mean to you and your workflows?
Overview
- We are not changing anything from the data storage standpoint. All existing /home, /scratch, /work, and /project spaces will retain their existing data.
- The compiler toolchains and many software packages will be updated to newer versions.
- Because this is a major OS update, we need to recompile all the applications and ensure that they work with the new version of OS.
- We will have as comprehensive a software suite available on the new OS as possible, but some less widely used applications and older version software will not be immediately available.
- As software modules will be reinstalled and updated, all pending jobs will be canceled during the maintenance window, to prevent job failures caused by module name changes after the maintenance.
Storage
There will be no changes to the storage system at this maintenance window. All existing /home, /scratch, /work, /project, and /db spaces will be available after the maintenance and they will retain their existing data.
Queueing System
The Slurm queueing system will be updated from version 21.08.8 to version 23.02.4. Most compute nodes available on the CentOS 7 system will continue to be available after the transition to Rocky 8, and the Slurm partitions will remain the same.
Software
Warning
Because this is a major change in the operating system, most user software built on CentOS 7 will not work and will need to be rebuilt. Even if the programs run without being rebuilt, the change in the underlying libraries may impact code execution and results. Therefore, users should test and verify that their codes are producing the expected results on the new operating system.
Compiler toolchains
The base compiler toolchains used to build software libraries and applications on the cluster will be updated, as newer versions are able to generate more optimized code for newer computer hardware and newer software versions.
Base compiler toolchains on CentOS 7 (the current Sapelo2):
- GCCcore/8.3.0, GCC/8.3.0, gompi/2019b, foss/2019b
- GCCcore/10.2.0, GCC/10.2.0, gompi/2020b, foss/2020b
- CUDA versions 10.2 and 11.1
- OpenMPI versions 3.1.4 and 4.0.5
Base compiler toolchains on Rocky 8:
- GCCcore/11.2.0, GCC/11.2.0, gompi/2021b, foss/2021b
- GCCcore/11.3.0, GCC/11.3.0, gompi/2022a, foss/2022a
- CUDA versions 11.4, 11.7, and 12.0
- OpenMPI versions 4.1.1 and 4.1.4
Centrally installed modules
Centrally installed software modules will continue to have the format Name/Version-Toolchain, but for most software packages the Version and Toolchain will be updated. Some module names have the format Name/Version-Toolchain-Versionsuffix with an optional Versionsuffix that might change or be dropped on the new system. There are also some modules whose names will remain the same on the Rocky 8 system. Some examples:
Software | Module name on CentOS 7 | Module name on Rocky 8 | Changes
---|---|---|---
ABySS | ABySS/2.3.1-foss-2019b | ABySS/2.3.5-foss-2021b | version, toolchain
BLAST+ | BLAST+/2.12.0-gompi-2020b | BLAST+/2.13.0-gompi-2022a | version, toolchain
BWA | BWA/0.7.17-GCC-10.3.0 | BWA/0.7.17-GCCcore-11.2.0 | toolchain
DeepAffinity | DeepAffinity/0.1 | not available (yet) |
SAMtools | SAMtools/1.16.1-GCC-11.3.0 | SAMtools/1.16.1-GCC-11.3.0 | no changes
STAR | STAR/2.7.10a-GCC-8.3.0 | STAR/2.7.10b-GCC-11.3.0 | version, toolchain
Trinity | Trinity/2.10.0-foss-2019b-Python-3.7.4 | Trinity/2.15.1-foss-2022a | version, toolchain, versionsuffix
A list of the modules already installed on the Rocky 8 system is available at Software installed on Rocky 8.
Conda environments
Some users have conda environments installed in their home directory or group shared directories. These environments should be reinstalled on the Rocky 8 system, using versions of Miniconda or Anaconda available there. Documentation on how to install conda environments on the cluster is available at https://wiki.gacrc.uga.edu/wiki/Installing_Applications_on_Sapelo2
Python packages
Python libraries and virtual environments need to be reinstalled as well, using versions of Python, Miniconda, or Anaconda available on the Rocky 8 system.
R packages
We recommend that users reinstall any R packages that they have installed in their own directories, to make sure they are compatible with the new OS version and with the versions of R available there.
Singularity containers
Singularity containers that you used on CentOS 7 should continue to work on the Rocky 8 system. The containers installed centrally in /apps/singularity-images will be available after the maintenance.
Potential issues
Error connecting to Sapelo2
Because Sapelo2 was reinstalled, you might encounter a "host key" or "host id" error when you connect to Sapelo2 for the first time after the maintenance.
Connecting from MacOS or Linux
Users connecting from a MacOS or a Linux system might see an error like this:
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@       WARNING: POSSIBLE DNS SPOOFING DETECTED!          @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
The ECDSA host key for sapelo2 has changed, and the key for the
corresponding IP address 128.192.75.18 is unchanged. This could
either mean that DNS SPOOFING is happening or the IP address for
the host and its host key have changed at the same time.
Offending key for IP in /Users/jsmith/.ssh/known_hosts:76
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the ECDSA key sent by the remote host is
SHA256:E1ovq19vLNYNF1eFiOQ91tc1EPtbHcMhML2I45UrJrE.
Please contact your system administrator.
Add correct host key in /Users/jsmith/.ssh/known_hosts to get rid of this message.
Offending ECDSA key in /Users/jsmith/.ssh/known_hosts:25
ECDSA host key for sapelo2 has changed and you have requested strict checking.
Host key verification failed.
To fix this problem, you will need to remove the keys belonging to the host sapelo2.gacrc.uga.edu. This can be done by manually deleting all lines corresponding to that host in the ~/.ssh/known_hosts file, or by executing the command:
ssh-keygen -R sapelo2.gacrc.uga.edu
Once you have done this, you should be able to ssh into sapelo2.gacrc.uga.edu. You might still get a message like this:
[jsmith@laptop]$ ssh jsmith@sapelo2.gacrc.uga.edu
The authenticity of host 'sapelo2.gacrc.uga.edu' can't be established.
ECDSA key fingerprint is SHA256:ikdjggjeorjgnkresitnsgjsms
ECDSA key fingerprint is MD5:be:1xxxxxxxxxxxx
Are you sure you want to continue connecting (yes/no)?
You can type yes and your connection should work.
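If you want to compare the fingerprint before answering yes, ssh-keygen can print fingerprints for you. The snippet below is purely our illustration, not part of GACRC's instructions: the commented ssh-keyscan line shows how you would fetch and fingerprint the cluster's current keys from your own machine, and the live part of the demo exercises the same -lf option on a throwaway key so it runs anywhere.

```shell
# Our illustration only (not a GACRC command): ssh-keygen -l prints a
# key's fingerprint, so you can inspect it before answering "yes".
# Against the real cluster you would run, on your own machine:
#   ssh-keyscan sapelo2.gacrc.uga.edu 2>/dev/null | ssh-keygen -lf -
# Demo of the -lf option on a freshly generated throwaway key:
tmpkey=$(mktemp -u)
ssh-keygen -q -t ed25519 -N '' -f "$tmpkey"
fp=$(ssh-keygen -lf "$tmpkey.pub")
echo "$fp"                      # e.g. "256 SHA256:... (ED25519)"
rm -f "$tmpkey" "$tmpkey.pub"
```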
Connecting from Windows
When connecting from Windows for the first time after the maintenance, users might encounter an error like POTENTIAL SECURITY BREACH or HOST IDENTIFICATION HAS CHANGED. Users can click Yes to continue the connection and have a new host key saved on their local machines.
Modules in your .bashrc no longer work or give errors on login
If you have edited your .bashrc file to include commands to load modules automatically when you login, you may find that some CentOS 7 modules will not be found or may not work on Rocky 8. You will need to edit your .bashrc and comment out or remove any such lines. You can also replace the module load commands in your .bashrc file with new module names. If you can no longer log in because of something in your .bashrc, contact us and we can rename your .bashrc and copy in a default version for you.
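One way to make such lines robust is to wrap each load in a check. This is a sketch of ours, assuming the cluster uses Lmod (whose module is-avail subcommand reports whether a module can be loaded); the BWA module name is only an example taken from the table above:

```shell
# Sketch for ~/.bashrc: load a module only if it exists; otherwise print
# a warning and let the login continue instead of failing.
safe_module_load() {
    if module is-avail "$1" 2>/dev/null; then
        module load "$1"
    else
        echo "warning: module $1 not found; please update ~/.bashrc" >&2
    fi
}
# Example module name; substitute the modules you actually use.
safe_module_load BWA/0.7.17-GCCcore-11.2.0
```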
If you’d like to start from scratch, a default .bashrc contains the following:
# .bashrc

# Source global definitions
if [ -f /etc/bashrc ]; then
    . /etc/bashrc
fi

# User specific aliases and functions below
Job gets module not found errors, the same script used to work on Sapelo2
Many software modules have been updated with a new version and/or a new toolchain version. The modules your jobs loaded on the CentOS 7 system might not be available on Rocky 8. Please check the names of the modules on the updated cluster. You can search for a module using the ml spider NAME command, where NAME needs to be replaced by the software package name that you are searching for. You can also see a list of all installed software with the command ml avail.
Job gets command not found errors, but module load command included in job submission script
If you are attempting to load a module that was available on CentOS 7 but is no longer available on Rocky 8, the module will not be loaded, and the commands provided by that module will not be available to the job. Please check the correct names of the modules on the Rocky 8 system. If the software is not available on the updated cluster, please feel free to submit a software installation request ticket and we will try to get it installed for you.
Python scripts not working anymore
Please note that the updated Sapelo2 does not have /usr/bin/python installed. The OS comes with a default /usr/bin/python2 (v. 2.7.18) and a default /usr/bin/python3 (v. 3.6.8). Scripts that have the first line:
#!/usr/bin/python
will not work on Sapelo2 (with the Rocky 8 OS). We recommend that you change this line to
#!/usr/bin/env python
and load one of the Python modules before running the script. The following command on Sapelo2 will show all the Python modules installed centrally:
ml spider Python
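Putting this together, a portable script might look like the sketch below. The explicit version guard is our suggestion, not something Sapelo2 requires; it simply fails early with a helpful message if the script is accidentally run under the OS default Python 2.

```python
#!/usr/bin/env python
# 'env' resolves whichever 'python' is first on PATH (for example, one
# provided by a loaded Python module) instead of the missing /usr/bin/python.
import sys

# Optional guard (our addition): fail early if the interpreter is too old,
# pointing the user at the module system.
if sys.version_info < (3, 0):
    sys.exit("Python 3 required; load a module first (try: ml spider Python)")

print("Running under Python %d.%d" % sys.version_info[:2])
```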