Troubleshooting on Sapelo2

From Research Computing Center Wiki

Introduction

This page provides an overview of common errors that may come up when submitting jobs to an HPC cluster, along with their solutions. You will often be able to resolve an error quickly by applying the solutions and guidelines outlined here.

General Troubleshooting Guidelines

We aim to provide a comprehensive resource for various errors that may come up when working on an HPC cluster, but occasionally you may encounter errors that are not specifically outlined here. These are just a few general troubleshooting guidelines that will help you resolve issues more quickly.

  • Check job output files. By default, when you submit a job to the cluster, it will produce an output file with the naming format of slurm-JobID.out in the directory from which you submitted your job. It is important to check this file in order to diagnose any error that may come up.
  • Read job output very carefully. Whether it is the default Slurm job output file or an output file produced by the software being used, it is worth taking the time to carefully read any output. Often these files may have a lot of text to parse through, but by taking the time to read them carefully, you may just find that one line that gives you the information you need to understand why an error is occurring.
  • At least look at the end of a stack trace. A stack trace is an often lengthy list of the function calls that were active when an exception occurred. It can be intimidating and confusing to look at, but very often the root cause of the problem is at the bottom of the stack trace, and it frequently turns out to be something simple that you can fix yourself.
  • Use Google. It is cliché advice, but too important not to mention. Perhaps you may find the relevant error in an output file, but it makes no sense to you. One of the first things to do in this situation is to copy the error and paste it into Google. More often than not, you will find someone who encountered the same or a similar error and a solution to it.
  • Be mindful of the case of the text you type. If you're new to Linux, one of the main things that takes time to adjust to is being very particular about upper and lowercase text. Case matters when referencing files or directories in Linux. For example, File1.txt and file1.txt are two different file names. Many errors are caused by using the incorrect case.
  • Be mindful of your current directory and relative file paths. Sometimes it may seem that a file or directory is missing when in fact an incorrect path was referenced or the file is located somewhere else. For example, it is easy to try to access your scratch directory but forget the leading slash that makes the path absolute (/scratch/MyID vs. scratch/MyID). When submitting a job to Slurm, the job's execution directory will default to the directory from which you submitted the job. Therefore, make sure your input files are in that directory or have their location correctly specified.
  • Try troubleshooting job steps in an interactive session. It can be helpful when encountering an error with a job on an HPC cluster to start an interactive session, load necessary modules, and try to execute job steps one by one in real time on the command line. Being able to see exactly what is happening with each step of your job like this can provide insight into any errors that may be occurring.
  • Take advantage of the job mail features. You can receive an email when your job starts and/or finishes by using the #SBATCH --mail-type= and #SBATCH --mail-user= options in your submission scripts. These emails include detailed information about your job, such as its resource usage, which is relevant if the job was not given enough resources, and they can provide insight into many different types of errors.
  • Read software documentation. When using any software on an HPC cluster it is worth taking the time to familiarize yourself with the software's usage by reading its documentation. Some software will have its own website or GitHub repo where you can learn more about how to use it properly and leverage it to its fullest potential. If that documentation is lacking, you may be able to find helpful information by starting an interactive session on the cluster, loading the relevant software module, and looking for help output by typing the software command followed by -h, --help, or sometimes nothing at all.
  • Be mindful of software versions. Sometimes even experienced users may encounter strange or unexpected behavior if they're unknowingly using a different version of the software than the one they've used before. This could be something small like a change in the name of a command line option from one minor version to the next, or something more drastic like accidentally using Python 2 when you meant to use Python 3.
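
As a rough illustration of the first few guidelines, the sketch below looks at the end of a job output file and searches it for common failure keywords. The file name slurm-12345.out and its contents are a made-up stand-in created only so the commands can be run anywhere; on the cluster you would point tail and grep at your own output file.

```shell
# Work in a throwaway directory so no real files are touched.
workdir=$(mktemp -d)
out_file="$workdir/slurm-12345.out"   # hypothetical job output file

# Made-up sample content standing in for real job output.
cat > "$out_file" <<'EOF'
Loading input
Traceback (most recent call last):
  File "run.py", line 3, in <module>
    open("input.txt")
FileNotFoundError: [Errno 2] No such file or directory: 'input.txt'
EOF

# The last line of a stack trace usually names the actual problem.
last_line=$(tail -n 1 "$out_file")
echo "$last_line"

# Case-insensitive search for common failure keywords, with line numbers.
grep -inE 'error|killed|denied|not found' "$out_file"

# Count how many lines mention one of those keywords.
error_count=$(grep -ciE 'error|killed|denied|not found' "$out_file")
echo "Matching lines: $error_count"
```

The keyword list is only a starting point; extend it with whatever messages the software you are running tends to print.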

Common Errors

File not found

This error can come up for a lot of reasons, but it's almost always easily fixed. If you encounter this error, check for the following things:

  • The file name, or a directory in its path, is misspelled
  • The file is referenced with an absolute path when you meant to use a relative path, or vice versa
  • The file is not actually where you think it is
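
The checks above can be sketched with a few standard commands. The directory and the file name File1.txt below are placeholders created for the example; the idea is to test whether the exact path exists and, if not, to search for a file whose name differs only in case.

```shell
# Work in a throwaway directory so no real files are touched.
workdir=$(mktemp -d)
touch "$workdir/File1.txt"            # the file that really exists

wanted="$workdir/file1.txt"           # the (wrongly cased) name being referenced

# Test the exact path first. On Linux, names are case-sensitive.
if [ -e "$wanted" ]; then
    echo "Found: $wanted"
else
    echo "Not found: $wanted"
fi

# Look for a name that matches when case is ignored.
match=$(find "$workdir" -maxdepth 1 -iname 'file1.txt')
echo "Closest match ignoring case: $match"

# Print the current directory to confirm where relative paths start from.
pwd
```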

Lmod has detected the following error: The following module(s) are unknown:

This error means that one or more of the modules you are trying to load do not exist. You may have made a typo in the module name; you can look up the correct module name with the ml spider command. For example: ml spider Python.

Command not found

This error means that the command being used cannot be found in any directory listed in your PATH environment variable. Depending on the context, that could be for several reasons:

  • The name of or path to the command or script being used was misspelled
  • A software module needs to be loaded
  • A different version of a software module needs to be loaded
  • There could have been a problem with the installation of the software
  • If you've installed something in your home directory, you may need to add it to your PATH environment variable (for example: export PATH=/home/MyID/path/to/app:$PATH)
  • If you've installed a Conda environment in your home directory, you may need to activate it (source activate /home/MyID/path/to/CondaEnv)
  • If you're using a Singularity container, the command or script you're using may not be in its PATH environment variable, requiring you to reference the command or script with the absolute path to wherever it's located within the container. To explore this, you could start a shell in the container with singularity shell /apps/singularity-images/image.sif (replacing image.sif with the Singularity image you're using).
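
A minimal sketch of the PATH check, assuming a hypothetical program called myapp installed in a directory of your own. The example creates that program in a temporary directory so it can be run anywhere; command -v reports where the shell would find a command, or nothing if it is not on PATH.

```shell
# Create a stand-in "installation" directory containing a tiny program.
appdir=$(mktemp -d)
printf '#!/bin/sh\necho "myapp ran"\n' > "$appdir/myapp"
chmod +x "$appdir/myapp"

# Before PATH is updated, the shell cannot find the command.
if ! command -v myapp >/dev/null 2>&1; then
    echo "myapp: not on PATH yet"
fi

# Prepend the directory to PATH, as in the export example above.
export PATH="$appdir:$PATH"

# Now the shell can resolve and run the command.
found=$(command -v myapp)
echo "myapp resolves to: $found"
output=$(myapp)
echo "$output"
```

The export only lasts for the current shell session or job, so it would need to appear in the submission script, or in a shell startup file, to take effect in later sessions.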

Permission denied

This error also depends on the context, but it is usually straightforward. Depending on what you are trying to do, you may need to modify the permissions of one of your files or directories using the chmod command. In other scenarios, an application may be trying to do something, such as create a file, in a central location where you don't have write permission. In that case you may need to change the application's default output path with either a command line argument or a configuration file, depending on the application.
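
As a small sketch of the chmod case, the example below creates a script without execute permission, inspects it with ls -l, and then grants the owner execute permission. The script name run_me.sh is a placeholder, and the example runs in a temporary directory.

```shell
# Work in a throwaway directory so no real files are touched.
workdir=$(mktemp -d)
script="$workdir/run_me.sh"           # hypothetical script name
printf '#!/bin/sh\necho "ok"\n' > "$script"

# Make sure nobody has execute permission, then show the permission bits.
chmod a-x "$script"
ls -l "$script"

# Grant execute permission to the file's owner only.
chmod u+x "$script"
ls -l "$script"

# The script can now be run directly.
result=$("$script")
echo "$result"

# Check whether you can write to a directory before sending output there.
if [ -w "$workdir" ]; then
    echo "$workdir is writable"
fi
```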

Invalid File format

Text files in Linux, such as submission scripts, fasta files or other text data, should be in ASCII format. If the files were generated on another operating system or copied from the web, they may contain hidden characters that Linux won't recognize.

The file command reports a file's format. Using sub.sh as an example file name:

file sub.sh

The correct format is ASCII. The output of the file command can vary slightly, for example it may report a Python script or ASCII text with long lines. All of these are acceptable as long as the file is ASCII text with no unexpected line terminators:

sub.sh: Bourne-Again shell script, ASCII text

Here are common incorrect formats:

sub.sh: Bourne-Again shell script, UTF-8 Unicode text executable
sub.sh: Bourne-Again shell script, ASCII text, with CRLF line terminators

Most of the time this is easy to fix, either by saving the file in Unix format or by running the following command on the cluster:

dos2unix sub.sh

For files in UTF-8 Unicode encoding, the easiest fix is to manually remove the non-ASCII characters in a Linux text editor, such as vi, nano or pico.
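
If dos2unix is not at hand, the same checks can be approximated with standard tools. The sketch below is an alternative to the commands above, not a replacement for them: it builds a stand-in sub.sh with CRLF line endings in a temporary directory, counts the affected lines, strips the carriage returns with tr, and then lists any lines that still contain bytes outside printable ASCII.

```shell
# Work in a throwaway directory so no real files are touched.
workdir=$(mktemp -d)

# Create a stand-in sub.sh with Windows (CRLF) line endings.
printf '#!/bin/bash\r\necho hello\r\n' > "$workdir/sub.sh"

# A literal carriage-return character to search for.
cr=$(printf '\r')

# Count the lines that contain a carriage return.
crlf_before=$(grep -c "$cr" "$workdir/sub.sh" || true)
echo "Lines with CR before: $crlf_before"

# Strip every carriage return and write the result to a new file.
tr -d '\r' < "$workdir/sub.sh" > "$workdir/sub_unix.sh"
crlf_after=$(grep -c "$cr" "$workdir/sub_unix.sh" || true)
echo "Lines with CR after: $crlf_after"

# List lines containing bytes outside printable ASCII, with line numbers.
# Nothing is printed for this sample because it is plain ASCII.
LC_ALL=C grep -n '[^[:print:][:space:]]' "$workdir/sub_unix.sh" || true
```

Writing to a new file rather than editing in place keeps the original available for comparison.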

Invalid DISPLAY variable

Python scripts using matplotlib may complain about a DISPLAY error. This is seen in applications such as faststructure, oligotyping, QIIME and others. For example, in QIIME:

ml qiime/1.9.1

python /apps/gb/qiime/bin/core_diversity_analyses.py 
...
File "/apps/gb/qiime/1.9.1/lib/python2.7/site-packages/matplotlib/backends/backend_qt5.py", line 138, in _create_qApp
    raise RuntimeError('Invalid DISPLAY variable')
RuntimeError: Invalid DISPLAY variable

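The solution is to tell matplotlib to use the non-interactive Agg backend, which does not need a display, by setting it in matplotlib's configuration file in your home directory:

mkdir -p ~/.config/matplotlib
echo "backend: Agg" > ~/.config/matplotlib/matplotlibrc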

Unrecognized lines following backslash line continuation

When using backslashes to continue a command across lines, there must be no whitespace after any backslash. Otherwise the lines that follow will be parsed as separate commands and cause errors:

#!/bin/bash

#SBATCH --job-name=j_blast
#SBATCH --partition=batch
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=10gb
#SBATCH --time=8:00:00
#SBATCH --mail-user=username@uga.edu
#SBATCH --mail-type=ALL

module load BLAST+/2.13.0-gompi-2022a
blastn -num_threads 4 \
-db /db/ncbiblast/nrte/latest/nt \  
-query my_fasta.fa

There is a space after the backslash on the line "-db /db/ncbiblast/nrte/latest/nt \ ". The backslash then escapes that space instead of the line break, so the blastn command ends on that line and "-query my_fasta.fa" is interpreted as a separate command. BLAST may print its help message because no query file was given, or it may report an error such as "Error: Too many positional arguments ..." because of the stray argument.

The solution is to simply remove the extra space(s).
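
Trailing whitespace is hard to spot by eye, so it can help to search for it. The sketch below recreates the three command lines from the example above in a temporary file, reports which lines have whitespace after a backslash, and writes a corrected copy. The file names are placeholders.

```shell
# Work in a throwaway directory so no real files are touched.
workdir=$(mktemp -d)

# Recreate the command from the example; the second line has two
# spaces after its backslash.
printf '%s\n' \
    'blastn -num_threads 4 \' \
    '-db /db/ncbiblast/nrte/latest/nt \  ' \
    '-query my_fasta.fa' > "$workdir/sub.sh"

# Report the numbers of lines where whitespace follows a backslash
# at the end of the line.
bad_lines=$(grep -nE '\\[[:space:]]+$' "$workdir/sub.sh" | cut -d: -f1)
echo "Lines with whitespace after a backslash: $bad_lines"

# Write a corrected copy with that whitespace removed.
sed -E 's/\\[[:space:]]+$/\\/' "$workdir/sub.sh" > "$workdir/sub_fixed.sh"
```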

Job Runs out of Memory

In some cases a job on an HPC cluster can fail because the application being used runs out of memory. When this happens, the operating system kills the running process. You'll very often see the word "Killed" in the error message, and you may also see the acronym "OOM," which stands for out of memory. Additionally, if you look at the state of your job with the seff or sacct command, you'll see that the job's state is OUT_OF_MEMORY. The solution is typically to allocate more memory to your job with either the --mem or --mem-per-cpu Slurm option.
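
As a rough sketch of checking a job's state and memory use, the example below asks sacct for a few fields. The job ID 12345 is a placeholder, and the guard lets the snippet run harmlessly on a machine without Slurm. Comparing MaxRSS (peak memory used) with ReqMem (memory requested) gives a starting point for choosing a larger --mem value when resubmitting.

```shell
job_id=12345   # hypothetical job ID; replace with your own

# Only query Slurm if the sacct command is available on this machine.
if command -v sacct >/dev/null 2>&1; then
    # Show the job's state, requested memory, peak memory use and run time.
    sacct -j "$job_id" --format=JobID,State,ReqMem,MaxRSS,Elapsed \
        || echo "sacct could not retrieve job $job_id"
else
    echo "sacct is not available here; run this on the cluster"
fi
```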
