EGAPx-Sapelo2

From Research Computing Center Wiki
Revision as of 15:14, 20 May 2025 by Jordan (Initial creation of this page)


Category

Bioinformatics

Program On

N/A

Version

0.3.2-alpha

Author / Distributor

Please see https://github.com/ncbi/egapx

Description

From https://github.com/ncbi/egapx: "EGAPx is the publicly accessible version of the updated NCBI Eukaryotic Genome Annotation Pipeline."

Running Program

Also refer to Running Jobs on Sapelo2

As of 05/20/2025, EGAPx is not installed centrally: because of its current implementation of Nextflow, a central installation would risk overwriting important files (this might change in the future when the software is out of alpha/beta). This page therefore outlines the steps needed to run EGAPx from your /home directory.

First, start an interactive job and, in your /home directory, clone the repository and then cd to the new directory that this creates:

git clone https://github.com/ncbi/egapx.git
cd egapx

Since the dependencies for EGAPx are already installed, you don't need to create a new virtual environment; simply load the relevant software modules:

ml Nextflow/23.10.1
ml Python/3.11.3-GCCcore-12.3.0
ml PyYAML/6.0-GCCcore-12.3.0

Now you need to run one of the test examples provided by the developers in order to acquire the necessary config files:

python3 ui/egapx.py ./examples/input_D_farinae_small.yaml -o example_out 

Note that it isn't strictly necessary to include "python3" at the start of the command, so long as the 'egapx.py' script in the 'ui/' directory has executable permissions (which it does by default). Running the execution script the first time will put all of the config files into a new subdirectory in your working directory ('egapx/') called 'egapx_config'. Of these, the one to use is 'slurm.config', which you will need to open with your preferred text editor and make the following changes:

The line

cacheDir = "/data/$USER/singularity"

should be changed to

cacheDir = "$PWD/singularity"


while the section

env {
    SINGULARITY_CACHEDIR="/data/$USER/singularity"
    SINGULARITY_TMPDIR="/data/$USER/tmp"
}

should be changed to:

env {
    SINGULARITY_CACHEDIR="$PWD/singularity"
    SINGULARITY_TMPDIR="$PWD/tmp"
}

Finally, in the section called "process", only three lines need to be un-commented (remove the "//" at the start of the line) and set as follows:

queue = 'batch'
queueSize = 20

and

clusterOptions = ' --ntasks=1 '
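
The two path edits above can also be applied non-interactively. The following is a minimal sketch, assuming slurm.config contains the /data/$USER/... paths exactly as shown above; the function name patch_slurm_config is hypothetical and not part of EGAPx. It leaves the queue, queueSize and clusterOptions lines alone, so those still need to be edited by hand.

```shell
#!/bin/bash
# Hypothetical helper (not part of EGAPx): rewrite the Singularity paths
# in slurm.config from /data/$USER/... to $PWD/...
patch_slurm_config() {
    local config="$1"
    [ -f "$config" ] || { echo "No such file: $config" >&2; return 1; }
    # Keep the original as a backup, then write the edited copy in its place.
    cp -- "$config" "$config.bak" || return 1
    # Single quotes stop the shell from expanding $USER and $PWD, so the
    # literal text $PWD is written to the file, as in the edited lines above.
    sed -e 's|/data/\$USER/singularity|$PWD/singularity|g' \
        -e 's|/data/\$USER/tmp|$PWD/tmp|g' \
        "$config.bak" > "$config"
}
```

From the egapx directory, this could be called as patch_slurm_config egapx_config/slurm.config; the unmodified file is kept alongside it as slurm.config.bak.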

After this, you can create a job submission script that looks like this:

#!/bin/bash
#SBATCH --partition=batch
#SBATCH --job-name=ExampleJobName
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G
#SBATCH --time=48:00:00
#SBATCH --output=%x.%j.out
#SBATCH --error=%x.%j.err

ml purge
ml Nextflow/23.10.1
ml Python/3.11.3-GCCcore-12.3.0
ml PyYAML/6.0-GCCcore-12.3.0

python3 PATH/to/egapx/ui/egapx.py PATH/to/input_and_database_options.yaml -e slurm -o PATH/to/Output_Directory

In the real submission script, replace all of the placeholder values above (the job name and the PATH/to/... paths) with the actual values. Please note that, while it is perfectly fine to keep EGAPx itself in your /home directory, the input and output directories (i.e., where files will be read from and written to) should NOT be located in your /home directory; please use another location, such as your /scratch directory, for input (if using local data) and output files.
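
Because the output location is easy to get wrong, a check like the one below could be placed near the top of the submission script. This is a sketch only: is_outside_home is a hypothetical helper, and it assumes GNU realpath (for the -m option, which resolves paths that do not exist yet) is available on the node.

```shell
#!/bin/bash
# Hypothetical helper: return 0 if the given path resolves to a location
# outside $HOME, 1 if it is $HOME or below it, 2 if it cannot be resolved.
is_outside_home() {
    local target home
    target=$(realpath -m -- "$1") || return 2
    home=$(realpath -m -- "$HOME") || return 2
    # The trailing slash makes $HOME itself match, and keeps a sibling
    # directory that merely shares the prefix from matching.
    case "$target/" in
        "$home"/*) echo "Path is inside /home: $target" >&2; return 1 ;;
        *) return 0 ;;
    esac
}
```

For example, adding is_outside_home "PATH/to/Output_Directory" || exit 1 before the python3 line would stop the job early if the output path resolves to somewhere under your /home directory.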

Example input YAML file:

genome: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/020/809/275/GCF_020809275.1_ASM2080927v1/GCF_020809275.1_ASM2080927v1_genomic.fna.gz
taxid: 6954
reads:
  - https://ftp.ncbi.nlm.nih.gov/genomes/TOOLS/EGAP/data/Dermatophagoides_farinae_small/SRR8506572.1
  - https://ftp.ncbi.nlm.nih.gov/genomes/TOOLS/EGAP/data/Dermatophagoides_farinae_small/SRR8506572.2
  - https://ftp.ncbi.nlm.nih.gov/genomes/TOOLS/EGAP/data/Dermatophagoides_farinae_small/SRR9005248.1
  - https://ftp.ncbi.nlm.nih.gov/genomes/TOOLS/EGAP/data/Dermatophagoides_farinae_small/SRR9005248.2

(Please see https://github.com/ncbi/egapx?tab=readme-ov-file#input-data-format for more information on the correct formatting for input data, and https://github.com/ncbi/egapx?tab=readme-ov-file#input-example for more examples.)
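
Before submitting, it can help to confirm that the input file at least names the expected keys. The sketch below is a plain text check with grep, not a YAML parser; check_input_yaml is a hypothetical helper, and treating genome, taxid and reads as the keys to look for is an assumption based on the example above (the EGAPx README describes what is actually required).

```shell
#!/bin/bash
# Hypothetical helper: report which of the keys used in the example
# input file are missing at the top level of a YAML file.
check_input_yaml() {
    local yaml="$1" key missing=0
    [ -f "$yaml" ] || { echo "No such file: $yaml" >&2; return 1; }
    for key in genome taxid reads; do
        # Top-level keys start at the beginning of a line.
        if ! grep -Eq "^${key}:" "$yaml"; then
            echo "Missing top-level key: $key" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

A call such as check_input_yaml PATH/to/input_and_database_options.yaml returns a non-zero status and lists the missing keys on standard error if any of them are absent.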

Documentation

Details and references are at: https://github.com/ncbi/egapx

Full help options

$ ui/egapx.py -h
usage: egapx.py [-h] [-o OUTPUT] [-e EXECUTOR] [-c CONFIG_DIR] [-w WORKDIR] [-r REPORT] [-n] [-st] [-so] [-lo] [-ot ORTHO_TAXID] [-dl] [-lc LOCAL_CACHE] [-q] [-v] [-V] [-fn FUNC_NAME] [filename]

Main script for EGAPx

options:
  -h, --help            show this help message and exit
  -e EXECUTOR, --executor EXECUTOR
                        Nextflow executor, one of docker, singularity, aws, or local (for NCBI internal use only). Uses corresponding Nextflow config file
  -c CONFIG_DIR, --config-dir CONFIG_DIR
                        Directory for executor config files, default is ./egapx_config. Can be also set as env EGAPX_CONFIG_DIR
  -w WORKDIR, --workdir WORKDIR
                        Working directory for cloud executor
  -r REPORT, --report REPORT
                        Report file prefix for report (.report.html) and timeline (.timeline.html) files, default is in output directory
  -n, --dry-run
  -st, --stub-run
  -so, --summary-only   Print result statistics only if available, do not compute result
  -lo, --logs-only      Collect execution logs if available, put them in output directory, do not compute result
  -ot ORTHO_TAXID, --ortho-taxid ORTHO_TAXID
                        Taxid of reference data for orthology tasks
  -lc LOCAL_CACHE, --local-cache LOCAL_CACHE
                        Where to store the downloaded files
  -q, --quiet
  -v, --verbose
  -V, --version         Report software version
  -fn FUNC_NAME, --func_name FUNC_NAME
                        func_name

run:
  filename              YAML file with input: section with at least genome: and reads: parameters
  -o OUTPUT, --output OUTPUT
                        Output path

download:
  -dl, --download-only  Download external files to local storage, so that future runs can be isolated
