EGAPx-Sapelo2
Category
Bioinformatics
Program On
N/A
Version
0.3.2-alpha
Author / Distributor
Please see https://github.com/ncbi/egapx
Description
From https://github.com/ncbi/egapx: "EGAPx is the publicly accessible version of the updated NCBI Eukaryotic Genome Annotation Pipeline."
Running Program
Also refer to Running Jobs on Sapelo2
As of 05/20/2025, because of how EGAPx currently uses Nextflow, installing it centrally risks overwriting important files (this may change once the software is out of alpha/beta). This page therefore outlines the steps needed to run EGAPx from your /home directory.
First, start an interactive job and, in your /home directory, clone the repository and then cd to the new directory that this creates:
git clone https://github.com/ncbi/egapx.git
cd egapx
Since the dependencies for EGAPx are already installed, you don't need to create a new virtual environment; simply load the relevant software modules:
ml Nextflow/23.10.1
ml Python/3.11.3-GCCcore-12.3.0
ml PyYAML/6.0-GCCcore-12.3.0
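Before running anything, it can help to confirm that the modules actually put their tools on your PATH. The following is a minimal sketch, not part of EGAPx: the helper name missing_cmds is hypothetical, and the executable names nextflow and python3 are assumed to be what the Nextflow and Python modules provide.

```shell
#!/bin/bash
# missing_cmds NAME...: print each command that is not found on PATH.
# Returns 0 if every command is present, 1 otherwise.
missing_cmds() {
    local status=0 cmd
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing: $cmd"
            status=1
        fi
    done
    return "$status"
}

# Example (assumed executable names; adjust if your modules differ):
#   missing_cmds nextflow python3 && python3 -c 'import yaml'
```

The python3 -c 'import yaml' part of the example checks that the PyYAML module is importable, since PyYAML is a library rather than a command.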
Now you need to run one of the test examples provided by the developers in order to acquire the necessary config files:
python3 ui/egapx.py ./examples/input_D_farinae_small.yaml -o example_out
Note that it isn't strictly necessary to include "python3" at the start of the command, so long as the 'egapx.py' script in the 'ui/' directory has executable permissions (which it does by default). Running the execution script the first time will put all of the config files into a new subdirectory in your working directory ('egapx/') called 'egapx_config'. Of these, the one to use is 'slurm.config', which you will need to open with your preferred text editor and make the following changes:
The line
cacheDir = "/data/$USER/singularity"
should be changed to
cacheDir = "$PWD/singularity"
while the section
env {
    SINGULARITY_CACHEDIR="/data/$USER/singularity"
    SINGULARITY_TMPDIR="/data/$USER/tmp"
}
should be changed to:
env {
    SINGULARITY_CACHEDIR="$PWD/singularity"
    SINGULARITY_TMPDIR="$PWD/tmp"
}
Finally, in the section called "process," there are only three things that need to be un-commented (remove the "//" at the start of the line) and changed as follows:
queue = 'batch'
queueSize = 20
and
clusterOptions = ' --ntasks=1 '
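The path changes described above (cacheDir and the two env entries) all amount to replacing the literal text /data/$USER/ with $PWD/, so they can also be applied with sed instead of a text editor. This is only a sketch: the function name repoint_singularity_paths is hypothetical, and it assumes the generated slurm.config contains the paths exactly as quoted on this page. It does not touch the "process" section, which still needs to be edited by hand.

```shell
#!/bin/bash
# repoint_singularity_paths FILE: replace the literal text /data/$USER/
# with $PWD/ throughout FILE, keeping the original as FILE.bak.
repoint_singularity_paths() {
    local cfg=$1
    cp -- "$cfg" "$cfg.bak" || return 1
    # Single quotes keep $USER and $PWD literal: the config file stores
    # the variable names, not their expanded values.
    sed 's|/data/\$USER/|$PWD/|g' "$cfg.bak" > "$cfg"
}

# Example, run from the egapx/ directory:
#   repoint_singularity_paths egapx_config/slurm.config
```

Afterwards, comparing slurm.config with slurm.config.bak (for example with diff) shows exactly which lines were changed.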
After this, you can create a job submission script that looks like this:
#!/bin/bash
#SBATCH --partition=batch
#SBATCH --job-name=ExampleJobName
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G
#SBATCH --time=48:00:00
#SBATCH --output=%x.%j.out
#SBATCH --error=%x.%j.err
ml purge
ml Nextflow/23.10.1
ml Python/3.11.3-GCCcore-12.3.0
ml PyYAML/6.0-GCCcore-12.3.0
python3 PATH/to/egapx/ui/egapx.py PATH/to/input_and_database_options.yaml -e slurm -o PATH/to/Output_Directory
In your real submission script, replace the placeholder values above (for example, the job name and each path beginning with PATH/to/) with actual values. Please note that, while it is perfectly fine to run EGAPx from your /home directory, the input and output directories (i.e., where files will be read from and written to) should NOT be located in your /home directory; please use another location, such as your /scratch directory, for input (if using local data) and output files.
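To guard against accidentally pointing the output (or local input) at /home, a submission script could include a small check like the one sketched here. The function name is_under_home is hypothetical, and the check is purely textual: it does not resolve symlinks, so treat it as a convenience rather than a guarantee.

```shell
#!/bin/bash
# is_under_home PATH: return 0 when PATH is $HOME or lies inside it,
# 1 otherwise. Relative paths are taken relative to the current directory.
is_under_home() {
    local path=$1 home=${HOME%/}
    [ -n "$home" ] || return 1
    # Make relative paths absolute against the current directory.
    case "$path" in
        /*) ;;
        *) path=$PWD/$path ;;
    esac
    case "$path" in
        "$home"|"$home"/*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example: stop before launching EGAPx if the output path is in /home.
#   if is_under_home PATH/to/Output_Directory; then
#       echo "Output directory must not be in /home" >&2
#       exit 1
#   fi
```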
Example input YAML file:
genome: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/020/809/275/GCF_020809275.1_ASM2080927v1/GCF_020809275.1_ASM2080927v1_genomic.fna.gz
taxid: 6954
reads:
  - https://ftp.ncbi.nlm.nih.gov/genomes/TOOLS/EGAP/data/Dermatophagoides_farinae_small/SRR8506572.1
  - https://ftp.ncbi.nlm.nih.gov/genomes/TOOLS/EGAP/data/Dermatophagoides_farinae_small/SRR8506572.2
  - https://ftp.ncbi.nlm.nih.gov/genomes/TOOLS/EGAP/data/Dermatophagoides_farinae_small/SRR9005248.1
  - https://ftp.ncbi.nlm.nih.gov/genomes/TOOLS/EGAP/data/Dermatophagoides_farinae_small/SRR9005248.2
(Please see https://github.com/ncbi/egapx?tab=readme-ov-file#input-data-format for more information on the correct formatting for input data, and https://github.com/ncbi/egapx?tab=readme-ov-file#input-example for more examples.)
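The egapx.py help text describes the input as a YAML file "with at least genome: and reads: parameters". A quick way to catch a missing key before submitting a long job is sketched here. The function name has_required_keys is hypothetical, and the check is a plain grep for top-level keys, not a real YAML parser, so it will not catch indentation or syntax errors.

```shell
#!/bin/bash
# has_required_keys FILE: succeed when FILE contains top-level
# "genome:" and "reads:" keys; otherwise report the first missing key.
has_required_keys() {
    local file=$1 key
    for key in genome reads; do
        if ! grep -Eq "^${key}:" "$file"; then
            echo "missing top-level key: $key" >&2
            return 1
        fi
    done
    return 0
}

# Example:
#   has_required_keys PATH/to/input_and_database_options.yaml
```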
Documentation
Details and references are at: https://github.com/ncbi/egapx
Full help options
$ ui/egapx.py -h
usage: egapx.py [-h] [-o OUTPUT] [-e EXECUTOR] [-c CONFIG_DIR] [-w WORKDIR]
                [-r REPORT] [-n] [-st] [-so] [-lo] [-ot ORTHO_TAXID] [-dl]
                [-lc LOCAL_CACHE] [-q] [-v] [-V] [-fn FUNC_NAME]
                [filename]

Main script for EGAPx

options:
  -h, --help            show this help message and exit
  -e EXECUTOR, --executor EXECUTOR
                        Nextflow executor, one of docker, singularity, aws,
                        or local (for NCBI internal use only). Uses
                        corresponding Nextflow config file
  -c CONFIG_DIR, --config-dir CONFIG_DIR
                        Directory for executor config files, default is
                        ./egapx_config. Can be also set as env
                        EGAPX_CONFIG_DIR
  -w WORKDIR, --workdir WORKDIR
                        Working directory for cloud executor
  -r REPORT, --report REPORT
                        Report file prefix for report (.report.html) and
                        timeline (.timeline.html) files, default is in output
                        directory
  -n, --dry-run
  -st, --stub-run
  -so, --summary-only   Print result statistics only if available, do not
                        compute result
  -lo, --logs-only      Collect execution logs if available, put them in
                        output directory, do not compute result
  -ot ORTHO_TAXID, --ortho-taxid ORTHO_TAXID
                        Taxid of reference data for orthology tasks
  -lc LOCAL_CACHE, --local-cache LOCAL_CACHE
                        Where to store the downloaded files
  -q, --quiet
  -v, --verbose
  -V, --version         Report software version
  -fn FUNC_NAME, --func_name FUNC_NAME
                        func_name

run:
  filename              YAML file with input: section with at least genome:
                        and reads: parameters
  -o OUTPUT, --output OUTPUT
                        Output path

download:
  -dl, --download-only  Download external files to local storage, so that
                        future runs can be isolated