Flankophile-Sapelo2: Difference between revisions
(Updated the steps for using the pre-built pipeline so that it explicitly instructs users to enter into an interactive session before loading the Flankophile module) |
m (Contents of "Description" also changed to sub-heading 1. Corrected back to paragraph) |
||
(6 intermediate revisions by the same user not shown) | |||
Line 3: | Line 3: | ||
=== Program On === | === Program On === | ||
[https://bitbucket.org/genomicepidemiology/flankophile/src/master/ Bitbucket] | |||
=== Version === | === Version === | ||
Line 11: | Line 11: | ||
[https://bitbucket.org/genomicepidemiology/ Genomic Epidemiology] | [https://bitbucket.org/genomicepidemiology/ Genomic Epidemiology] | ||
Description | === Description === | ||
"''Flankophile is a bioinformatics pipeline for gene synteny analysis.''" [https://bitbucket.org/genomicepidemiology/flankophile/src/master/ Bitbucket] | "''Flankophile is a bioinformatics pipeline for gene synteny analysis.''" [https://bitbucket.org/genomicepidemiology/flankophile/src/master/ Bitbucket] | ||
=== Running Program === | === Running Program === | ||
Please also refer to [[Running Jobs on Sapelo2]]. | Please also refer to [[Running Jobs on Sapelo2]]. | ||
==== Requirements ==== | ==== Requirements ==== | ||
Line 29: | Line 25: | ||
## One or more FASTA files that ''Flankophile'' should use as input. | ## One or more FASTA files that ''Flankophile'' should use as input. | ||
# '''[[Flankophile-Sapelo2#Example .tsv Metadata File|A tab-separated value (''.tsv'') metadata file]]''' | # '''[[Flankophile-Sapelo2#Example .tsv Metadata File|A tab-separated value (''.tsv'') metadata file]]''' | ||
## Describes the assembly | ## Describes the assembly names, their filepaths, and the genus for each input FASTA file. | ||
# '''[[Flankophile-Sapelo2#Example .yaml Configuration File|An analysis configuration file in YAML (''.yaml'')]]''' | # '''[[Flankophile-Sapelo2#Example .yaml Configuration File|An analysis configuration file in YAML (''.yaml'')]]''' | ||
## Describes the analysis for ''Flankophile'' to perform and the parameters to use during analysis. | ## Describes the analysis for ''Flankophile'' to perform and the parameters to use during analysis. | ||
## Should be located at the root of the ''Flankophile'' repository directory. | ## Should be located at the root of the ''Flankophile'' repository directory. | ||
==== | ==== Setup ==== | ||
The steps to set up ''Flankophile'' are as follows: | |||
The steps to | |||
# '''Enter into an interactive session''' The installation of ''Flankophile'' involves an amount of resource utilization that should be performed by a compute node and not a submit node. | # '''Enter into an interactive session''' The installation of ''Flankophile'' involves an amount of resource utilization that should be performed by a compute node and not a submit node. | ||
Line 60: | Line 42: | ||
# '''Change the working directory to the newly cloned ''flankophile/'' repository''' | # '''Change the working directory to the newly cloned ''flankophile/'' repository''' | ||
## <code>cd flankophile</code> | ## <code>cd flankophile</code> | ||
# '''Load the Snakemake module.''' | |||
## <code>module load snakemake/6.9.1-Mamba-4.11.0-4</code> | |||
# '''Create the conda environments that ''Flankophile'' will use at runtime''' In some tests, creating the conda environments at runtime causes the pipeline to fail due to missing or incompatible dependencies. To avoid this, create the conda environments ahead of time. '''<u>This step will take ~15-25 minutes to complete.</u>''' | # '''Create the conda environments that ''Flankophile'' will use at runtime''' In some tests, creating the conda environments at runtime causes the pipeline to fail due to missing or incompatible dependencies. To avoid this, create the conda environments ahead of time. '''<u>This step will take ~15-25 minutes to complete.</u>''' | ||
## <code>snakemake --use-conda --conda-create-envs-only --cores 16</code> | ## <code>snakemake --use-conda --conda-create-envs-only --cores 16</code> | ||
==== Example Files ==== | ==== Example Files ==== | ||
As listed above, ''Flankophile'' requires | As listed above, ''Flankophile'' requires <u>input files</u>, a <u>metadata file</u>, and a <u>config file</u> to run. | ||
===== Example Input Files ===== | ===== Example Input Files ===== | ||
Unfortunately, the ''Flankophile'' pipeline does not come with any input files to use as examples. | |||
===== Example ''.tsv'' Metadata File ===== | |||
The following text is an example of a metadata file's contents: | |||
====== <code>/scratch/$USER/flankophile/metadata.tsv</code> ====== | |||
<syntaxhighlight> | |||
#assembly_name path metadata | |||
assembly32 /scratch/$USER/e_coli-assembly32.fa Escherichia | |||
assembly25 /scratch/$USER/p_vulgaris-assembly25.fa Proteus | |||
</syntaxhighlight><u>When creating a</u> ''<code>metadata.tsv</code>'' <u>file for your own data, be sure to take note of its filepath</u><u>.</u> This will be listed after ''<code>input_list:</code>'' in the ''<code>config.yaml</code>'' file, shown below. | |||
===== Example ''. | ===== Example ''.yaml'' Configuration File ===== | ||
The following text is an example of a ''Flankophile'' configuration file's contents: | |||
< | ====== <code>/scratch/$USER/flankophile/config.yaml</code> ====== | ||
<syntaxhighlight lang="yaml"> | |||
# FLANKOPHILE version 0.2.10 | # FLANKOPHILE version 0.2.10 | ||
# Alix Vincent Thorn | # Alix Vincent Thorn | ||
Line 111: | Line 78: | ||
database: "input/example_input_files/ResFinder_08_02_2022.fa" | database: "input/example_input_files/ResFinder_08_02_2022.fa" | ||
input_list: " | input_list: "/scratch/$USER/flankophile/metadata.tsv" # Edit this! | ||
min_coverage_abricate: "98" # Minimum coverage in percentage compared to reference sequence. | min_coverage_abricate: "98" # Minimum coverage in percentage compared to reference sequence. | ||
Line 141: | Line 108: | ||
# 256 Cosine distance | # 256 Cosine distance | ||
# 4096 Chi-square distance | # 4096 Chi-square distance | ||
</syntaxhighlight> | </syntaxhighlight>This configuration file contains the path to the metadata file, which contains the paths to the input data. | ||
<u>When creating a ''<code>config.yaml</code>'' file for your own data, be sure to</u> <u>include the location of the</u> <code>''metadata.tsv''</code> <u>file you created.</u> | |||
<u>When | |||
==== Starting the Pipeline ==== | ==== Starting the Pipeline ==== | ||
Line 162: | Line 125: | ||
## <code>module load snakemake/6.9.1-Mamba-4.11.0-4</code> | ## <code>module load snakemake/6.9.1-Mamba-4.11.0-4</code> | ||
# '''Identify the name of the ''Conda'' environment to activate.''' | # '''Identify the name of the ''Conda'' environment to activate.''' | ||
## <code>ls .snakemake/conda/</code> | ##<code>ls .snakemake/conda/</code> | ||
## This will be a random series of numbers and letters. | |||
# '''Activate the ''Conda'' environment that ''Flankophile'' will use.''' | # '''Activate the ''Conda'' environment that ''Flankophile'' will use.''' | ||
## <code>source activate / | ##<code>source activate /scratch/''$USER''/flankophile/.snakemake/conda/'''50e62e607e6a24bb70ce9a5a50888445'''</code> | ||
##Make sure to adjust the command according to the name of the environment, identified previously. | |||
# '''Run the pipeline.''' | # '''Run the pipeline.''' | ||
## <code>snakemake --use-conda --cores 16</code> | ## <code>snakemake --use-conda --cores 16</code> | ||
Line 189: | Line 154: | ||
</syntaxhighlight>'''Please note that you need to identify the correct name of the ''Conda'' environment just as you would if running the pipeline from within an interactive session.''' | </syntaxhighlight>'''Please note that you need to identify the correct name of the ''Conda'' environment just as you would if running the pipeline from within an interactive session.''' | ||
Assuming <code>flankophile/</code> was | Assuming <code>flankophile/</code> was cloned into your user's <code>/scratch</code> directory, use the following steps to identify the name of the ''Conda'' environment: | ||
# '''Identify the name of the ''Conda'' environment to activate.''' | # '''Identify the name of the ''Conda'' environment to activate.''' | ||
Line 195: | Line 160: | ||
Make sure to adjust the job submission script so that the correct ''Conda'' environment is loaded. | Make sure to adjust the job submission script so that the correct ''Conda'' environment is loaded. | ||
=== Running Pipeline Using ''/lscratch'' === | |||
To learn more about ''lscratch'' usage, please refer to the sections from our wiki, ''[[Running Jobs on Sapelo2#How to run a job using the local scratch .2Flscratch on a compute node|How to run a job using the local scratch /lscratch on a compute node]]'' and ''[[Disk Storage#lscratch file system|lscratch file system]]''. | |||
Below is an outline describing how ''Flankophile'' can be run using a compute node's ''/lscratch'' space. It assumes the user has a directory named ''<code>/home/$USER/flankophile/</code>'' that contains <code>''config.yaml''</code>, <code>''metadata.tsv''</code>, and the input files. Keeping in line with the previous examples, the names of the input files would then be ''<code>p_vulgaris-assembly25.fa</code>'' and <code>''e_coli-assembly32.fa''</code>. The ''<code>flankophile/</code>'' directory in the user's home would contain these input files as well as the job submission script, ''<code>flankophile-sub.sh</code>''. | |||
==== List of Files for ''/lscratch'' Usage: ==== | |||
The following is a list of the files and their expected locations for this guide: | |||
# ''<code>/home/$USER/flankophile/config.yaml</code>'' | |||
# ''<code>/home/$USER/flankophile/metadata.tsv</code>'' | |||
# ''<code>/home/$USER/flankophile/p_vulgaris-assembly25.fa</code>'' | |||
# ''<code>/home/$USER/flankophile/e_coli-assembly32.fa</code>'' | |||
# ''<code>/home/$USER/flankophile/flankophile-sub.sh</code>'' | |||
==== <code>/home/$USER/flankophile/config.yaml</code> ==== | |||
This configuration file differs from the above [[Flankophile-Sapelo2#Example .yaml Configuration File|Example ''.yaml'' Configuration File]] by the specified path to the ''<code>metadata.tsv</code>'' file. Instead of listing the path to the ''<code>metadata.tsv</code>'' file as a subdirectory of the user's ''/scratch'' directory, this configuration file should list the path to the ''<code>metadata.tsv</code>'' file as a subdirectory of the user's ''/home'' directory. It is acceptable to store the ''<code>metadata.tsv</code>'' and input files in the ''/home'' directory because ''Flankophile'' will only '''read''' from these files during its execution; any heavy '''writing''' will occur in the ''/lscratch'' directory where the pipeline and its dependencies will be installed (see ''[[Disk Storage#Home file system|Home_file_system]]'' for further information on proper usage of ''/home'' storage). | |||
Make sure to replace any instances of ''<code>$USER</code>'' with your actual username:<syntaxhighlight lang="yaml"> | |||
# FLANKOPHILE version 0.2.10 | |||
# Alix Vincent Thorn | |||
## 1 ####################################################################################### | |||
database: "input/example_input_files/ResFinder_08_02_2022.fa" | |||
input_list: "/home/$USER/flankophile/metadata.tsv" # Edit this! | |||
min_coverage_abricate: "98" # Minimum coverage in percentage compared to reference sequence. | |||
min_identity_abricate: "98" # Minimum percentage identity compared to reference sequence. | |||
## 2 ###################################################################################### | |||
flank_length_upstreams: "1500" | |||
flank_length_downstreams: "1500" | |||
## 3 ####################################################################################### | |||
cluster_identity_and_length_diff: "0.98" # CD-HIT parameters = c and s. 0.98 = cluster at 98 % identity and a maximum length difference of 98 % | |||
## 4 ###################################################################################### | |||
k-mer_size: "16" # Kmersize used by kma index. Try with "16" if in doubt. | |||
distance_measure: "1" # Choose measurement for making distance matrix | |||
# Distance calculation methods: | |||
# | |||
# 1 k-mer hamming distance | |||
# 64 Jaccard distance | |||
# 256 Cosine distance | |||
# 4096 Chi-square distance | |||
</syntaxhighlight> | |||
==== <code>/home/$USER/flankophile/metadata.tsv</code> ==== | |||
Similar to the ''<code>config.yaml</code>'' file, this ''<code>metadata.tsv</code>'' file differs from the above [[Flankophile-Sapelo2#Example .tsv Metadata File|Example ''.tsv'' Metadata File]] by the specified paths it contains. Note also that the paths it contains are in a subdirectory of the ''/home'' directory, ''<code>flankophile/</code>'', instead of in the ''/home'' directory directly. | |||
Make sure to replace any instances of ''<code>$USER</code>'' with your actual username, and to specify the paths to your actual data:<syntaxhighlight> | |||
#assembly_name path metadata | |||
assembly32 /home/$USER/flankophile/e_coli-assembly32.fa Escherichia | |||
assembly25 /home/$USER/flankophile/p_vulgaris-assembly25.fa Proteus | |||
</syntaxhighlight> | |||
==== <code>/home/$USER/flankophile/flankophile-sub.sh</code> ==== | |||
The major difference between running ''Flankophile'' using the ''/lscratch'' space and running ''Flankophile'' elsewhere is that the working directory inside ''/lscratch'' must be created before execution and removed after execution. These steps are handled by the following job submission script:<syntaxhighlight lang="bash"> | |||
#!/usr/bin/env bash | |||
#SBATCH --job-name=Flankophile | |||
#SBATCH --partition=batch | |||
#SBATCH --nodes=1 | |||
#SBATCH --gres=lscratch:30 | |||
#SBATCH --ntasks=1 | |||
#SBATCH --cpus-per-task=16 | |||
#SBATCH --mem=32G | |||
#SBATCH --time=05:00:00 | |||
#SBATCH --output=log.%j.out | |||
#SBATCH --error=log.%j.err | |||
cd $SLURM_SUBMIT_DIR | |||
# Step 1 - Create working directory in /lscratch | |||
mkdir -p /lscratch/${USER}/${SLURM_JOB_ID} | |||
# Step 2 - cd to /lscratch working directory | |||
cd /lscratch/${USER}/${SLURM_JOB_ID} | |||
# Step 3 - Clone and cd into pipeline | |||
git clone https://bitbucket.org/genomicepidemiology/flankophile/src/34f286e00088d12019dc30cb2c38c103f0efe506/ flankophile | |||
cd flankophile | |||
# Step 4 - Load snakemake then build required environments | |||
module load snakemake/6.9.1-Mamba-4.11.0-4 | |||
snakemake --use-conda --conda-create-envs-only --cores $SLURM_CPUS_PER_TASK | |||
# Step 5 - Copy config.yaml into /lscratch working directory | |||
cp ${SLURM_SUBMIT_DIR}/config.yaml ./ | |||
# Step 6 - Activate newly created environment | |||
source activate $(ls -d .snakemake/conda/*/) | |||
# Step 7 - Run the pipeline | |||
snakemake --use-conda --cores $SLURM_CPUS_PER_TASK | |||
# Step 8 - Copy the output directory into submit directory | |||
cp -r output/ ${SLURM_SUBMIT_DIR}/ | |||
# Step 9 - Clean up by removing the /lscratch working directory | |||
rm -rf /lscratch/${USER}/${SLURM_JOB_ID} | |||
</syntaxhighlight> | |||
===== Step-by-Step Breakdown of ''<code>flankophile-sub.sh</code>'' ===== | |||
# '''<code>Step 1 - Create working directory in /lscratch</code>'''<syntaxhighlight lang="bash"> | |||
mkdir -p /lscratch/${USER}/${SLURM_JOB_ID} | |||
</syntaxhighlight> | |||
## The above command creates a directory for the job to use. | |||
# <code>'''Step 2 - cd to /lscratch working directory'''</code><syntaxhighlight lang="bash"> | |||
cd /lscratch/${USER}/${SLURM_JOB_ID} | |||
</syntaxhighlight> | |||
## The above command changes the current working directory to the directory created in the previous step. | |||
# '''<code>Step 3 - Clone and cd into pipeline</code>'''<syntaxhighlight lang="bash"> | |||
git clone https://bitbucket.org/genomicepidemiology/flankophile/src/34f286e00088d12019dc30cb2c38c103f0efe506/ flankophile | |||
cd flankophile | |||
</syntaxhighlight> | |||
## The first of the above commands clones the ''Flankophile'' repository into a directory named ''<code>flankophile/</code>''. | |||
### The full path to the job's current directory would be <code>''/lscratch/${USER}/${SLURM_JOB_ID}/flankophile''</code>. | |||
## The second of the above commands changes the current working directory into the repository. | |||
# '''<code>Step 4 - Load snakemake then build the required environments</code>'''<syntaxhighlight lang="bash"> | |||
module load snakemake/6.9.1-Mamba-4.11.0-4 | |||
snakemake --use-conda --conda-create-envs-only --cores $SLURM_CPUS_PER_TASK | |||
</syntaxhighlight> | |||
## The first of the above commands loads the correct version of ''Snakemake'' that is required by ''Flankophile''. | |||
## The second of the above commands creates the ''Conda'' environments that ''Flankophile'' will use at runtime. | |||
### The ''<code>$SLURM_CPUS_PER_TASK</code>'' variable is inherited from the job's environment and is equal to the value set for the ''<code>--cpus-per-task</code>'' option in the job submission script's header. | |||
# '''<code>Step 5 - Copy config.yaml into /lscratch working directory</code>'''<syntaxhighlight lang="bash"> | |||
cp ${SLURM_SUBMIT_DIR}/config.yaml ./ | |||
</syntaxhighlight> | |||
## The above command copies the ''<code>config.yaml</code>'' file into the job's current working directory, ''<code>/lscratch/${USER}/${SLURM_JOB_ID}/flankophile</code>''. | |||
## Because the command expects the ''<code>config.yaml</code>'' file to exist in the directory from which the job was submitted, <code>''No such file or directory''</code> errors can occur if the job is submitted from elsewhere. | |||
### In other words, '''<u>make sure to <code>cd</code> into</u> <code>''/home/$USER/flankophile''</code> <u>before executing</u> <code>''sbatch flankophile-sub.sh''</code><u>.</u>''' | |||
# '''<code>Step 6 - Activate newly created environment</code>'''<syntaxhighlight lang="bash"> | |||
source activate $(ls -d .snakemake/conda/*/) | |||
</syntaxhighlight> | |||
## The above command activates the ''Conda'' environment created in '''<code>Step 4</code>'''. | |||
### Because the ''Flankophile'' pipeline is created at the beginning of the job, and then deleted at the end, the subcommand, ''<code>$(ls -d .snakemake/conda/*/)</code>'', will return the correct name of the ''Conda'' environment. | |||
# '''<code>Step 7 - Run the pipeline</code>'''<syntaxhighlight lang="bash"> | |||
snakemake --use-conda --cores $SLURM_CPUS_PER_TASK | |||
</syntaxhighlight> | |||
## The above command starts the ''Flankophile'' pipeline while specifying the number of cores to use with the ''<code>$SLURM_CPUS_PER_TASK</code>'' environment variable. | |||
# '''<code>Step 8 - Copy the output directory into submit directory</code>'''<syntaxhighlight lang="bash"> | |||
cp -r output/ ${SLURM_SUBMIT_DIR}/ | |||
</syntaxhighlight> | |||
## The above command copies the ''<code>output/</code>'' directory and its contents recursively into the directory from which the job was submitted. | |||
### This would be equal to <code>''cp -r /lscratch/${USER}/${SLURM_JOB_ID}/flankophile/output/ /home/$USER/flankophile''</code>. | |||
# '''<code>Step 9 - Clean up by removing the /lscratch working directory</code>'''<syntaxhighlight lang="bash"> | |||
rm -rf /lscratch/${USER}/${SLURM_JOB_ID} | |||
</syntaxhighlight> | |||
## The above command prevents files no longer in use from consuming space on ''/lscratch'' by explicitly removing the job's working directory as a last step. | |||
==== Submitting the ''/lscratch'' Job ==== | |||
Be sure to <code>cd</code> into the ''<code>flankophile/</code>'' directory prior to submitting the job. | |||
# <syntaxhighlight lang="bash"> | |||
cd /home/$USER/flankophile | |||
</syntaxhighlight> | |||
# <syntaxhighlight lang="bash"> | |||
sbatch flankophile-sub.sh | |||
</syntaxhighlight> | |||
=== Installation === | === Installation === | ||
* Version 0.2.10: | * Version 0.2.10: Install in your user's scratch directory. | ||
=== System === | === System === | ||
* 64-bit Linux | * 64-bit Linux |
Latest revision as of 10:11, 18 March 2024
Category
Pipeline
Program On
Version
Author / Distributor
Description
"Flankophile is a bioinformatics pipeline for gene synteny analysis." Bitbucket
Running Program
Please also refer to Running Jobs on Sapelo2.
Requirements
Flankophile requires the following:
- The Flankophile pipeline
- Contains the pipeline's files and all required conda environments.
- Input sequence data in FASTA (.fa) format
- One or more FASTA files that Flankophile should use as input.
- A tab-separated value (.tsv) metadata file
- Describes the assembly names, their filepaths, and the genus for each input FASTA file.
- An analysis configuration file in YAML (.yaml)
- Describes the analysis for Flankophile to perform and the parameters to use during analysis.
- Should be located at the root of the Flankophile repository directory.
Setup
The steps to set up Flankophile are as follows:
- Enter into an interactive session. Installing Flankophile uses enough resources that it should be performed on a compute node, not a submit node.
interact --cpus-per-task 16 --mem 32GB
- Change the working directory to your user's /scratch directory. Like other pipelines, the Flankophile installation directory should be in a location suited for large amounts of sustained I/O. Do not install Flankophile in the /home directory; the filesystem that hosts /home is not suited for this workload.
cd /scratch/$USER
- Clone the Flankophile repository at the desired commit. The versions of Flankophile are not labeled with tags in the project's Bitbucket page. Instead, the version is stated in the commit message; this guide will use the commit corresponding to version 0.2.10 (34f286e).
git clone https://bitbucket.org/genomicepidemiology/flankophile/src/34f286e00088d12019dc30cb2c38c103f0efe506/ flankophile
- The ...0efe506/ flankophile at the end instructs git to clone the repository into the directory flankophile.
- Change the working directory to the newly cloned flankophile/ repository
cd flankophile
- Load the Snakemake module.
module load snakemake/6.9.1-Mamba-4.11.0-4
- Create the conda environments that Flankophile will use at runtime. In some tests, creating the conda environments at runtime caused the pipeline to fail due to missing or incompatible dependencies. To avoid this, create the conda environments ahead of time. This step will take ~15-25 minutes to complete.
snakemake --use-conda --conda-create-envs-only --cores 16
Example Files
As listed above, Flankophile requires input files, a metadata file, and a config file to run.
Example Input Files
Unfortunately, the Flankophile pipeline does not come with any input files to use as examples.
Example .tsv Metadata File
The following text is an example of a metadata file's contents:
/scratch/$USER/flankophile/metadata.tsv
#assembly_name path metadata
assembly32 /scratch/$USER/e_coli-assembly32.fa Escherichia
assembly25 /scratch/$USER/p_vulgaris-assembly25.fa Proteus
When creating a metadata.tsv file for your own data, be sure to take note of its filepath. This will be listed after input_list: in the config.yaml file, shown below.
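Because the pipeline reads its input paths from the metadata file, a quick check that every listed path exists can catch typos before a run. The following is a minimal sketch and is not part of Flankophile: `check_metadata` is a hypothetical helper name, and it assumes the three tab-separated columns shown in the example above. A literal `$USER` inside the file is not expanded by this check, so such a path would be reported as missing.

```shell
# check_metadata FILE
# Hypothetical helper (not provided by Flankophile).
# Returns 0 if every path listed in FILE exists, 1 otherwise.
check_metadata() {
    local metadata_file missing tab name path genus
    metadata_file="$1"
    missing=0
    tab="$(printf '\t')"
    # Assumed columns: assembly_name <TAB> path <TAB> metadata (genus)
    while IFS="$tab" read -r name path genus || [ -n "$name" ]; do
        # Skip blank lines and header/comment lines starting with '#'
        case "$name" in
            ''|'#'*) continue ;;
        esac
        if [ ! -f "$path" ]; then
            echo "Missing input file for ${name}: ${path}" >&2
            missing=1
        fi
    done < "$metadata_file"
    [ "$missing" -eq 0 ]
}
```

After defining the function, it could be called as `check_metadata /scratch/$USER/flankophile/metadata.tsv`; any missing paths are printed to standard error.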
Example .yaml Configuration File
The following text is an example of a Flankophile configuration file's contents:
/scratch/$USER/flankophile/config.yaml
# FLANKOPHILE version 0.2.10
# Alix Vincent Thorn
## 1 #######################################################################################
database: "input/example_input_files/ResFinder_08_02_2022.fa"
input_list: "/scratch/$USER/flankophile/metadata.tsv" # Edit this!
min_coverage_abricate: "98" # Minimum coverage in percentage compared to reference sequence.
min_identity_abricate: "98" # Minimum percentage identity compared to reference sequence.
## 2 ######################################################################################
flank_length_upstreams: "1500"
flank_length_downstreams: "1500"
## 3 #######################################################################################
cluster_identity_and_length_diff: "0.98" # CD-HIT parameters = c and s. 0.98 = cluster at 98 % identity and a maximum length difference of 98 %
## 4 ######################################################################################
k-mer_size: "16" # Kmersize used by kma index. Try with "16" if in doubt.
distance_measure: "1" # Choose measurement for making distance matrix
# Distance calculation methods:
#
# 1 k-mer hamming distance
# 64 Jaccard distance
# 256 Cosine distance
# 4096 Chi-square distance
This configuration file contains the path to the metadata file, which contains the paths to the input data.
When creating a config.yaml file for your own data, be sure to include the location of the metadata.tsv file you created.
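To confirm that the configuration file points at the intended metadata file, the `input_list` value can be read back from the shell. This is a small sketch under stated assumptions: `get_input_list` is a hypothetical helper name, and it assumes the value is written on one line in double quotes, as in the example above. It is a plain text match, not a full YAML parser.

```shell
# get_input_list CONFIG
# Hypothetical helper (not provided by Flankophile).
# Prints the double-quoted value of the first "input_list:" line in CONFIG.
get_input_list() {
    local config_file
    config_file="$1"
    # Keep only the text between the double quotes; trailing comments are dropped
    sed -n 's/^input_list:[[:space:]]*"\([^"]*\)".*/\1/p' "$config_file" | head -n 1
}

# Example use (commented out; run from the flankophile/ directory):
# metadata_path=$(get_input_list config.yaml)
# [ -f "$metadata_path" ] || echo "input_list points to a missing file: $metadata_path" >&2
```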
Starting the Pipeline
Once the required files are in place, using either the above examples or your own inputs, the pipeline can be started in either an interactive session or within a job submission script.
In an Interactive Session
The procedure to start the Flankophile pipeline in an interactive session is as follows:
- Enter into an interactive session if not already inside one.
interact --cpus-per-task 16 --mem 32GB
- Change the working directory to the location of the flankophile/ directory. (The following command assumes it is in the base of your /scratch directory.)
cd /scratch/$USER/flankophile/
- Load the snakemake module into your environment. (Flankophile requires snakemake 6.9.1 with Miniconda/Mamba 4.11.0)
module load snakemake/6.9.1-Mamba-4.11.0-4
- Identify the name of the Conda environment to activate.
ls .snakemake/conda/
- This will be a random series of numbers and letters.
- Activate the Conda environment that Flankophile will use.
source activate /scratch/$USER/flankophile/.snakemake/conda/50e62e607e6a24bb70ce9a5a50888445
- Make sure to adjust the command according to the name of the environment, identified previously.
- Run the pipeline.
snakemake --use-conda --cores 16
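The identify-then-activate steps above can be combined so that the environment name does not have to be copied by hand. The sketch below assumes, as this guide does, that exactly one environment directory exists under `.snakemake/conda/`; `find_conda_env` is a hypothetical helper name, not a Snakemake or Flankophile command. It refuses to guess if zero or several directories are found.

```shell
# find_conda_env [CONDA_DIR]
# Hypothetical helper: prints the single environment directory found in
# CONDA_DIR (default: .snakemake/conda) and fails otherwise.
find_conda_env() {
    local conda_dir count env_path d
    conda_dir="${1:-.snakemake/conda}"
    count=0
    env_path=""
    # The trailing slash in the pattern restricts matches to directories
    for d in "$conda_dir"/*/; do
        if [ -d "$d" ]; then
            count=$((count + 1))
            env_path="${d%/}"
        fi
    done
    if [ "$count" -ne 1 ]; then
        echo "Expected exactly one environment in ${conda_dir}, found ${count}" >&2
        return 1
    fi
    printf '%s\n' "$env_path"
}

# Example use (commented out; run from the flankophile/ directory):
# source activate "$(find_conda_env)"
```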
In a Job Submission Script
The following is an example of a job submission script that runs the Flankophile pipeline:
#!/bin/bash
#SBATCH --job-name=Flankophile
#SBATCH --partition=batch
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=32gb
#SBATCH --time=01:00:00
#SBATCH --output=%x.%j.out
#SBATCH --error=%x.%j.err
module load snakemake/6.9.1-Mamba-4.11.0-4
# The following assumes flankophile/ was copied into the base of your user's /scratch
cd /scratch/$USER/flankophile/
source activate /scratch/$USER/flankophile/.snakemake/conda/50e62e607e6a24bb70ce9a5a50888445
snakemake --use-conda --cores 16
Please note that you need to identify the correct name of the Conda environment just as you would if running the pipeline from within an interactive session.
Assuming flankophile/ was cloned into your user's /scratch directory, use the following steps to identify the name of the Conda environment:
- Identify the name of the Conda environment to activate.
ls /scratch/$USER/flankophile/.snakemake/conda/
Make sure to adjust the job submission script so that the correct Conda environment is loaded.
Running Pipeline Using /lscratch
To learn more about lscratch usage, please refer to the sections from our wiki, How to run a job using the local scratch /lscratch on a compute node and lscratch file system.
Below is an outline describing how Flankophile can be run using a compute node's /lscratch space. It assumes the user has a directory named /home/$USER/flankophile/ that contains config.yaml, metadata.tsv, and the input files. Keeping in line with the previous examples, the names of the input files would then be p_vulgaris-assembly25.fa and e_coli-assembly32.fa. The flankophile/ directory in the user's home would contain these input files as well as the job submission script, flankophile-sub.sh.
List of Files for /lscratch Usage:
The following is a list of the files and their expected locations for this guide:
/home/$USER/flankophile/config.yaml
/home/$USER/flankophile/metadata.tsv
/home/$USER/flankophile/p_vulgaris-assembly25.fa
/home/$USER/flankophile/e_coli-assembly32.fa
/home/$USER/flankophile/flankophile-sub.sh
/home/$USER/flankophile/config.yaml
This configuration file differs from the above Example .yaml Configuration File by the specified path to the metadata.tsv file. Instead of listing the path to the metadata.tsv file as a subdirectory of the user's /scratch directory, this configuration file should list the path to the metadata.tsv file as a subdirectory of the user's /home directory. It is acceptable to store the metadata.tsv and input files in the /home directory because Flankophile will only read from these files during its execution; any heavy writing will occur in the /lscratch directory where the pipeline and its dependencies will be installed (see Home_file_system for further information on proper usage of /home storage).
Make sure to replace any instances of $USER with your actual username:
# FLANKOPHILE version 0.2.10
# Alix Vincent Thorn
## 1 #######################################################################################
database: "input/example_input_files/ResFinder_08_02_2022.fa"
input_list: "/home/$USER/flankophile/metadata.tsv" # Edit this!
min_coverage_abricate: "98" # Minimum coverage in percentage compared to reference sequence.
min_identity_abricate: "98" # Minimum percentage identity compared to reference sequence.
## 2 ######################################################################################
flank_length_upstreams: "1500"
flank_length_downstreams: "1500"
## 3 #######################################################################################
cluster_identity_and_length_diff: "0.98" # CD-HIT parameters = c and s. 0.98 = cluster at 98 % identity and a maximum length difference of 98 %
## 4 ######################################################################################
k-mer_size: "16" # Kmersize used by kma index. Try with "16" if in doubt.
distance_measure: "1" # Choose measurement for making distance matrix
# Distance calculation methods:
#
# 1 k-mer hamming distance
# 64 Jaccard distance
# 256 Cosine distance
# 4096 Chi-square distance
/home/$USER/flankophile/metadata.tsv
Similar to the config.yaml file, this metadata.tsv file differs from the above Example .tsv Metadata File by the specified paths it contains. Note also that the paths it contains are in a subdirectory of the /home directory, flankophile/, instead of in the /home directory directly.
Make sure to replace any instances of $USER with your actual username, and to specify the paths to your actual data:
#assembly_name path metadata
assembly32 /home/$USER/flankophile/e_coli-assembly32.fa Escherichia
assembly25 /home/$USER/flankophile/p_vulgaris-assembly25.fa Proteus
/home/$USER/flankophile/flankophile-sub.sh
The major difference between running Flankophile using the /lscratch space and running Flankophile elsewhere is that the working directory inside /lscratch must be created before execution and removed after execution. These steps are handled by the following job submission script:
#!/usr/bin/env bash
#SBATCH --job-name=Flankophile
#SBATCH --partition=batch
#SBATCH --nodes=1
#SBATCH --gres=lscratch:30
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=32G
#SBATCH --time=05:00:00
#SBATCH --output=log.%j.out
#SBATCH --error=log.%j.err
cd $SLURM_SUBMIT_DIR
# Step 1 - Create working directory in /lscratch
mkdir -p /lscratch/${USER}/${SLURM_JOB_ID}
# Step 2 - cd to /lscratch working directory
cd /lscratch/${USER}/${SLURM_JOB_ID}
# Step 3 - Clone and cd into pipeline
git clone https://bitbucket.org/genomicepidemiology/flankophile/src/34f286e00088d12019dc30cb2c38c103f0efe506/ flankophile
cd flankophile
# Step 4 - Load snakemake then build required environments
module load snakemake/6.9.1-Mamba-4.11.0-4
snakemake --use-conda --conda-create-envs-only --cores $SLURM_CPUS_PER_TASK
# Step 5 - Copy config.yaml into /lscratch working directory
cp ${SLURM_SUBMIT_DIR}/config.yaml ./
# Step 6 - Activate newly created environment
source activate $(ls -d .snakemake/conda/*/)
# Step 7 - Run the pipeline
snakemake --use-conda --cores $SLURM_CPUS_PER_TASK
# Step 8 - Copy the output directory into submit directory
cp -r output/ ${SLURM_SUBMIT_DIR}/
# Step 9 - Clean up by removing the /lscratch working directory
rm -rf /lscratch/${USER}/${SLURM_JOB_ID}
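In the script above, Step 9 only runs if every earlier command lets the script reach its last line. One possible variation, shown here as a sketch rather than as the recommended Sapelo2 method, registers the cleanup with a Bash EXIT trap so that the working directory is removed even when a step fails. The helper name run_in_scratch is hypothetical. Note two trade-offs: partial output is deleted along with the directory, and cleanup is not guaranteed if the job is killed outright.

```shell
#!/usr/bin/env bash
# Sketch only: run_in_scratch is a hypothetical helper.
# It creates a working directory, runs a command inside it, and removes the
# directory when the subshell exits, whether the command succeeded or not.
run_in_scratch() {
    local workdir="$1"
    shift
    (
        mkdir -p "$workdir" || exit 1
        # The EXIT trap fires when this subshell ends, on success or failure.
        trap 'rm -rf "$workdir"' EXIT
        cd "$workdir" || exit 1
        "$@"
    )
}
```

Inside a job, the call might look like `run_in_scratch "/lscratch/${USER}/${SLURM_JOB_ID}" bash my-steps.sh`, where my-steps.sh is a placeholder for Steps 3 through 8, including the copy of output/ back to the submit directory.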
Step-by-Step Breakdown of flankophile-sub.sh
Step 1 - Create working directory in /lscratch
mkdir -p /lscratch/${USER}/${SLURM_JOB_ID}
- The above command creates a directory for the job to use.
Step 2 - cd to /lscratch working directory
cd /lscratch/${USER}/${SLURM_JOB_ID}
- The above command changes the current working directory to the directory created in the previous step.
Step 3 - Clone and cd into pipeline
git clone https://bitbucket.org/genomicepidemiology/flankophile/src/34f286e00088d12019dc30cb2c38c103f0efe506/ flankophile
cd flankophile
- The first of the above commands clones the Flankophile repository into a directory named flankophile/.
- The second of the above commands changes the current working directory into the repository.
  - The full path to the job's current directory would be /lscratch/${USER}/${SLURM_JOB_ID}/flankophile.
Step 4 - Load snakemake then build the required environments
module load snakemake/6.9.1-Mamba-4.11.0-4
snakemake --use-conda --conda-create-envs-only --cores $SLURM_CPUS_PER_TASK
- The first of the above commands loads the version of Snakemake that Flankophile requires.
- The second of the above commands creates the Conda environments that Flankophile will use at runtime.
  - The $SLURM_CPUS_PER_TASK variable is inherited from the job's environment and is equal to the value set for the --cpus-per-task option in the job submission script's header.
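Slurm only sets $SLURM_CPUS_PER_TASK when --cpus-per-task is given, so the variable is empty if that header line is removed or if the commands are tried outside a job. The following sketch shows one way to fall back to a single core in that case. The helper name job_cores is hypothetical and is not used by the script above.

```shell
#!/usr/bin/env bash
# Sketch only: job_cores is a hypothetical helper.
# Prints the core count requested with --cpus-per-task, or 1 when the
# variable is unset or empty.
job_cores() {
    echo "${SLURM_CPUS_PER_TASK:-1}"
}
```

It could then be used as `snakemake --use-conda --cores "$(job_cores)"`.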
Step 5 - Copy config.yaml into /lscratch working directory
cp ${SLURM_SUBMIT_DIR}/config.yaml ./
- The above command copies the config.yaml file into the job's current working directory, /lscratch/${USER}/${SLURM_JOB_ID}/flankophile.
- Because the command expects the config.yaml file to exist in the directory from which the job was submitted, No such file or directory errors can occur if the job is submitted from elsewhere.
  - In other words, make sure to cd into /home/$USER/flankophile before executing sbatch flankophile-sub.sh.
Step 6 - Activate newly created environment
source activate $(ls -d .snakemake/conda/*/)
- The above command activates the Conda environment created in Step 4.
  - Because the Flankophile pipeline directory is created at the beginning of the job, and then deleted at the end, the subcommand, $(ls -d .snakemake/conda/*/), will return the path of the Conda environment built during this job.
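The subcommand in Step 6 only yields a usable argument when .snakemake/conda/ contains exactly one environment directory. The following sketch makes that assumption explicit and fails with a message otherwise. The helper name find_single_env is hypothetical, and whether a given Flankophile version builds one environment or several is not confirmed here.

```shell
#!/usr/bin/env bash
# Sketch only: find_single_env is a hypothetical helper.
# Prints the single environment directory under the given base directory,
# or fails if there are none or more than one.
find_single_env() {
    local base="${1:-.snakemake/conda}"
    local envs=()
    local d
    for d in "$base"/*/; do
        # An unmatched glob stays literal and is filtered out by the -d test.
        if [[ -d "$d" ]]; then
            envs+=("${d%/}")
        fi
    done
    if (( ${#envs[@]} != 1 )); then
        echo "Expected exactly one environment in $base, found ${#envs[@]}" >&2
        return 1
    fi
    printf '%s\n' "${envs[0]}"
}
```

With this helper, Step 6 could be written as `source activate "$(find_single_env)"`.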
Step 7 - Run the pipeline
snakemake --use-conda --cores $SLURM_CPUS_PER_TASK
- The above command starts the Flankophile pipeline while specifying the number of cores to use with the $SLURM_CPUS_PER_TASK environment variable.
Step 8 - Copy the output directory into submit directory
cp -r output/ ${SLURM_SUBMIT_DIR}/
- The above command copies the output/ directory and its contents recursively into the directory from which the job was submitted.
  - This would be equivalent to cp -r /lscratch/${USER}/${SLURM_JOB_ID}/flankophile/output/ /home/$USER/flankophile.
Step 9 - Clean up by removing the /lscratch working directory
rm -rf /lscratch/${USER}/${SLURM_JOB_ID}
- The above command prevents files no longer in use from consuming space on /lscratch by explicitly removing the job's working directory as a last step.
Submitting the /lscratch Job
Be sure to cd into the flankophile/ directory prior to submitting the job.
cd /home/$USER/flankophile
sbatch flankophile-sub.sh
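Because the job script copies config.yaml from the submit directory, it can help to confirm that the expected files are present before calling sbatch. The following is a minimal sketch; the helper name check_submit_dir is hypothetical, and the file list is taken from the three files described in this section.

```shell
#!/usr/bin/env bash
# Sketch only: check_submit_dir is a hypothetical helper.
# Verifies that the files this section describes exist in the given directory.
check_submit_dir() {
    local dir="${1:-.}" f missing=0
    for f in config.yaml metadata.tsv flankophile-sub.sh; do
        if [[ ! -f "$dir/$f" ]]; then
            echo "Missing $dir/$f" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

A submission guarded by this check could look like `cd /home/$USER/flankophile && check_submit_dir && sbatch flankophile-sub.sh`.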
Installation
- Version 0.2.10: Install in your user's scratch directory.
System
- 64-bit Linux