Flankophile-Sapelo2: Difference between revisions

From Research Computing Center Wiki
(Added "load Snakemake module" step to Manual setup section)
(Removed Flankophile module and removed sections that stated Flankophile was available as a module)
Line 3: Line 3:


=== Program On ===
=== Program On ===
Sapelo2
[https://bitbucket.org/genomicepidemiology/flankophile/src/master/ Bitbucket]


=== Version ===
=== Version ===
Line 17: Line 17:
=== Running Program ===
=== Running Program ===
Please also refer to [[Running Jobs on Sapelo2]].
Please also refer to [[Running Jobs on Sapelo2]].
* A module containing a pre-built ''Flankophile'' pipeline is available on Sapelo2.
** <code>module load Flankophile/0.2.10</code>


==== Requirements ====
==== Requirements ====
Line 34: Line 31:
## Should be located at the root of the ''Flankophile'' repository directory.
## Should be located at the root of the ''Flankophile'' repository directory.


==== Setting up the Pipeline ====
==== Setup ====
Before running ''Flankophile'', the pipeline must be cloned and the dependencies must be installed. For convenience, a ''Flankophile'' directory has already been setup in a module that can be copied and used directly.
The steps to setup ''Flankophile'' are as follows:
 
===== Pre-Built GACRC Setup =====
GACRC provides a complete ''Flankophile'' directory in a module.
 
To use it, enter into an interactive session, load the module, and copy the directory into your user's <code>/scratch</code> directory. It may take a while to copy the directory due to its size.
 
#<code>interact --cpus-per-task 16 --mem 32GB</code>
# <code>module load Flankophile/0.2.10</code>
# <code>cp -r $EBROOTFLANKOPHILE/flankophile /scratch/$USER/</code>
 
The ''Flankophile'' pipeline in this directory already contains all required conda environments, and is ready for use.
 
===== Manual Setup =====
The steps to manually setup ''Flankophile'' are as follows:


# '''Enter into an interactive session'''  The installation of ''Flankophile'' involves an amount of resource utilization that should be performed by a compute node and not a submit node.
# '''Enter into an interactive session'''  The installation of ''Flankophile'' involves an amount of resource utilization that should be performed by a compute node and not a submit node.
Line 66: Line 49:


==== Example Files ====
==== Example Files ====
As listed above, ''Flankophile'' requires input files, a metadata file, and a config file to run.
As listed above, ''Flankophile'' requires <u>input files</u>, a <u>metadata file</u>, and a <u>config file</u> to run.  
 
=====<u>Note for Examples:</u>=====
 
* In each of the following examples, <u>be sure to edit the files and replace all instances of</u> <code>''$USER''</code> <u>with your actual username.</u>


===== Example Input Files =====
===== Example Input Files =====
The example input files used were sourced from [https://github.com/cadms/resfinder/tree/master ''ResFinder''<nowiki/>'s GitHub project page]. For convenience, they can also be found in the directory, <code>$EBROOTFLANKOPHILE/flankophile/example/data</code>.  
Unfortunately, the ''Flankophile'' pipeline does not come with any input files to use as examples.


Assuming <code>flankophile/</code> was copied into your user's <code>/scratch</code> directory, they are at the following paths:
===== Example ''.tsv'' Metadata File =====


* <code>'''/scratch/''$USER''/flankophile/example/data/test_isolate_01.fa'''</code>
The following text is an example of a metadata file's contents:
* <code>'''/scratch/''$USER''/flankophile/example/data/test_isolate_02.fa'''</code>
* <code>'''/scratch/''$USER''/flankophile/example/data/test_isolate_03.fa'''</code>
* <code>'''/scratch/''$USER''/flankophile/example/data/test_isolate_05.fa'''</code>
* <code>'''/scratch/''$USER''/flankophile/example/data/test_isolate_09a.fa'''</code>
* <code>'''/scratch/''$USER''/flankophile/example/data/test_isolate_09b.fa'''</code>


The location for each of the above input files should be listed in the metadata file, shown next.
====== <code>/scratch/$USER/flankophile/metadata.tsv</code> ======
<syntaxhighlight>
#assembly_name    path    metadata
assembly32    /scratch/$USER/e_coli-assembly32.fa    Escherichia
assembly25    /scratch/$USER/p_vulgaris-assembly25.fa    Proteus
</syntaxhighlight><u>When creating a</u> ''<code>metadata.tsv</code>'' <u>file for your own data, be sure to take note of its filepath</u><u>.</u> This will be listed after ''<code>input_list:</code>'' in the ''<code>config.yaml</code>'' file, shown below.


===== Example ''.tsv'' Metadata File =====
===== Example ''.yaml'' Configuration File =====
The following text is an example of a metadata file's contents:<syntaxhighlight>
#assembly_name path metadata
test_isolate_01 /scratch/$USER/flankophile/example/data/test_isolate_01.fa Escherichia
test_isolate_02 /scratch/$USER/flankophile/example/data/test_isolate_02.fa Escherichia
test_isolate_03 /scratch/$USER/flankophile/example/data/test_isolate_03.fa Escherichia
test_isolate_05 /scratch/$USER/flankophile/example/data/test_isolate_05.fa Escherichia
test_isolate_09a /scratch/$USER/flankophile/example/data/test_isolate_09a.fa Escherichia
test_isolate_09b /scratch/$USER/flankophile/example/data/test_isolate_09b.fa Escherichia
</syntaxhighlight>Assuming <code>flankophile/</code> was copied into your user's <code>/scratch</code> directory, this metadata file is at the following path:
 
* <code>'''/scratch/''$USER''/flankophile/example/metadata.tsv'''</code>
 
<u>When using your own data, be sure to create a new</u> <code>''metadata.tsv''</code> <u>file that lists the paths to your input data instead of the example data.</u>


The location of the metadata will be listed in the configuration file, shown next.
The following text is an example of a ''Flankophile'' configuration file's contents:


===== Example ''.yaml'' Configuration File =====
====== <code>/scratch/$USER/flankophile/config.yaml</code> ======
The following text is an example of a ''Flankophile'' configuration file's contents:<syntaxhighlight lang="yaml">
<syntaxhighlight lang="yaml">
# FLANKOPHILE version 0.2.10
# FLANKOPHILE version 0.2.10
# Alix Vincent Thorn
# Alix Vincent Thorn
Line 113: Line 79:
database: "input/example_input_files/ResFinder_08_02_2022.fa"
database: "input/example_input_files/ResFinder_08_02_2022.fa"


input_list: "example/metadata.tsv"    # Edit this!
input_list: "/scratch/$USER/flankophile/metadata.tsv"    # Edit this!


min_coverage_abricate: "98"                  # Minimum coverage in percentage compared to reference sequence.
min_coverage_abricate: "98"                  # Minimum coverage in percentage compared to reference sequence.
Line 143: Line 109:
#      256      Cosine distance
#      256      Cosine distance
#    4096      Chi-square distance
#    4096      Chi-square distance
</syntaxhighlight>Assuming <code>flankophile/</code> was copied into your user's <code>/scratch</code> directory, this metadata file is at the following path:
</syntaxhighlight>This configuration file contains the path to the metadata file, which contains the paths to the input data.  
 
* <code>'''/scratch/''$USER''/flankophile/config.yaml'''</code>
 
This configuration file contains the path to the metadata file, which contains the paths to the input data.  


<u>When using your own data, be sure to edit the</u> ''<code>config.yaml</code>'' <u>file to include the location of the corresponding</u> <code>''metadata.tsv''</code> <u>file you created for them.</u>
<u>When creating a ''<code>config.yaml</code>'' file for your own data, be sure to</u> <u>include the location of the</u> <code>''metadata.tsv''</code> <u>file you created.</u>


==== Starting the Pipeline ====
==== Starting the Pipeline ====
Line 164: Line 126:
## <code>module load snakemake/6.9.1-Mamba-4.11.0-4</code>
## <code>module load snakemake/6.9.1-Mamba-4.11.0-4</code>
# '''Identify the name of the ''Conda'' environment to activate.'''
# '''Identify the name of the ''Conda'' environment to activate.'''
## <code>ls .snakemake/conda/</code>
##<code>ls .snakemake/conda/</code>
## This will be a random series of numbers and letters.
# '''Activate the ''Conda'' environment that ''Flankophile'' will use.'''
# '''Activate the ''Conda'' environment that ''Flankophile'' will use.'''
##<code>source activate /scratch/''$USER''/flankophile/.snakemake/conda/50e62e607e6a24bb70ce9a5a50888445</code>
##<code>source activate /scratch/''$USER''/flankophile/.snakemake/conda/'''50e62e607e6a24bb70ce9a5a50888445'''</code>
##Make sure to adjust the command according to the name of the environment, identified previously.
# '''Run the pipeline.'''
# '''Run the pipeline.'''
## <code>snakemake --use-conda --cores 16</code>
## <code>snakemake --use-conda --cores 16</code>
Line 191: Line 155:
</syntaxhighlight>'''Please note that you need to identify the correct name of the ''Conda'' environment just as you would if running the pipeline from within an interactive session.'''
</syntaxhighlight>'''Please note that you need to identify the correct name of the ''Conda'' environment just as you would if running the pipeline from within an interactive session.'''


Assuming <code>flankophile/</code> was copied into your user's <code>/scratch</code> directory, use the following steps to identify the name of the ''Conda'' environment:
Assuming <code>flankophile/</code> was cloned into your user's <code>/scratch</code> directory, use the following steps to identify the name of the ''Conda'' environment:


# '''Identify the name of the ''Conda'' environment to activate.'''
# '''Identify the name of the ''Conda'' environment to activate.'''
Line 200: Line 164:
=== Installation ===
=== Installation ===


* Version 0.2.10: Installed using EasyBuild.
* Version 0.2.10: Install in your user's scratch directory.


=== System ===
=== System ===


* 64-bit Linux
* 64-bit Linux

Revision as of 16:50, 7 March 2024

Category

Pipeline

Program On

Bitbucket

Version

0.2.10

Author / Distributor

Genomic Epidemiology

Description

"Flankophile is a bioinformatics pipeline for gene synteny analysis." Bitbucket

Running Program

Please also refer to Running Jobs on Sapelo2.

Requirements

Flankophile requires the following:

  1. The Flankophile pipeline
    1. Contains the pipeline's files and all required conda environments.
  2. Input sequence data in FASTA (.fa) format
    1. One or more FASTA files that Flankophile should use as input.
  3. A tab-separated value (.tsv) metadata file
    1. Describes the assembly names, their filepaths, and the genus for each input FASTA file.
  4. An analysis configuration file in YAML (.yaml)
    1. Describes the analysis for Flankophile to perform and the parameters to use during analysis.
    2. Should be located at the root of the Flankophile repository directory.

Setup

The steps to set up Flankophile are as follows:

  1. Enter into an interactive session. Installing Flankophile is resource-intensive, so it should be done on a compute node and not on a submit node.
    1. interact --cpus-per-task 16 --mem 32GB
  2. Change the working directory to your user's /scratch directory. Like other pipelines, the Flankophile installation directory should be in a location suited for large amounts of sustained I/O. Do not install Flankophile in the /home directory; the filesystem that hosts /home is not suited for this workload.
    1. cd /scratch/$USER
  3. Clone the Flankophile repository at the desired commit. The versions of Flankophile are not labeled with tags on the project's Bitbucket page. Instead, the version is stated in the commit message; this guide uses the commit corresponding to version 0.2.10 (34f286e).
    1. git clone https://bitbucket.org/genomicepidemiology/flankophile/src/34f286e00088d12019dc30cb2c38c103f0efe506/ flankophile
      1. The trailing flankophile argument (after ...0efe506/) instructs git to clone the repository into a directory named flankophile.
  4. Change the working directory to the newly cloned flankophile/ repository
    1. cd flankophile
  5. Load the Snakemake module.
    1. module load snakemake/6.9.1-Mamba-4.11.0-4
  6. Create the conda environments that Flankophile will use at runtime. In some tests, creating the conda environments at runtime caused the pipeline to fail due to missing or incompatible dependencies. To avoid this, create the conda environments ahead of time. This step takes about 15-25 minutes to complete.
    1. snakemake --use-conda --conda-create-envs-only --cores 16
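
If cloning from the /src/<commit>/ address in step 3 does not work with your version of git, one alternative is to clone the repository and then check out the commit separately. The following is a minimal sketch under that assumption; clone_at_commit is a hypothetical helper and is not part of Flankophile.

```shell
#!/bin/bash
# Hypothetical helper: clone a repository, then check out one specific commit.
# Arguments: repository URL (or local path), commit hash, destination directory.
clone_at_commit() {
    local repo_url="$1" commit="$2" dest="$3"
    # Clone first; only check out the commit if the clone succeeded.
    git clone "$repo_url" "$dest" &&
        git -C "$dest" checkout --quiet "$commit"
}
```

For example, assuming the repository's standard clone address is https://bitbucket.org/genomicepidemiology/flankophile.git (this address is an assumption and should be confirmed on the Bitbucket page), the call would be clone_at_commit followed by that address, 34f286e00088d12019dc30cb2c38c103f0efe506, and flankophile. The checkout leaves the repository in a detached HEAD state, which is fine for running the pipeline.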

Example Files

As listed above, Flankophile requires input files, a metadata file, and a config file to run.

Example Input Files

Unfortunately, the Flankophile pipeline does not come with any input files to use as examples.

Example .tsv Metadata File

The following text is an example of a metadata file's contents:

/scratch/$USER/flankophile/metadata.tsv
#assembly_name    path    metadata
assembly32    /scratch/$USER/e_coli-assembly32.fa    Escherichia
assembly25    /scratch/$USER/p_vulgaris-assembly25.fa    Proteus

When creating a metadata.tsv file for your own data, be sure to take note of its filepath. This will be listed after input_list: in the config.yaml file, shown below.
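
When there are many assemblies, writing the metadata file by hand is error-prone. The sketch below shows one way to generate it from a directory of .fa files. make_metadata is a hypothetical helper, and it assumes every file in the directory shares the same metadata value (for example, one genus) and that the assembly name is the file name without the .fa extension.

```shell
#!/bin/bash
# Hypothetical helper: write a Flankophile-style metadata.tsv for every
# .fa file in a directory. Every row receives the same metadata value.
# Arguments: FASTA directory, metadata value, output file.
make_metadata() {
    local fasta_dir="$1" label="$2" out_file="$3"
    local fasta name
    # Header line, tab-separated, as in the example metadata file.
    printf '#assembly_name\tpath\tmetadata\n' > "$out_file"
    for fasta in "$fasta_dir"/*.fa; do
        [ -e "$fasta" ] || continue        # the glob matched no files
        name="$(basename "$fasta" .fa)"    # assembly name = file name without .fa
        printf '%s\t%s\t%s\n' "$name" "$fasta" "$label" >> "$out_file"
    done
}
```

A call such as make_metadata /scratch/$USER/assemblies Escherichia /scratch/$USER/flankophile/metadata.tsv would write one row per file; the assemblies directory name here is only an illustration. Because the shell expands $USER before the function runs, the paths written to the file contain your actual username.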

Example .yaml Configuration File

The following text is an example of a Flankophile configuration file's contents:

/scratch/$USER/flankophile/config.yaml
# FLANKOPHILE version 0.2.10
# Alix Vincent Thorn


## 1 #######################################################################################

database: "input/example_input_files/ResFinder_08_02_2022.fa"

input_list: "/scratch/$USER/flankophile/metadata.tsv"     # Edit this!

min_coverage_abricate: "98"                   # Minimum coverage in percentage compared to reference sequence.

min_identity_abricate: "98"                   # Minimum percentage identity compared to reference sequence.

## 2 ######################################################################################

flank_length_upstreams: "1500"

flank_length_downstreams: "1500"

## 3 #######################################################################################

cluster_identity_and_length_diff: "0.98"      # CD-HIT parameters = c and s.  0.98 = cluster at 98 % identity and a maximum length difference of 98 %

## 4  ######################################################################################

k-mer_size: "16"                              # Kmersize used by kma index. Try with "16" if in doubt.

distance_measure: "1"                         # Choose measurement for making distance matrix



# Distance calculation methods:
#
#        1      k-mer hamming distance
#       64      Jaccard distance
#      256      Cosine distance
#     4096      Chi-square distance

This configuration file contains the path to the metadata file, which contains the paths to the input data.

When creating a config.yaml file for your own data, be sure to include the location of the metadata.tsv file you created.
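
Since the configuration file points to the metadata file, and the metadata file points to the input data, a broken path anywhere in that chain may only surface once the pipeline is running. The following sketch checks the chain beforehand. Both functions are hypothetical helpers, and they assume the input_list value is double-quoted on a single line, as in the example above.

```shell
#!/bin/bash
# Hypothetical helpers (not part of Flankophile) for checking the example files.

# Print the value of input_list from a Flankophile-style config.yaml.
get_input_list() {
    sed -n 's/^input_list:[[:space:]]*"\([^"]*\)".*/\1/p' "$1"
}

# Report every FASTA path in a metadata.tsv that does not exist.
# Returns 0 when all paths exist, 1 otherwise.
check_metadata() {
    local metadata="$1" missing=0
    local name path label tab
    tab="$(printf '\t')"
    # Read tab-separated columns; also handle a final line without a newline.
    while IFS="$tab" read -r name path label || [ -n "$name" ]; do
        case "$name" in
            ''|'#'*) continue ;;    # skip blank lines and the header comment
        esac
        if [ ! -f "$path" ]; then
            echo "Missing input file for $name: $path" >&2
            missing=1
        fi
    done < "$metadata"
    return "$missing"
}
```

The two can be combined as check_metadata "$(get_input_list /scratch/$USER/flankophile/config.yaml)". Note that these helpers read the files as plain text, so a path that still contains the literal text $USER is reported as missing.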

Starting the Pipeline

Once the required files are in place, using either the above examples or your own inputs, the pipeline can be started in either an interactive session or within a job submission script.

In an Interactive Session

The procedure to start the Flankophile pipeline in an interactive session is as follows:

  1. Enter into an interactive session if not already inside one.
    1. interact --cpus-per-task 16 --mem 32GB
  2. Change the working directory to the location of the flankophile/ directory. (The following command assumes it is in the base of your /scratch directory)
    1. cd /scratch/$USER/flankophile/
  3. Load the snakemake module into your environment. (Flankophile requires snakemake 6.9.1 with Miniconda/Mamba 4.11.0)
    1. module load snakemake/6.9.1-Mamba-4.11.0-4
  4. Identify the name of the Conda environment to activate.
    1. ls .snakemake/conda/
    2. The environment name is a seemingly random string of numbers and letters, such as 50e62e607e6a24bb70ce9a5a50888445.
  5. Activate the Conda environment that Flankophile will use.
    1. source activate /scratch/$USER/flankophile/.snakemake/conda/50e62e607e6a24bb70ce9a5a50888445
    2. Make sure to adjust the command according to the name of the environment, identified previously.
  6. Run the pipeline.
    1. snakemake --use-conda --cores 16
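
Steps 4 and 5 require copying the environment name by hand. As a sketch of how that lookup could be scripted, the hypothetical helper below prints the environment directory when exactly one exists and stops with an error otherwise, because this guide does not state which environment to choose when several are present.

```shell
#!/bin/bash
# Hypothetical helper: print the single environment directory found under a
# .snakemake/conda directory. Fails if there are zero or several candidates.
pick_conda_env() {
    local conda_dir="$1" count=0 env_dir="" d
    for d in "$conda_dir"/*/; do
        [ -d "$d" ] || continue     # the glob matched nothing
        env_dir="${d%/}"            # drop the trailing slash
        count=$((count + 1))
    done
    if [ "$count" -eq 1 ]; then
        printf '%s\n' "$env_dir"
    else
        echo "Expected exactly one environment in $conda_dir, found $count" >&2
        return 1
    fi
}
```

With the helper defined, the activation step could be written as source activate "$(pick_conda_env /scratch/$USER/flankophile/.snakemake/conda)". Only directories are counted, so any plain files stored alongside the environments are ignored.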

In a Job Submission Script

The following is an example of a job submission script that runs the Flankophile pipeline:

#!/bin/bash
#SBATCH --job-name=Flankophile
#SBATCH --partition=batch
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=32gb
#SBATCH --time=01:00:00
#SBATCH --output=%x.%j.out
#SBATCH --error=%x.%j.err

module load snakemake/6.9.1-Mamba-4.11.0-4

# The following assumes flankophile/ was cloned into the base of your user's /scratch directory
cd /scratch/$USER/flankophile/
source activate /scratch/$USER/flankophile/.snakemake/conda/50e62e607e6a24bb70ce9a5a50888445

snakemake --use-conda --cores 16

Please note that you need to identify the correct name of the Conda environment just as you would if running the pipeline from within an interactive session.

Assuming flankophile/ was cloned into your user's /scratch directory, use the following steps to identify the name of the Conda environment:

  1. Identify the name of the Conda environment to activate.
    1. ls /scratch/$USER/flankophile/.snakemake/conda/

Make sure to adjust the job submission script so that the correct Conda environment is loaded.
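
A batch job that fails on a missing file still has to wait in the queue first, so it can be worth checking the pipeline directory before submitting. The following is a minimal sketch of such a check; preflight is a hypothetical helper, and it only tests for the two items this guide mentions: the config.yaml file at the root of the repository and the .snakemake/conda directory created during setup.

```shell
#!/bin/bash
# Hypothetical helper: confirm a Flankophile directory looks ready to run.
# Returns 0 when both checks pass, 1 otherwise.
preflight() {
    local pipeline_dir="$1" status=0
    if [ ! -f "$pipeline_dir/config.yaml" ]; then
        echo "Missing $pipeline_dir/config.yaml" >&2
        status=1
    fi
    if [ ! -d "$pipeline_dir/.snakemake/conda" ]; then
        echo "Missing $pipeline_dir/.snakemake/conda (were the conda environments created?)" >&2
        status=1
    fi
    return "$status"
}
```

One way to use it is to run preflight /scratch/$USER/flankophile on the command line before calling sbatch, or to place preflight /scratch/$USER/flankophile || exit 1 near the top of the job submission script.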

Installation

  • Version 0.2.10: Install in your user's scratch directory.

System

  • 64-bit Linux