Bracken-Sapelo2

From Research Computing Center Wiki

Category

Bioinformatics

Program On

Sapelo2

Version

3.1

Author / Distributor

Bracken

Description

"Bracken (Bayesian Reestimation of Abundance with KrakEN) is a highly accurate statistical method that computes the abundance of species in DNA sequences from a metagenomics sample. Bracken uses the taxonomy labels assigned by Kraken, a highly accurate metagenomics classification algorithm, to estimate the number of reads originating from each species present in a sample. Kraken classifies reads to the best matching location in the taxonomic tree, but does not estimate abundances of species. We use the Kraken database itself to derive probabilities that describe how much sequence from each genome is identical to other genomes in the database, and combine this information with the assignments for a particular sample to estimate abundance at the species level, the genus level, or above. Combined with the Kraken classifier, Bracken produces accurate species- and genus-level abundance estimates even when a sample contains two or more near-identical species.

NOTE: Bracken is compatible with both Kraken 1 and Kraken 2. However, the default kmer length is different depending on the version of Kraken used. If you use Kraken 1 defaults, specify 31 as the kmer length. If you use Kraken 2 defaults, specify 35 as the kmer length." More details are at Bracken instructions

Running Program

Also refer to Running Jobs on Sapelo2

We currently have the latest version of the following databases for Bracken available in /db/Bracken/20250827/:

bacteria
core_nt
eupathdb
fungi
gtdb
pluspf
pluspfp
pluspfp-16
viral

The bacteria database has been created for a read length of 100 (which is the default for Bracken), while the fungi and viral databases are for a read length of 150; all other databases can accommodate read lengths of 50, 75, 100, 150, 200, 250, and 300. Note, however, that only core_nt, eupathdb, gtdb, pluspf, pluspfp, and pluspfp-16 are also available for Kraken2 in /db/kraken2/20250814/.
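
To check which read lengths a given database directory supports before choosing a value for -r, one option is to list its k-mer distribution files. The following is a minimal sketch, assuming the directories follow the file naming produced by bracken-build (database<READ_LEN>mers.kmer_distrib). The function name list_read_lengths is hypothetical, and the actual layout under /db/Bracken/20250827/ should be confirmed on the cluster.

```shell
#!/bin/bash
# Hypothetical helper: print the read lengths for which a Bracken database
# directory contains a k-mer distribution file.
# Assumption: files are named database<READ_LEN>mers.kmer_distrib.
list_read_lengths() {
    local db_dir="$1"
    local f name
    for f in "$db_dir"/database*mers.kmer_distrib; do
        [ -e "$f" ] || continue           # skip if the glob matched nothing
        name="${f##*/}"                    # strip the directory part
        name="${name#database}"            # drop the "database" prefix
        echo "${name%mers.kmer_distrib}"   # drop the suffix, leaving the length
    done | sort -n
}

# Example (path taken from this page; adjust the database name as needed):
# list_read_lengths /db/Bracken/20250827/pluspf
```

If the read length you plan to pass to -r is not in the list, Bracken will not find a matching distribution file in that directory.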

Bracken analysis involves two steps. The first uses Kraken 1 or 2 to generate a report file from your FASTQ-formatted metagenomics sample; the second uses Bracken itself to estimate species- or other taxon-level abundances from this report file. Kraken2 is recommended because it requires only a single command and because we no longer have any Kraken1 databases available; if you want to use Kraken1, you will need to obtain the databases yourself. More detailed information can be found in the directions in the GitHub repository. Bear in mind that the two steps described here correspond to what the developers call "Step 2" and "Step 3" (their "Step 1" generates the Bracken database files, which is not necessary here because we already have them available).

Version 3.1

Version 3.1, installed at

  • /apps/eb/Bracken/3.1-GCCcore-12.3.0/

To use it, please load the module with:

module load Bracken/3.1-GCCcore-12.3.0

Note that you also need to load Kraken. If you use Kraken1, load it first, run the first step mentioned above, and then unload it before loading Bracken, as we do not have Kraken1 installed with a compatible toolchain. It is also a good idea to load a toolchain-compatible version of Python (i.e., Python/3.11.3-GCCcore-12.3.0).

Here is an example of a shell script, sub.sh, to run Bracken on the batch partition using a read length of 150 and the PlusPF Kraken2/Bracken database:

#!/bin/bash
#SBATCH --job-name=BrackenJob
#SBATCH --partition=batch
#SBATCH --mail-type=ALL
#SBATCH --mail-user=username@uga.edu
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=100gb
#SBATCH --time=08:00:00
#SBATCH --output=log.%j.out
#SBATCH --error=log.%j.err

cd $SLURM_SUBMIT_DIR

module load Bracken/3.1-GCCcore-12.3.0
module load Kraken2/2.1.3-gompi-2023a
module load Python/3.11.3-GCCcore-12.3.0

#Do Step 1, where "SAMPLE.fastq" is the input file containing your metagenomics sequencing reads
kraken2 --db /db/kraken2/20250814/pluspf --threads 4 --report SAMPLE.kreport SAMPLE.fastq > SAMPLE.kraken

#Do Step 2 using the .kreport file generated in Step 1 as the input
bracken -d /db/Bracken/20250827/pluspf -i ./SAMPLE.kreport -r 150 -o ./SAMPLE_150mers.bracken

In the real submission script, review the values above (for example, the job name, email address, resource requests, database, read length, and file names) and replace them with values appropriate for your job. Make sure the value you use for "--cpus-per-task=" in the Slurm headers is the same as that used for "--threads" in the Kraken2 command.
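
One way to keep the Kraken2 thread count in step with the Slurm request is to read it from the SLURM_CPUS_PER_TASK environment variable, which Slurm sets inside a job when --cpus-per-task is given. The following is a minimal sketch; the helper name kraken_threads and the fallback value of 1 are assumptions made for illustration.

```shell
#!/bin/bash
# Hypothetical helper: echo the thread count to pass to kraken2.
# Uses the value Slurm exports for --cpus-per-task, and falls back to 1
# when the variable is unset (e.g. when testing outside a Slurm job).
kraken_threads() {
    echo "${SLURM_CPUS_PER_TASK:-1}"
}

# Usage inside sub.sh, replacing the hard-coded "--threads 4":
# kraken2 --db /db/kraken2/20250814/pluspf --threads "$(kraken_threads)" \
#         --report SAMPLE.kreport SAMPLE.fastq > SAMPLE.kraken
```

With this approach, changing "--cpus-per-task=" in the header is the only edit needed when adjusting the number of cores.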

Here is an example of job submission command:

sbatch ./sub.sh 
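
Once the job has finished, the Bracken output file can be inspected from the command line. The following is a minimal sketch, assuming the output is a tab-delimited table with one header line, the taxon name in column 1, and fraction_total_reads in column 7; check the header of your own file before relying on these column positions. The function name top_taxa is hypothetical, and the -g sort option assumes GNU sort.

```shell
#!/bin/bash
# Hypothetical helper: print the N most abundant taxa from a Bracken output
# file as "name<TAB>fraction".
# Assumptions: tab-delimited file, one header line, taxon name in column 1,
# fraction_total_reads in column 7.
top_taxa() {
    local file="$1" n="$2"
    local tab
    tab=$(printf '\t')
    tail -n +2 "$file" |               # skip the header line
        sort -t "$tab" -k7,7gr |       # sort by column 7, largest first
        head -n "$n" |                 # keep the top N rows
        cut -f1,7                      # show only name and fraction
}

# Example, using the output name from sub.sh above:
# top_taxa SAMPLE_150mers.bracken 10
```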

Documentation

$ module load Bracken/3.1-GCCcore-12.3.0
$ bracken -h
Usage: bracken -v -d MY_DB -i INPUT -o OUTPUT -w OUTREPORT -r READ_LEN -l LEVEL -t THRESHOLD
  -v             Echoes the current software version and exits
  MY_DB          location of Kraken database
  INPUT          Kraken REPORT file to use for abundance estimation
  OUTPUT         file name for Bracken default output
  OUTREPORT      New Kraken REPORT output file with Bracken read estimates
  READ_LEN       read length to get all classifications for (default: 100)
  LEVEL          level to estimate abundance at [options: D,P,C,O,F,G,S,S1,etc] (default: S)
  THRESHOLD      number of reads required PRIOR to abundance estimation to perform reestimation (default: 0)
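
The options above can be combined to estimate abundance at a level other than species. The following is a minimal sketch that wraps bracken with the -l, -t, and -w flags from the usage text; the function name run_bracken_level, the output file names, and the choice of the pluspf database with a read length of 150 are assumptions made for illustration.

```shell
#!/bin/bash
# Hypothetical wrapper around bracken for a chosen taxonomic level.
# Arguments: <kreport> <level> <threshold>; output names are illustrative.
run_bracken_level() {
    local kreport="$1" level="$2" threshold="$3"
    local prefix="${kreport%.kreport}"   # strip the .kreport extension
    bracken -d /db/Bracken/20250827/pluspf \
            -i "$kreport" \
            -r 150 \
            -l "$level" \
            -t "$threshold" \
            -o "${prefix}.${level}.bracken" \
            -w "${prefix}.${level}.bracken.kreport"
}

# Example: genus-level estimates with a threshold of 10 reads
# run_bracken_level SAMPLE.kreport G 10
```

The example call would write SAMPLE.G.bracken (the default Bracken output) and SAMPLE.G.bracken.kreport (a Kraken-style report with Bracken read estimates).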


Installation

source code from Bracken

System

64-bit Linux