BRAKER-Sapelo2

From Research Computing Center Wiki
Revision as of 09:39, 14 June 2024 by Jordan (talk | contribs) (Revised the recommendations regarding TSEBRA, it turns out that this actually does get loaded with the two foss-2022a modules for v3.0.3)
Jump to navigation Jump to search

Category

Bioinformatics

Program On

Sapelo2

Version

2.1.6, 3.0.3, 3.0.8

Author / Distributor

Gaius-Augustus

Description

"BRAKER is a pipeline for fully automated prediction of protein coding genes with GeneMark-ES/ET and AUGUSTUS in novel eukaryotic genomes." More details are at BRAKER

Running Program

Also refer to Running Jobs on Sapelo2


Version 2.1.6

Version 2.1.6, installed at

  • /apps/eb/BRAKER/2.1.6-foss-2022a/

To use it, please load the module with:

module load BRAKER/2.1.6-foss-2022a

Version 3.0.3

Version 3.0.3 installed at

  • /apps/eb/BRAKER/3.0.3-foss-2022a-long_reads/
  • /apps/eb/BRAKER/3.0.3-foss-2022a/
  • /apps/eb/BRAKER/3.0.3-long_reads/

To use it, please load the module with:

module load BRAKER/3.0.3-foss-2022a-long_reads

or

module load BRAKER/3.0.3-foss-2022a

or

module load BRAKER/3.0.3-foss-2022a-long_reads

Version 3.0.8

Version 3.0.8 installed at

  • /apps/eb/BRAKER/3.0.8-foss-2022a/

To use it, please load the module with:

module load BRAKER/3.0.8-foss-2022a

In order to use BRAKER, you need to have downloaded the key file for GeneMarker and put it into your home directory. Please follow instructions found at GeneMarkES-Sapelo2.

Furthermore, BRAKER requires TSEBRA, of which Version 1.1.2 is installed on Sapelo2 but not currently included in BRAKER/3.0.8-foss-2022a. Therefore, you will also need to load TSEBRA separately if you want to use Version 3.0.8 of BRAKER. TSEBRA 1.1.2 can be found at

  • /apps/eb/TSEBRA/1.1.2-foss-2022a-long_reads/
  • /apps/eb/TSEBRA/1.1.2-foss-2022a/
  • /apps/eb/TSEBRA/1.1.2-GCCcore-11.3.0

It is recommended that you use the second option above, so as to avoid any dependency conflicts or other unforeseeable problems.

Finally, in order to use BRAKER to predict genes using a protein database, BRAKER needs to use ProtHint; this is included in GeneMark, but BRAKER does not always successfully find it, so you may find it useful to specify the path to it using the following (see also the example submission script below):

--PROTHINT_PATH=/apps/eb/GeneMark-ET/4.71-GCCcore-11.3.0/ProtHint/bin/

Here is an example of a shell script sub.sh to run BRAKER using a protein database from the batch queue:

#!/bin/bash
#SBATCH --job-name=BRAKER_Job
#SBATCH --partition=batch
#SBATCH --mail-type=ALL
#SBATCH --mail-user=username@uga.edu
#SBATCH --ntasks=1
#SBATCH --mem=16gb
#SBATCH --time=48:00:00
#SBATCH --output=log.%j.out
#SBATCH --error=log.%j.err

cd $SLURM_SUBMIT_DIR

module load BRAKER/3.0.8-foss-2022a
module load TSEBRA/1.1.2-foss-2022a

braker.pl --genome=Newly_assembled_genome.fasta --prot_seq=Protein_database.faa --PROTHINT_PATH=/apps/eb/GeneMark-ET/4.71-GCCcore-11.3.0/ProtHint/bin/

In the real submission script, be sure to replace the above underlined values with proper values.

Here is an example of job submission command:

sbatch ./sub.sh 

Documentation

module load BRAKER/3.0.8-foss-2022a
braker.pl 

DESCRIPTION

braker.pl   Pipeline for predicting genes with GeneMark-EX and AUGUSTUS with
            RNA-Seq and/or proteins

SYNOPSIS

braker.pl [OPTIONS] --genome=genome.fa {--bam=rnaseq.bam | --prot_seq=prot.fa}

INPUT FILE OPTIONS

--genome=genome.fa                  fasta file with DNA sequences
--bam=rnaseq.bam                    bam file with spliced alignments from
                                    RNA-Seq
--prot_seq=prot.fa                  A protein sequence file in multi-fasta
                                    format used to generate protein hints.
                                    Unless otherwise specified, braker.pl will
                                    run in "EP mode" or "ETP mode which uses 
                                    ProtHint to generate protein hints and 
                                    GeneMark-EP+ or GeneMark-ETP to
                                    train AUGUSTUS.
--hints=hints.gff                   Alternatively to calling braker.pl with a
                                    bam or protein fasta file, it is possible to
                                    call it with a .gff file that contains
                                    introns extracted from RNA-Seq and/or
                                    protein hints (most frequently coming
                                    from ProtHint). If you wish to use the
                                    ProtHint hints, use its
                                    "prothint_augustus.gff" output file.
                                    This flag also allows the usage of hints
                                    from additional extrinsic sources for gene
                                    prediction with AUGUSTUS. To consider such
                                    additional extrinsic information, you need
                                    to use the flag --extrinsicCfgFiles to
                                    specify parameters for all sources in the
                                    hints file (including the source "E" for
                                    intron hints from RNA-Seq).
                                    In ETP mode, this option can be used together
                                    with --geneMarkGtf and --traingenes to provide
                                    BRAKER with results of a previous GeneMark-ETP
                                    run, so that the GeneMark-ETP step can be
                                    skipped. In this case, specify the hintsfile of
                                    a previous BRAKER run here, or generate a
                                    hintsfile from the GeneMark-ETP working
                                    directory with the script get_etp_hints.py.
--rnaseq_sets_ids=SRR1111,SRR1115   IDs of RNA-Seq sets that are either in
                                    one of the directories specified with
                                    --rnaseq_sets_dir, or that can be downloaded
                                    from SRA. If you want to use local files, you
                                    can use unaligned reads in FASTQ format
                                    (they have to be named ID.fastq if unpaired or
                                    ID_1.fastq, ID_2.fastq if paired), or aligned reads
                                    as a BAM file (named ID.bam).
--rnaseq_sets_dir=/path/to/rna_dir1 Locations where the local files of RNA-Seq data
                                    reside that were specified with --rnaseq_sets_ids.

FREQUENTLY USED OPTIONS

--species=sname                     Species name. Existing species will not be
                                    overwritten. Uses Sp_1 etc., if no species
                                    is assigned
--AUGUSTUS_ab_initio                output ab initio predictions by AUGUSTUS
                                    in addition to predictions with hints by
                                    AUGUSTUS
--softmasking_off                   Turn off softmasking option (enables by 
                                    default, discouraged to disable!)
--esmode                            Run GeneMark-ES (genome sequence only) and
                                    train AUGUSTUS on long genes predicted by
                                    GeneMark-ES. Final predictions are ab initio
--gff3                              Output in GFF3 format (default is gtf
                                    format)
--threads                           Specifies the maximum number of threads that
                                    can be used during computation. Be aware:
                                    optimize_augustus.pl will use max. 8
                                    threads; augustus will use max. nContigs in
                                    --genome=file threads.
--workingdir=/path/to/wd/           Set path to working directory. In the
                                    working directory results and temporary
                                    files are stored
--nice                              Execute all system calls within braker.pl
                                    and its submodules with bash "nice"
                                    (default nice value)
--alternatives-from-evidence=true   Output alternative transcripts based on
                                    explicit evidence from hints (default is
                                    true).
--fungus                            GeneMark-EX option: run algorithm with
                                    branch point model (most useful for fungal
                                    genomes)
--crf                               Execute CRF training for AUGUSTUS;
                                    resulting parameters are only kept for
                                    final predictions if they show higher
                                    accuracy than HMM parameters.
--keepCrf                           keep CRF parameters even if they are not
                                    better than HMM parameters
--makehub                           Create track data hub with make_hub.py
                                    for visualizing BRAKER results with the
                                    UCSC GenomeBrowser
--busco_lineage=lineage             If you provide a BUSCO lineage, BRAKER will
                                    run compleasm on genome level to generate hints
                                    from BUSCO to enhance BUSCO discovery in the
                                    protein set. Also, if you provide a BUSCO
                                    lineage, BRAKER will run compleasm to assess
                                    the protein sets that go into TSEBRA combination,
                                    and will determine the TSEBRA mode to maximize
                                    BUSCO. Do not provide a busco_lineage if you
                                    want to determina natural BUSCO sensivity of
                                    BRAKER!
--email                             E-mail address for creating track data hub
--version                           Print version number of braker.pl
--help                              Print this help message

CONFIGURATION OPTIONS (TOOLS CALLED BY BRAKER)

--AUGUSTUS_CONFIG_PATH=/path/       Set path to config directory of AUGUSTUS
                                    (if not specified as environment
                                    variable). BRAKER1 will assume that the
                                    directories ../bin and ../scripts of
                                    AUGUSTUS are located relative to the
                                    AUGUSTUS_CONFIG_PATH. If this is not the
                                    case, please specify AUGUSTUS_BIN_PATH
                                    (and AUGUSTUS_SCRIPTS_PATH if required).
                                    The braker.pl commandline argument
                                    --AUGUSTUS_CONFIG_PATH has higher priority
                                    than the environment variable with the
                                    same name.
--AUGUSTUS_BIN_PATH=/path/          Set path to the AUGUSTUS directory that
                                    contains binaries, i.e. augustus and
                                    etraining. This variable must only be set
                                    if AUGUSTUS_CONFIG_PATH does not have
                                    ../bin and ../scripts of AUGUSTUS relative
                                     to its location i.e. for global AUGUSTUS
                                    installations. BRAKER1 will assume that
                                    the directory ../scripts of AUGUSTUS is
                                    located relative to the AUGUSTUS_BIN_PATH.
                                    If this is not the case, please specify
                                    --AUGUSTUS_SCRIPTS_PATH.
--AUGUSTUS_SCRIPTS_PATH=/path/      Set path to AUGUSTUS directory that
                                    contains scripts, i.e. splitMfasta.pl.
                                    This variable must only be set if
                                    AUGUSTUS_CONFIG_PATH or AUGUSTUS_BIN_PATH
                                    do not contains the ../scripts directory
                                    of AUGUSTUS relative to their location,
                                    i.e. for special cases of a global
                                    AUGUSTUS installation.
--BAMTOOLS_PATH=/path/to/           Set path to bamtools (if not specified as
                                    environment BAMTOOLS_PATH variable). Has
                                    higher priority than the environment
                                    variable.
--GENEMARK_PATH=/path/to/           Set path to GeneMark-ET (if not specified
                                    as environment GENEMARK_PATH variable).
                                    Has higher priority than environment
                                    variable.
--SAMTOOLS_PATH=/path/to/           Optionally set path to samtools (if not
                                    specified as environment SAMTOOLS_PATH
                                    variable) to fix BAM files automatically,
                                    if necessary. Has higher priority than
                                    environment variable.
--PROTHINT_PATH=/path/to/           Set path to the directory with prothint.py.
                                    (if not specified as PROTHINT_PATH
                                    environment variable). Has higher priority
                                    than environment variable.
--DIAMOND_PATH=/path/to/diamond     Set path to diamond, this is an alternative
                                    to NCIB blast; you only need to specify one
                                    out of DIAMOND_PATH or BLAST_PATH, not both.
                                    DIAMOND is a lot faster that BLAST and yields
                                    highly similar results for BRAKER.
--BLAST_PATH=/path/to/blastall      Set path to NCBI blastall and formatdb
                                    executables if not specified as
                                    environment variable. Has higher priority
                                    than environment variable.
--COMPLEASM_PATH=/path/to/compleasm Set path to compleasm (if not specified as
                                    environment variable). Has higher priority
                                    than environment variable.
--PYTHON3_PATH=/path/to             Set path to python3 executable (if not
                                    specified as envirnonment variable and if
                                    executable is not in your $PATH).
--JAVA_PATH=/path/to                Set path to java executable (if not
                                    specified as environment variable and if
                                    executable is not in your $PATH), only
                                    required with flags --UTR=on and --addUTR=on
--GUSHR_PATH=/path/to               Set path to gushr.py exectuable (if not
                                    specified as an environment variable and if
                                    executable is not in your $PATH), only required
                                    with the flags --UTR=on and --addUTR=on
--MAKEHUB_PATH=/path/to             Set path to make_hub.py (if option --makehub
                                    is used).
--CDBTOOLS_PATH=/path/to            cdbfasta/cdbyank are required for running
                                    fix_in_frame_stop_codon_genes.py. Usage of
                                    that script can be skipped with option
                                    '--skip_fixing_broken_genes'.


EXPERT OPTIONS

--augustus_args="--some_arg=bla"    One or several command line arguments to
                                    be passed to AUGUSTUS, if several
                                    arguments are given, separate them by
                                    whitespace, i.e.
                                    "--first_arg=sth --second_arg=sth".
--skipGeneMark-ES                   Skip GeneMark-ES and use provided
                                    GeneMark-ES output (e.g. provided with
                                    --geneMarkGtf=genemark.gtf)
--skipGeneMark-ET                   Skip GeneMark-ET and use provided
                                    GeneMark-ET output (e.g. provided with
                                    --geneMarkGtf=genemark.gtf)
--skipGeneMark-EP                   Skip GeneMark-EP and use provided
                                    GeneMark-EP output (e.g. provided with
                                    --geneMarkGtf=genemark.gtf)
--skipGeneMark-ETP                  Skip GeneMark-ETP and use provided
                                    GeneMark-ETP output (e.g. provided with
                                    --gmetp_results_dir=GeneMark-ETP/)
--geneMarkGtf=file.gtf              If skipGeneMark-ET is used, braker will by
                                    default look in the working directory in
                                    folder GeneMarkET for an already existing
                                    gtf file. Instead, you may provide such a
                                    file from another location. If geneMarkGtf
                                    option is set, skipGeneMark-ES/ET/EP/ETP is
                                    automatically also set. Note that gene and
                                    transcript ids in the final output may not
                                    match the ids in the input genemark.gtf
                                    because BRAKER internally re-assigns these
                                    ids.
                                    In ETP mode, this option hast to be used together
                                    with --traingenes and --hints to provide BRAKER
                                    with results of a previous GeneMark-ETP run.
--gmetp_results_dir                 Location of results from a previous
                                    GeneMark-ETP run, which will be used to
                                    skip the GeneMark-ETP step. This option
                                    can be used instead of --geneMarkGtf,
                                    --traingenes, and --hints to skip GeneMark.
--rounds                            The number of optimization rounds used in
                                    optimize_augustus.pl (default 5)
--skipAllTraining                   Skip GeneMark-EX (training and
                                    prediction), skip AUGUSTUS training, only
                                    runs AUGUSTUS with pre-trained and already
                                    existing parameters (not recommended).
                                    Hints from input are still generated.
                                    This option automatically sets
                                    --useexisting to true.
--useexisting                       Use the present config and parameter files
                                    if they exist for 'species'; will overwrite
                                    original parameters if BRAKER performs
                                    an AUGUSTUS training.
--filterOutShort                    It may happen that a "good" training gene,
                                    i.e. one that has intron support from
                                    RNA-Seq in all introns predicted by
                                    GeneMark-EX, is in fact too short. This flag
                                    will discard such genes that have
                                    supported introns and a neighboring
                                    RNA-Seq supported intron upstream of the
                                    start codon within the range of the
                                    maximum CDS size of that gene and with a
                                    multiplicity that is at least as high as
                                    20% of the average intron multiplicity of
                                    that gene.
--skipOptimize                      Skip optimize parameter step (not
                                    recommended).
--skipIterativePrediction           Skip iterative prediction in --epmode (does
                                    not affect other modes, saves a bit of runtime)
--skipGetAnnoFromFasta              Skip calling the python3 script
                                    getAnnoFastaFromJoingenes.py from the
                                    AUGUSTUS tool suite. This script requires
                                    python3, biopython and re (regular
                                    expressions) to be installed. It produces
                                    coding sequence and protein FASTA files
                                    from AUGUSTUS gene predictions and provides
                                    information about genes with in-frame stop
                                    codons. If you enable this flag, these files
                                    will not be produced and python3 and
                                    the required modules will not be necessary
                                    for running brkaker.pl.
--skip_fixing_broken_genes          If you do not have python3, you can choose
                                    to skip the fixing of stop codon including
                                    genes (not recommended).
--eval=reference.gtf                Reference set to evaluate predictions
                                    against (using evaluation scripts from GaTech)
--eval_pseudo=pseudo.gff3           File with pseudogenes that will be excluded
                                    from accuracy evaluation (may be empty file)
--AUGUSTUS_hints_preds=s            File with AUGUSTUS hints predictions; will
                                    use this file as basis for UTR training;
                                    only UTR training and prediction is
                                    performed if this option is given.
--flanking_DNA=n                    Size of flanking region, must only be
                                    specified if --AUGUSTUS_hints_preds is given
                                    (for UTR training in a separate braker.pl
                                    run that builds on top of an existing run)
--verbosity=n                       0 -> run braker.pl quiet (no log)
                                    1 -> only log warnings
                                    2 -> also log configuration
                                    3 -> log all major steps
                                    4 -> very verbose, log also small steps
--downsampling_lambda=d             The distribution of introns in training
                                    gene structures generated by GeneMark-EX
                                    has a huge weight on single-exon and
                                    few-exon genes. Specifying the lambda
                                    parameter of a poisson distribution will
                                    make braker call a script for downsampling
                                    of training gene structures according to
                                    their number of introns distribution, i.e.
                                    genes with none or few exons will be
                                    downsampled, genes with many exons will be
                                    kept. Default value is 2.
                                    If you want to avoid downsampling, you have
                                    to specify 0.
--checkSoftware                     Only check whether all required software
                                    is installed, no execution of BRAKER
--nocleanup                         Skip deletion of all files that are typically not
                                    used in an annotation project after
                                    running braker.pl. (For tracking any
                                    problems with a braker.pl run, you
                                    might want to keep these files, therefore
                                    nocleanup can be activated.)


DEVELOPMENT OPTIONS (PROBABLY STILL DYSFUNCTIONAL)

--splice_sites=patterns             list of splice site patterns for UTR
                                    prediction; default: GTAG, extend like this:
                                    --splice_sites=GTAG,ATAC,...
                                    this option only affects UTR training
                                    example generation, not gene prediction
                                    by AUGUSTUS
--overwrite                         Overwrite existing files (except for
                                    species parameter files) Beware, currently
                                    not implemented properly!
--extrinsicCfgFiles=file1,file2,... Depending on the mode in which braker.pl
                                    is executed, it may require one ore several
                                    extrinsicCfgFiles. Don't use this option
                                    unless you know what you are doing!
--stranded=+,-,+,-,...              If UTRs are trained, i.e.~strand-specific
                                    bam-files are supplied and coverage
                                    information is extracted for gene prediction,
                                    create stranded ep hints. The order of
                                    strand specifications must correspond to the
                                    order of bam files. Possible values are
                                    +, -, .
                                    If stranded data is provided, ONLY coverage
                                    data from the stranded data is used to
                                    generate UTR examples! Coverage data from
                                    unstranded data is used in the prediction
                                    step, only.
                                    The stranded label is applied to coverage
                                    data, only. Intron hints are generated
                                    from all libraries treated as "unstranded"
                                    (because splice site filtering eliminates
                                    intron hints from the wrong strand, anyway).
--optCfgFile=ppx.cfg                Optional custom config file for AUGUSTUS
                                    for running PPX (currently not
                                    implemented)
--grass                             Switch this flag on if you are using braker.pl
                                    for predicting genes in grasses with
                                    GeneMark-EX. The flag will enable
                                    GeneMark-EX to handle GC-heterogenicity
                                    within genes more properly.
                                    NOTHING IMPLEMENTED FOR GRASS YET!
--transmasked_fasta=file.fa         Transmasked genome FASTA file for GeneMark-EX
                                    (to be used instead of the regular genome
                                    FASTA file).
--min_contig=INT                    Minimal contig length for GeneMark-EX, could
                                    for example be set to 10000 if transmasked_fasta
                                    option is used because transmasking might
                                    introduce many very short contigs.
--translation_table=INT             Change translation table from non-standard
                                    to something else.
                                    DOES NOT WORK YET BECAUSE BRAKER DOESNT
                                    SWITCH TRANSLATION TABLE FOR GENEMARK-EX, YET!
--gc_probability=DECIMAL            Probablity for donor splice site pattern GC
                                    for gene prediction with GeneMark-EX,
                                    default value is 0.001
--gm_max_intergenic=INT             Adjust maximum allowed size of intergenic
                                    regions in GeneMark-EX. If not used, the value
                                    is automatically determined by GeneMark-EX.
--traingenes=file.gtf               Training genes that are used instead of training
                                    genes generated with GeneMark.
                                    In ETP mode, this option can be used together
                                    with --geneMarkGtf and --hints to provide BRAKER
                                    with results of a previous GeneMark-ETP run, so
                                    that the GeneMark-ETP step can be skipped.
                                    In this case, use training.gtf from that run as
                                    argument.
--UTR=on                            create UTR training examples from RNA-Seq
                                    coverage data; requires options
                                    --bam=rnaseq.bam.
                                    Alternatively, if UTR parameters already
                                    exist, training step will be skipped and
                                    those pre-existing parameters are used.
				    DO NOT USE IN CONTAINER!
				    TRY NOT TO USE AT ALL!
--addUTR=on                         Adds UTRs from RNA-Seq coverage data to
                                    augustus.hints.gtf file. Does not perform
                                    training of AUGUSTUS or gene prediction with
                                    AUGUSTUS and UTR parameters.
				    DO NOT USE IN CONTAINER!
				    TRY NOT TO USE AT ALL!


EXAMPLE

To run with RNA-Seq

braker.pl [OPTIONS] --genome=genome.fa --species=speciesname \
    --bam=accepted_hits.bam
braker.pl [OPTIONS] --genome=genome.fa --species=speciesname \
    --hints=rnaseq.gff

To run with protein sequences

braker.pl [OPTIONS] --genome=genome.fa --species=speciesname \
    --prot_seq=proteins.fa
braker.pl [OPTIONS] --genome=genome.fa --species=speciesname \
    --hints=prothint_augustus.gff

To run with RNA-Seq and protein sequences

braker.pl [OPTIONS] --genome=genome.fa --species=speciesname \
    --prot_seq=proteins.fa --rnaseq_sets_ids=id_rnaseq1,id_rnaseq2 \
    --rnaseq_sets_dir=/path/to/local/rnaseq/files 
braker.pl [OPTIONS] --genome=genome.fa --species=speciesname \
    --prot_seq=proteins.fa --bam=id_rnaseq1.bam,id_rnaseq2.bam

Back to Top

Installation

source code from BRAKER

System

64-bit Linux