StringTie-Teaching

From Research Computing Center Wiki

Revision as of 14:00, 6 September 2023

Category

Bioinformatics

Program On

Teaching

Version

2.2.1

Author / Distributor

StringTie

Description

"StringTie is a fast and highly efficient assembler of RNA-Seq alignments into potential transcripts." More details are at StringTie

Running Program

Version 2.2.1 of this application is installed in /apps/eb/StringTie/2.2.1-GCC-11.2.0

To use this version, please load the module with

ml StringTie/2.2.1-GCC-11.2.0

Here is an example of a shell script, sub.sh, to run on the batch queue:

#!/bin/bash
#SBATCH --job-name=j_StringTie
#SBATCH --partition=batch
#SBATCH --mail-type=ALL
#SBATCH --mail-user=username@uga.edu
#SBATCH --ntasks=1
#SBATCH --mem=10gb
#SBATCH --time=08:00:00
#SBATCH --output=StringTie.%j.out
#SBATCH --error=StringTie.%j.err

cd $SLURM_SUBMIT_DIR
ml StringTie/2.2.1-GCC-11.2.0
stringtie [options]

In a real submission script, review the values above and replace the placeholders, such as the email address, the resource requests, and [options], with values appropriate for your job.
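
As an illustration of filling in [options], the sketch below builds a single-sample assembly command from flags listed in the stringtie --help output in the Documentation section (-G, -o, -l, -p). The file names sample1.sorted.bam and annotation.gtf and the prefix sample1 are hypothetical placeholders. Reading the thread count from SLURM_CPUS_PER_TASK assumes the job requests CPUs with #SBATCH --cpus-per-task, which the script above does not do; without it the sketch falls back to one thread.

```shell
# Hypothetical input and output names; replace with your own files.
BAM=sample1.sorted.bam        # sorted RNA-Seq alignments
GUIDE=annotation.gtf          # reference annotation passed to -G
OUT=sample1.gtf               # assembled transcripts

# Use the CPUs granted by Slurm; fall back to 1 outside a job
# or when --cpus-per-task was not requested.
THREADS=${SLURM_CPUS_PER_TASK:-1}

# Build the command line that replaces "stringtie [options]".
set -- stringtie "$BAM" -G "$GUIDE" -o "$OUT" -l sample1 -p "$THREADS"
echo "Running: $*"

# Run only when stringtie is on PATH (i.e. the module is loaded).
if command -v stringtie >/dev/null 2>&1; then
    "$@"
else
    echo "stringtie not found; load the module first" >&2
fi
```

Keeping -p tied to the Slurm allocation avoids starting more threads than the job was granted.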

Please refer to Running_Jobs_on_the_teaching_cluster, Run X window Jobs and Run interactive Jobs for more details on running jobs on the teaching cluster.


Here is an example of a job submission command:

sbatch ./sub.sh 
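
When scripting around the submission, it can help to keep the job ID. sbatch normally prints a message of the form "Submitted batch job <id>"; the helper below, whose name job_id_from_sbatch is made up for this sketch, extracts the ID from that message so it can be passed to squeue.

```shell
# Hypothetical helper: pull the job ID out of sbatch's
# "Submitted batch job <id>" message (the ID is the last field).
job_id_from_sbatch() {
    printf '%s\n' "$1" | awk '{print $NF}'
}

# Submit only where Slurm is available, e.g. on a cluster login node.
if command -v sbatch >/dev/null 2>&1; then
    if msg=$(sbatch ./sub.sh); then
        jobid=$(job_id_from_sbatch "$msg")
        squeue -j "$jobid"        # show the job's state in the queue
    fi
fi
```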

Documentation

ml StringTie/2.2.1-GCC-11.2.0

stringtie --help

StringTie v2.2.1 usage:

stringtie <in.bam ..> [-G <guide_gff>] [-l <prefix>] [-o <out.gtf>] [-p <cpus>]
 [-v] [-a <min_anchor_len>] [-m <min_len>] [-j <min_anchor_cov>] [-f <min_iso>]
 [-c <min_bundle_cov>] [-g <bdist>] [-u] [-L] [-e] [--viral] [-E <err_margin>]
 [--ptf <f_tab>] [-x <seqid,..>] [-A <gene_abund.out>] [-h] {-B|-b <dir_path>}
 [--mix] [--conservative] [--rf] [--fr]
Assemble RNA-Seq alignments into potential transcripts.
Options:
 --version : print just the version at stdout and exit
 --conservative : conservative transcript assembly, same as -t -c 1.5 -f 0.05
 --mix : both short and long read data alignments are provided
        (long read alignments must be the 2nd BAM/CRAM input file)
 --rf : assume stranded library fr-firststrand
 --fr : assume stranded library fr-secondstrand
 -G reference annotation to use for guiding the assembly process (GTF/GFF)
 --ptf : load point-features from a given 4 column feature file <f_tab>
 -o output path/file name for the assembled transcripts GTF (default: stdout)
 -l name prefix for output transcripts (default: STRG)
 -f minimum isoform fraction (default: 0.01)
 -L long reads processing; also enforces -s 1.5 -g 0 (default:false)
 -R if long reads are provided, just clean and collapse the reads but
    do not assemble
 -m minimum assembled transcript length (default: 200)
 -a minimum anchor length for junctions (default: 10)
 -j minimum junction coverage (default: 1)
 -t disable trimming of predicted transcripts based on coverage
    (default: coverage trimming is enabled)
 -c minimum reads per bp coverage to consider for multi-exon transcript
    (default: 1)
 -s minimum reads per bp coverage to consider for single-exon transcript
    (default: 4.75)
 -v verbose (log bundle processing details)
 -g maximum gap allowed between read mappings (default: 50)
 -M fraction of bundle allowed to be covered by multi-hit reads (default:1)
 -p number of threads (CPUs) to use (default: 1)
 -A gene abundance estimation output file
 -E define window around possibly erroneous splice sites from long reads to
    look out for correct splice sites (default: 25)
 -B enable output of Ballgown table files which will be created in the
    same directory as the output GTF (requires -G, -o recommended)
 -b enable output of Ballgown table files but these files will be 
    created under the directory path given as <dir_path>
 -e only estimate the abundance of given reference transcripts (requires -G)
 --viral : only relevant for long reads from viral data where splice sites
    do not follow consensus (default:false)
 -x do not assemble any transcripts on the given reference sequence(s)
 -u no multi-mapping correction (default: correction enabled)
 -h print this usage message and exit
 --ref/--cram-ref reference genome FASTA file for CRAM input

Transcript merge usage mode: 
  stringtie --merge [Options] { gtf_list | strg1.gtf ...}
With this option StringTie will assemble transcripts from multiple
input files generating a unified non-redundant set of isoforms. In this mode
the following options are available:
  -G <guide_gff>   reference annotation to include in the merging (GTF/GFF3)
  -o <out_gtf>     output file name for the merged transcripts GTF
                    (default: stdout)
  -m <min_len>     minimum input transcript length to include in the merge
                    (default: 50)
  -c <min_cov>     minimum input transcript coverage to include in the merge
                    (default: 0)
  -F <min_fpkm>    minimum input transcript FPKM to include in the merge
                    (default: 1.0)
  -T <min_tpm>     minimum input transcript TPM to include in the merge
                    (default: 1.0)
  -f <min_iso>     minimum isoform fraction (default: 0.01)
  -g <gap_len>     gap between transcripts to merge together (default: 250)
  -i               keep merged transcripts with retained introns; by default
                   these are not kept unless there is strong evidence for them
  -l <label>       name prefix for output transcripts (default: MSTRG)
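
The two usage modes above are commonly combined into a three-step workflow: assemble each sample, merge the per-sample GTFs, then re-estimate abundances against the merged annotation with -e -B. The sketch below only prints the commands (a dry run) and uses only flags from the help text. The sample names, annotation.gtf, merged.gtf, and the ballgown/ directory layout are hypothetical choices for illustration.

```shell
# Hypothetical sample names; each is assumed to have <name>.sorted.bam.
SAMPLES="ctrl1 treat1"
GUIDE=annotation.gtf

# Print the three-step workflow; pipe the output to "sh" to execute it.
stringtie_workflow() {
    # Step 1: assemble transcripts per sample, guided by the annotation.
    gtfs=""
    for s in $SAMPLES; do
        echo "stringtie $s.sorted.bam -G $GUIDE -l $s -o $s.gtf"
        gtfs="$gtfs $s.gtf"
    done

    # Step 2: merge the per-sample assemblies into one non-redundant set.
    echo "stringtie --merge -G $GUIDE -o merged.gtf$gtfs"

    # Step 3: estimate abundances for the merged transcripts only (-e)
    # and write Ballgown tables next to each output GTF (-B).
    for s in $SAMPLES; do
        echo "stringtie $s.sorted.bam -e -B -G merged.gtf -o ballgown/$s/$s.gtf"
    done
}

stringtie_workflow
```

Writing each sample's step 3 output to its own directory keeps the Ballgown table files of different samples from overwriting one another, since -B places them alongside the output GTF.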

Installation

Source code is obtained from StringTie

System

64-bit Linux