Trinity-Teaching

From Research Computing Center Wiki

Latest revision as of 11:24, 15 August 2018

Category

Bioinformatics

Program On

Teaching

Version

2.6.6

Author / Distributor

Trinity

Description

"Trinity represents a novel method for the efficient and robust de novo reconstruction of transcriptomes from RNA-Seq data. Trinity combines three independent software modules: Inchworm, Chrysalis, and Butterfly, applied sequentially to process large volumes of RNA-Seq reads." More details are at Trinity

Running Program

The latest version of this application is installed at /usr/local/apps/eb/Trinity/2.6.6-foss-2016b

To use this version, please load the module with

ml Trinity/2.6.6-foss-2016b 

Here is an example of a shell script, sub.sh, to run on the batch queue:

#!/bin/bash
#SBATCH --job-name=j_Trinity
#SBATCH --partition=batch
#SBATCH --mail-type=ALL
#SBATCH --mail-user=username@uga.edu
#SBATCH --ntasks=1
#SBATCH --mem=10gb
#SBATCH --time=08:00:00
#SBATCH --output=Trinity.%j.out
#SBATCH --error=Trinity.%j.err

cd $SLURM_SUBMIT_DIR
ml Trinity/2.6.6-foss-2016b
Trinity [options]

In a real submission script, at least the values underlined above (for example, the Trinity [options]) need to be reviewed and replaced with values appropriate for your job.
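As an illustration of replacing the placeholder values, the commands below write a filled-in sub.sh for a paired-end assembly. This is only a sketch under stated assumptions: the read file names (reads_1.fq, reads_2.fq), the 4 CPU cores, and the 8G memory figure are hypothetical choices, not site recommendations. The Trinity options themselves are taken from the usage text in the Documentation section.

```shell
# Write a filled-in sub.sh; file names and resource values are placeholders.
cat > sub.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=j_Trinity
#SBATCH --partition=batch
#SBATCH --mail-type=ALL
#SBATCH --mail-user=username@uga.edu
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=10gb
#SBATCH --time=08:00:00
#SBATCH --output=Trinity.%j.out
#SBATCH --error=Trinity.%j.err

cd "$SLURM_SUBMIT_DIR"
ml Trinity/2.6.6-foss-2016b

# Paired-end FASTQ input; reads_1.fq and reads_2.fq are placeholder names.
# --max_memory is kept below the 10gb requested from Slurm (an assumed safety margin),
# and --CPU matches --cpus-per-task.
Trinity --seqType fq --max_memory 8G \
        --left reads_1.fq --right reads_2.fq \
        --CPU 4 --output trinity_out_dir
EOF
```

Keeping --CPU equal to the cores requested from Slurm, and --max_memory no larger than --mem, avoids asking Trinity to use more resources than the job was granted.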

Please refer to Running_Jobs_on_the_teaching_cluster, Run X window Jobs and Run interactive Jobs for more details on running jobs on the teaching cluster.


Here is an example of a job submission command:

sbatch ./sub.sh 
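Once the job has finished, a quick way to check the result is to count the assembled contigs. The sketch below assumes the assembly was written to Trinity.fasta inside the default output directory trinity_out_dir. The helper name count_contigs is hypothetical and is not part of Trinity.

```shell
# Count assembled contigs in a FASTA file: each contig has one header line starting with '>'.
count_contigs() {
    local fasta="$1"
    # Refuse to report a count for a missing or empty file.
    if [ ! -s "$fasta" ]; then
        echo "missing or empty: $fasta" >&2
        return 1
    fi
    grep -c '^>' "$fasta"
}

# Example call (path assumes the default output directory):
# count_contigs trinity_out_dir/Trinity.fasta
```

If the file is missing, the Trinity.<jobid>.err file named in the submission script is the first place to look for the reason.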

Documentation

ml Trinity/2.6.6-foss-2016b 
Trinity --show_full_usage_info



###############################################################################
#

     ______  ____   ____  ____   ____  ______  __ __
    |      ||    \ |    ||    \ |    ||      ||  |  |
    |      ||  D  ) |  | |  _  | |  | |      ||  |  |
    |_|  |_||    /  |  | |  |  | |  | |_|  |_||  ~  |
      |  |  |    \  |  | |  |  | |  |   |  |  |___, |
      |  |  |  .  \ |  | |  |  | |  |   |  |  |     |
      |__|  |__|\_||____||__|__||____|  |__|  |____/

#
#
# Required:
#
#  --seqType <string>      :type of reads: ('fa' or 'fq')
#
#  --max_memory <string>      :suggested max memory to use by Trinity where limiting can be enabled. (jellyfish, sorting, etc)
#                            provided in Gb of RAM, ie.  '--max_memory 10G'
#
#  If paired reads:
#      --left  <string>    :left reads, one or more file names (separated by commas, no spaces)
#      --right <string>    :right reads, one or more file names (separated by commas, no spaces)
#
#  Or, if unpaired reads:
#      --single <string>   :single reads, one or more file names, comma-delimited (note, if single file contains pairs, can use flag: --run_as_paired )
#
#  Or,
#      --samples_file <string>         tab-delimited text file indicating biological replicate relationships.
#                                   ex.
#                                        cond_A    cond_A_rep1    A_rep1_left.fq    A_rep1_right.fq
#                                        cond_A    cond_A_rep2    A_rep2_left.fq    A_rep2_right.fq
#                                        cond_B    cond_B_rep1    B_rep1_left.fq    B_rep1_right.fq
#                                        cond_B    cond_B_rep2    B_rep2_left.fq    B_rep2_right.fq
#
#                      # if single-end instead of paired-end, then leave the 4th column above empty.
#
####################################
##  Misc:  #########################
#
#  --SS_lib_type <string>          :Strand-specific RNA-Seq read orientation.
#                                   if paired: RF or FR,
#                                   if single: F or R.   (dUTP method = RF)
#                                   See web documentation.
#
#  --CPU <int>                     :number of CPUs to use, default: 2
#  --min_contig_length <int>       :minimum assembled contig length to report
#                                   (def=200)
#
#  --long_reads <string>           :fasta file containing error-corrected or circular consensus (CCS) pac bio reads
#                                   (** note: experimental parameter **, this functionality continues to be under development)
#
#  --genome_guided_bam <string>    :genome guided mode, provide path to coordinate-sorted bam file.
#                                   (see genome-guided param section under --show_full_usage_info)
#
#  --jaccard_clip                  :option, set if you have paired reads and
#                                   you expect high gene density with UTR
#                                   overlap (use FASTQ input file format
#                                   for reads).
#                                   (note: jaccard_clip is an expensive
#                                   operation, so avoid using it unless
#                                   necessary due to finding excessive fusion
#                                   transcripts w/o it.)
#
#  --trimmomatic                   :run Trimmomatic to quality trim reads
#                                        see '--quality_trimming_params' under full usage info for tailored settings.
#                                  
#
#  --no_normalize_reads            :Do *not* run in silico normalization of reads. Defaults to max. read coverage of 50.
#                                       see '--normalize_max_read_cov' under full usage info for tailored settings.
#                                       (note, as of Sept 21, 2016, normalization is on by default)
#     
#  --no_distributed_trinity_exec   :do not run Trinity phase 2 (assembly of partitioned reads), and stop after generating command list.
#
#
#  --output <string>               :name of directory for output (will be
#                                   created if it doesn't already exist)
#                                   default( your current working directory: "/home/yhuang/projects/kpan/1.5.9/trinity_out_dir" 
#                                    note: must include 'trinity' in the name as a safety precaution! )
#             
#  --workdir <string>              :where Trinity phase-2 assembly computation takes place (defaults to --output setting).
#                                  (can set this to a node-local drive or RAM disk)     
#  
#  --full_cleanup                  :only retain the Trinity fasta file, rename as ${output_dir}.Trinity.fasta
#
#  --cite                          :show the Trinity literature citation
#
#  --verbose                       :provide additional job status info during the run.
#
#  --version                       :reports Trinity version (Trinity-v2.6.6) and exits.
#
#  --show_full_usage_info          :show the many many more options available for running Trinity (expert usage).

#
#  --KMER_SIZE <int>               :kmer length to use (default: 25)  max=32
#
#  --prep                          :Only prepare files (high I/O usage) and stop before kmer counting.
#
#  --no_cleanup                    :retain all intermediate input files.
#
#  --no_version_check              :dont run a network check to determine if software updates are available.
#
#  --monitoring                    :use collectl to monitor all steps of Trinity
#     --monitor_sec <int>          : number of seconds for each interval of runtime monitoring (default: 60)
#  
####################################################
# Inchworm and K-mer counting-related options: #####
#
#  --min_kmer_cov <int>           :min count for K-mers to be assembled by
#                                  Inchworm (default: 1)
#  --inchworm_cpu <int>           :number of CPUs to use for Inchworm, default is min(6, --CPU option)
#
#  --no_run_inchworm              :stop after running jellyfish, before inchworm. (phase 1, read clustering only)
#
###################################
# Chrysalis-related options: ######
#
#  --max_reads_per_graph <int>    :maximum number of reads to anchor within
#                                  a single graph (default: 200000)
#  --min_glue <int>               :min number of reads needed to glue two inchworm contigs
#                                  together. (default: 2) 
#
#  --no_bowtie                    :dont run bowtie to use pair info in chrysalis clustering.
#
#  --no_run_chrysalis             :stop after running inchworm, before chrysalis. (phase 1, read clustering only)
#
#####################################
###  Butterfly-related options:  ####
#
#  --bfly_opts <string>            :additional parameters to pass through to butterfly
#                                   (see butterfly options: java -jar Butterfly.jar ).
#                                   (note: only for expert or experimental use.  Commonly used parameters are exposed through this Trinity menu here).
#
#
#  Butterfly read-pair grouping settings (used to define 'pair paths'):
#
#  --group_pairs_distance <int>    :maximum length expected between fragment pairs (default: 500)
#                                   (reads outside this distance are treated as single-end)
#
#  ///////////////////////////////////////////////
#  Butterfly default reconstruction mode settings.
#                                   
#  --path_reinforcement_distance <int>   :minimum overlap of reads with growing transcript 
#                                         path (default: PE: 25, SE: 25)
#                                         Set to 1 for the most lenient path extension requirements.
#
#
#  /////////////////////////////////////////
#  Butterfly transcript reduction settings:
#
#  --no_path_merging            : all final transcript candidates are output (including SNP variations, however, some SNPs may be unphased)  
#
#  By default, alternative transcript candidates are merged (in reality, discarded) if they are found to be too similar, according to the following logic:
#
#  (identity=(numberOfMatches/shorterLen) > 95.0% or if we have <= 2 mismatches) and if we have internal gap lengths <= 10
#
#  with parameters as:
#      
#      --min_per_id_same_path <int>          default: 98     min percent identity for two paths to be merged into single paths
#      --max_diffs_same_path <int>           default: 2      max allowed differences encountered between path sequences to combine them
#      --max_internal_gap_same_path <int>    default: 10     maximum number of internal consecutive gap characters allowed for paths to be merged into single paths.
#
#      If, in a comparison between two alternative transcripts, they are found too similar, the transcript with the greatest cumulative 
#      compatible read (pair-path) support is retained, and the other is discarded.
#
#
#  //////////////////////////////////////////////
#  Butterfly Java and parallel execution settings.
#
#  --bflyHeapSpaceMax <string>     :java max heap space setting for butterfly
#                                   (default: 4G) => yields command
#                  'java -Xmx4G -jar Butterfly.jar ... $bfly_opts'
#  --bflyHeapSpaceInit <string>    :java initial heap space settings for
#                                   butterfly (default: 1G) => yields command
#                  'java -Xms1G -jar Butterfly.jar ... $bfly_opts'
#  --bflyGCThreads <int>           :threads for garbage collection
#                                   (default: 2))
#  --bflyCPU <int>                 :CPUs to use (default will be normal 
#                                   number of CPUs; e.g., 2)
#  --bflyCalculateCPU              :Calculate CPUs based on 80% of max_memory
#                                   divided by maxbflyHeapSpaceMax
#
#  --bfly_jar <string>             : /path/to/Butterfly.jar, otherwise default
#                                    Trinity-installed version is used. 
#                                    
#
################################################################################
#### Quality Trimming Options ####  
# 
#  --quality_trimming_params <string>   defaults to: "ILLUMINACLIP:/usr/local/apps/eb/Trinity/2.6.6-foss-2016b/trinityrnaseq-Trinity-v2.6.6/trinity-plugins/Trimmomatic/adapters/TruSeq3-PE.fa:2:30:10 SLIDINGWINDOW:4:5 LEADING:5 TRAILING:5 MINLEN:25"
#
################################################################################
####  In silico Read Normalization Options ###
#
#  --normalize_max_read_cov <int>       defaults to 50
#  --normalize_by_read_set              run normalization separate for each pair of fastq files,
#                                       then one final normalization that combines the individual normalized reads.
#                                       Consider using this if RAM limitations are a consideration.
#
################################################################################
#### Genome-guided de novo assembly
# 
#  * required:
#
# --genome_guided_max_intron <int>     :maximum allowed intron length (also maximum fragment span on genome)
#
#  * optional:
#
# --genome_guided_min_coverage <int>   :minimum read coverage for identifying and expressed region of the genome. (default: 1)
#
# --genome_guided_min_reads_per_partition <int>   :default min of 10 reads per partition
#
#
#######################################################################
# Trinity phase 2 (parallel assembly of read clusters) Options: #######
#
#  --grid_exec <string>                 :your command-line utility for submitting jobs to the grid.
#                                        This should be a command line tool that accepts a single parameter:
#                                        ${your_submission_tool} /path/to/file/containing/commands.txt
#                                        and this submission tool should exit(0) upon successful 
#                                        completion of all commands.
#
#  --grid_node_CPU <int>                number of threads for each parallel process to leverage. (default: 1)
#
#  --grid_node_max_memory <string>         max memory targeted for each grid node. (default: 1G)
#
#            The --grid_node_CPU and --grid_node_max_memory are applied as 
#              the --CPU and --max_memory parameters for the Trinity jobs run in 
#              Trinity Phase 2 (assembly of read clusters)
#
    #
#
###############################################################################
#
#  *Note, a typical Trinity command might be:
#
#        Trinity --seqType fq --max_memory 50G --left reads_1.fq  --right reads_2.fq --CPU 6
#
#
#    and for Genome-guided Trinity:
#
#        Trinity --genome_guided_bam rnaseq_alignments.csorted.bam --max_memory 50G
#                --genome_guided_max_intron 10000 --CPU 6
#
#     see: /usr/local/apps/eb/Trinity/2.6.6-foss-2016b/trinityrnaseq-Trinity-v2.6.6/sample_data/test_Trinity_Assembly/
#          for sample data and 'runMe.sh' for example Trinity execution
#
#     For more details, visit: http://trinityrnaseq.github.io
#
###############################################################################
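The --samples_file option described above expects a tab-delimited file, which is easy to get wrong when typed by hand. The following sketch generates the four-line example from the usage text and checks that every line has exactly four tab-separated fields. The file name samples.txt is an assumption, and the condition and read names are the placeholders from the usage text.

```shell
# Build a tab-delimited samples file; names follow the example in the usage text.
samples=samples.txt
: > "$samples"
for cond in cond_A cond_B; do
    for rep in 1 2; do
        tag="${cond#cond_}_rep${rep}"    # e.g. A_rep1
        printf '%s\t%s\t%s\t%s\n' \
            "$cond" "${cond}_rep${rep}" "${tag}_left.fq" "${tag}_right.fq" >> "$samples"
    done
done

# Sanity check: every line must have exactly four tab-separated fields.
awk -F'\t' 'NF != 4 { bad = 1; print "line " NR ": expected 4 fields, got " NF }
            END { exit bad }' "$samples"

# The file would then be passed to Trinity in place of --left/--right, for example:
# Trinity --seqType fq --max_memory 8G --samples_file samples.txt --CPU 4
```

For single-end data the usage text says to leave the fourth column empty, so the field-count check above would need to be relaxed accordingly.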


    


Installation

Source code is obtained from Trinity

System

64-bit Linux