GET HOMOLOGUES-Teaching

From Research Computing Center Wiki
Latest revision as of 11:04, 30 August 2019

Category

Bioinformatics

Program On

Teaching

Version

1.7.6

Author / Distributor

GET_HOMOLOGUES

Description

"a versatile software package for pan-genome analysis." More details are at GET_HOMOLOGUES.

Running Program

  • Version 1.7.6, installed at /usr/local/apps/gb/GETHOMOLOGUES/1.7.6

To use this version, please load the module with

ml GETHOMOLOGUES/1.7.6 

Here is an example of a shell script, sub.sh, to run on the batch queue:

#!/bin/bash
#SBATCH --job-name=j_GETHOMOLOGUES
#SBATCH --partition=batch
#SBATCH --mail-type=ALL
#SBATCH --mail-user=username@uga.edu
#SBATCH --ntasks=1
#SBATCH --mem=10gb
#SBATCH --time=08:00:00
#SBATCH --output=GETHOMOLOGUES.%j.out
#SBATCH --error=GETHOMOLOGUES.%j.err

cd $SLURM_SUBMIT_DIR
ml GETHOMOLOGUES/1.7.6

perl /usr/local/apps/gb/GETHOMOLOGUES/1.7.6/get_homologues.pl [options]

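In a real submission script, review at least the values of --mail-user, --ntasks, --mem, and --time (underlined on the wiki page) and replace them with values appropriate for your job. Also replace [options] with the get_homologues.pl options (command and arguments) you want to use.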
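
As an illustration of filling in [options], the sketch below builds one possible command line from flags listed in the help output further down (-d, -M, -t, -n). The directory name genomes, the run helper, and the DRY_RUN switch are assumptions made for this example, not part of GET_HOMOLOGUES or of the cluster setup. By default the command is only printed, so the options can be reviewed before a real run.

```shell
# Sketch only: these lines would take the place of the last line of sub.sh.
GH=/usr/local/apps/gb/GETHOMOLOGUES/1.7.6/get_homologues.pl
INPUT_DIR="genomes"                    # hypothetical input directory (.faa/.gbk, one per genome)
THREADS="${SLURM_CPUS_PER_TASK:-2}"    # falls back to 2, the -n default in the help text

# run: print the command when DRY_RUN=1 (the default here), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# -d input directory, -M OrthoMCL algorithm, -t 0 report all clusters, -n threads
run perl "$GH" -d "$INPUT_DIR" -M -t 0 -n "$THREADS"
```

Setting DRY_RUN=0 executes the command instead of printing it. SLURM sets SLURM_CPUS_PER_TASK only when the job requests --cpus-per-task, which the example script above does not, so the thread count falls back to 2 unless that directive is added.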

Please refer to Running_Jobs_on_the_teaching_cluster, Run X window Jobs, and Run interactive Jobs for more details on running jobs on the Teaching cluster.

Here is an example of a job submission command:

sbatch ./sub.sh 
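
After submission it is often useful to keep the job ID and check the queue. The sketch below wraps this in a hypothetical helper, submit_and_watch, which is not provided by the cluster. It assumes the standard SLURM commands sbatch (with --parsable, which prints only the job ID, optionally followed by ;cluster) and squeue.

```shell
# submit_and_watch SCRIPT: submit SCRIPT with sbatch and show its queue entry.
submit_and_watch() {
    script="$1"
    if ! command -v sbatch >/dev/null 2>&1; then
        echo "sbatch not found: run this on a cluster login node" >&2
        return 1
    fi
    # --parsable prints "jobid" or "jobid;cluster"
    jobid=$(sbatch --parsable "$script") || return 1
    jobid=${jobid%%;*}          # drop the optional ";cluster" suffix
    echo "Submitted job $jobid"
    squeue -j "$jobid"          # current state of this job in the queue
}
```

It would be called as submit_and_watch ./sub.sh. Once the job has finished, its messages are in the files named by --output and --error in the submission script.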

Documentation

ml GETHOMOLOGUES/1.7.6 
perl /usr/local/apps/gb/GETHOMOLOGUES/1.7.6/get_homologues.pl -h

usage: /usr/local/apps/gb/GETHOMOLOGUES/1.7.6/get_homologues.pl [options]

-h this message
-v print version, credits and checks installation
-d directory with input FASTA files ( .faa / .fna ),           (overrides -i,
   GenBank files ( .gbk ), 1 per genome, or a subdirectory      use of pre-clustered sequences
   ( subdir.clusters / subdir_ ) with pre-clustered sequences   ignores -c, -g)
   ( .faa / .fna ); allows for new files to be added later;    
   creates output folder named 'directory_homologues'
-i input amino acid FASTA file with [taxon names] in headers,  (required unless -d is set)
   creates output folder named 'file_homologues'

Optional parameters:
-o only run BLAST/Pfam searches and exit                       (useful to pre-compute searches)
-c report genome composition analysis                          (follows order in -I file if enforced,
                                                                ignores -r,-t,-e)
-R set random seed for genome composition analysis             (optional, requires -c, example -R 1234,
                                                                required for mixing -c with -c -a runs)
-s save memory by using BerkeleyDB; default parsing stores
   sequence hits in RAM
-m runmode [local|cluster]                                     (default local)
-n nb of threads for BLAST/HMMER/MCL in 'local' runmode        (default=2)
-I file with .faa/.gbk files in -d to be included              (takes all by default, requires -d)

Algorithms instead of default bidirectional best-hits (BDBH):
-G use COGtriangle algorithm (COGS, PubMed=20439257)           (requires 3+ genomes|taxa)
-M use orthoMCL algorithm (OMCL, PubMed=12952885)

Options that control sequence similarity searches:
-X use diamond instead of blastp                               (optional, set threads with -n)
-C min %coverage in BLAST pairwise alignments                  (range [1-100],default=75)
-E max E-value                                                 (default=1e-05,max=0.01)
-D require equal Pfam domain composition                       (best with -m cluster or -n threads)
   when defining similarity-based orthology
-S min %sequence identity in BLAST query/subj pairs            (range [1-100],default=1 [BDBH|OMCL])
-N min BLAST neighborhood correlation PubMed=18475320          (range [0,1],default=0 [BDBH|OMCL])
-b compile core-genome with minimum BLAST searches             (ignores -c [BDBH])

Options that control clustering:
-t report sequence clusters including at least t taxa          (default t=numberOfTaxa,
                                                                t=0 reports all clusters [OMCL|COGS])
-a report clusters of sequence features in GenBank files       (requires -d and .gbk files,
   instead of default 'CDS' GenBank features                    example -a 'tRNA,rRNA',
                                                                NOTE: uses blastn instead of blastp,
                                                                ignores -g,-D)
-g report clusters of intergenic sequences flanked by ORFs     (requires -d and .gbk files)
   in addition to default 'CDS' clusters
-f filter by %length difference within clusters                (range [1-100], by default sequence
                                                                length is not checked)
-r reference proteome .faa/.gbk file                           (by default takes file with
                                                                least sequences; with BDBH sets
                                                                first taxa to start adding genes)
-e exclude clusters with inparalogues                          (by default inparalogues are
                                                                included)
-x allow sequences in multiple COG clusters                    (by default sequences are allocated
                                                                to single clusters [COGS])
-F orthoMCL inflation value                                    (range [1-5], default=1.5 [OMCL])
-A calculate average identity of clustered sequences,          (optional, creates tab-separated matrix,
 by default uses blastp results but can use blastn with -a      recommended with -t 0 [OMCL|COGS])
-P calculate percentage of conserved proteins (POCP),          (optional, creates tab-separated matrix,
 by default uses blastp results but can use blastn with -a      recommended with -t 0 [OMCL|COGS])
-z add soft-core to genome composition analysis                (optional, requires -c [OMCL|COGS])

 This program uses BLAST (and optionally HMMER/Pfam) to define clusters of 'orthologous'
 genomic sequences and pan/core-genome gene sets. Several algorithms are available
 and search parameters are customizable. It is designed to process (in a SGE computer
 cluster) files contained in a directory (-d), so that new .faa/.gbk files can be added
 while conserving previous BLAST results. In general the program will try to re-use
 previous results when run with the same input directory.
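
Because -d expects a directory of .faa/.fna or .gbk files, one per genome, and -G requires three or more genomes, a quick pre-flight count of the input files can catch problems before a job is queued. The helper below, count_inputs, is a hypothetical sketch and not part of GET_HOMOLOGUES. It only counts files by extension and does not validate their contents.

```shell
# count_inputs DIR: print how many .faa, .fna and .gbk files DIR contains.
count_inputs() {
    dir="$1"
    n=0
    for f in "$dir"/*.faa "$dir"/*.fna "$dir"/*.gbk; do
        # Unmatched patterns stay literal, so test that the file really exists.
        if [ -f "$f" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}
```

For example, [ "$(count_inputs genomes)" -ge 3 ] could guard a run that uses -G, where genomes is a placeholder directory name.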

Installation

Source code is obtained from GET_HOMOLOGUES

System

64-bit Linux