GET HOMOLOGUES-Teaching

From Research Computing Center Wiki
Revision as of 10:57, 30 August 2019

Category

Bioinformatics

Program On

Sapelo2

Version

1.7.6

Author / Distributor

GET_HOMOLOGUES

Description

"a versatile software package for pan-genome analysis." More details are at GET_HOMOLOGUES (https://github.com/eead-csic-compbio/get_homologues).

Running Program

  • Version 1.7.6, installed at /usr/local/apps/gb/GETHOMOLOGUES/1.7.6

To use this version, please load the module with

ml GETHOMOLOGUES/1.7.6 

Here is an example of a shell script, sub.sh, to run on the batch queue:

#!/bin/bash
#SBATCH --job-name=j_GET_HOMOLOGUES
#SBATCH --partition=batch
#SBATCH --ntasks=1
#SBATCH --time=4:00:00
#SBATCH --mem=2gb

cd $SLURM_SUBMIT_DIR

ml GETHOMOLOGUES/1.7.6
get_homologues.pl [options]


For more details on running jobs on the teaching cluster, please refer to Running_Jobs_on_the_teaching_cluster, Run X window Jobs, and Run interactive Jobs.


Here is an example of a job submission command:

sbatch ./sub.sh 
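
As a minimal sketch of the submission workflow, the snippet below writes a Slurm script named sub.sh and prints the command to submit it. The input directory name `genomes`, the thread count, and the memory and time values are placeholders chosen for illustration, not site recommendations; `write_sub_script` is a hypothetical helper, not part of GET_HOMOLOGUES.

```shell
#!/bin/bash
# Sketch: generate a Slurm submission script for GET_HOMOLOGUES.
# Assumptions: "genomes" is a hypothetical input directory, and the
# resource values below are placeholders to be reviewed before real use.

write_sub_script() {
    local out="$1"        # path of the script to create
    local threads="$2"    # used for both --cpus-per-task and -n
    local input_dir="$3"  # directory passed to get_homologues.pl -d
    cat > "$out" <<EOF
#!/bin/bash
#SBATCH --job-name=j_GET_HOMOLOGUES
#SBATCH --partition=batch
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=${threads}
#SBATCH --mem=10gb
#SBATCH --time=08:00:00
#SBATCH --output=GET_HOMOLOGUES.%j.out
#SBATCH --error=GET_HOMOLOGUES.%j.err

cd \$SLURM_SUBMIT_DIR
ml GETHOMOLOGUES/1.7.6
get_homologues.pl -d ${input_dir} -n ${threads}
EOF
}

write_sub_script sub.sh 4 genomes

# Show the generated script, then the command that would submit it.
cat sub.sh
echo "Submit with: sbatch ./sub.sh"
```

Keeping `--cpus-per-task` and the `-n` option in sync avoids requesting more threads from BLAST than the scheduler has allocated to the job.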

Documentation

ml GETHOMOLOGUES/1.7.6 
perl /usr/local/apps/gb/GETHOMOLOGUES/1.7.6/get_homologues.pl -h
usage: /usr/local/apps/gb/GETHOMOLOGUES/1.7.6/get_homologues.pl [options]

-h this message
-v print version, credits and checks installation
-d directory with input FASTA files ( .faa / .fna ),           (overrides -i,
   GenBank files ( .gbk ), 1 per genome, or a subdirectory      use of pre-clustered sequences
   ( subdir.clusters / subdir_ ) with pre-clustered sequences   ignores -c, -g)
   ( .faa / .fna ); allows for new files to be added later;    
   creates output folder named 'directory_homologues'
-i input amino acid FASTA file with [taxon names] in headers,  (required unless -d is set)
   creates output folder named 'file_homologues'

Optional parameters:
-o only run BLAST/Pfam searches and exit                       (useful to pre-compute searches)
-c report genome composition analysis                          (follows order in -I file if enforced,
                                                                ignores -r,-t,-e)
-R set random seed for genome composition analysis             (optional, requires -c, example -R 1234,
                                                                required for mixing -c with -c -a runs)
-s save memory by using BerkeleyDB; default parsing stores
   sequence hits in RAM
-m runmode [local|cluster]                                     (default local)
-n nb of threads for BLAST/HMMER/MCL in 'local' runmode        (default=2)
-I file with .faa/.gbk files in -d to be included              (takes all by default, requires -d)

Algorithms instead of default bidirectional best-hits (BDBH):
-G use COGtriangle algorithm (COGS, PubMed=20439257)           (requires 3+ genomes|taxa)
-M use orthoMCL algorithm (OMCL, PubMed=12952885)

Options that control sequence similarity searches:
-X use diamond instead of blastp                               (optional, set threads with -n)
-C min %coverage in BLAST pairwise alignments                  (range [1-100],default=75)
-E max E-value                                                 (default=1e-05,max=0.01)
-D require equal Pfam domain composition                       (best with -m cluster or -n threads)
   when defining similarity-based orthology
-S min %sequence identity in BLAST query/subj pairs            (range [1-100],default=1 [BDBH|OMCL])
-N min BLAST neighborhood correlation PubMed=18475320          (range [0,1],default=0 [BDBH|OMCL])
-b compile core-genome with minimum BLAST searches             (ignores -c [BDBH])

Options that control clustering:
-t report sequence clusters including at least t taxa          (default t=numberOfTaxa,
                                                                t=0 reports all clusters [OMCL|COGS])
-a report clusters of sequence features in GenBank files       (requires -d and .gbk files,
   instead of default 'CDS' GenBank features                    example -a 'tRNA,rRNA',
                                                                NOTE: uses blastn instead of blastp,
                                                                ignores -g,-D)
-g report clusters of intergenic sequences flanked by ORFs     (requires -d and .gbk files)
   in addition to default 'CDS' clusters
-f filter by %length difference within clusters                (range [1-100], by default sequence
                                                                length is not checked)
-r reference proteome .faa/.gbk file                           (by default takes file with
                                                                least sequences; with BDBH sets
                                                                first taxa to start adding genes)
-e exclude clusters with inparalogues                          (by default inparalogues are
                                                                included)
-x allow sequences in multiple COG clusters                    (by default sequences are allocated
                                                                to single clusters [COGS])
-F orthoMCL inflation value                                    (range [1-5], default=1.5 [OMCL])
-A calculate average identity of clustered sequences,          (optional, creates tab-separated matrix,
 by default uses blastp results but can use blastn with -a      recommended with -t 0 [OMCL|COGS])
-P calculate percentage of conserved proteins (POCP),          (optional, creates tab-separated matrix,
 by default uses blastp results but can use blastn with -a      recommended with -t 0 [OMCL|COGS])
-z add soft-core to genome composition analysis                (optional, requires -c [OMCL|COGS])

 This program uses BLAST (and optionally HMMER/Pfam) to define clusters of 'orthologous'
 genomic sequences and pan/core-genome gene sets. Several algorithms are available
 and search parameters are customizable. It is designed to process (in a SGE computer
 cluster) files contained in a directory (-d), so that new .faa/.gbk files can be added
 while conserving previous BLAST results. In general the program will try to re-use
 previous results when run with the same input directory.
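
To illustrate how the options in the help text combine, here is a hedged sketch that assembles one possible command line: an orthoMCL run over a directory of genomes that reports all clusters. The directory name `genomes` and the thread count are assumptions made for the example; every flag is taken from the usage message above.

```shell
#!/bin/bash
# Sketch: assemble a get_homologues.pl command from documented options.
# "genomes" is a hypothetical directory with one .faa or .gbk file per genome.

input_dir="genomes"
threads=4

args=(
    -d "$input_dir"   # input directory, one file per genome
    -M                # orthoMCL algorithm instead of the default BDBH
    -t 0              # report all clusters, not only those covering every taxon
    -n "$threads"     # threads for BLAST/HMMER/MCL in local runmode
)

cmd="get_homologues.pl ${args[*]}"
echo "$cmd"

# Run only if the module is loaded and the input directory exists.
if command -v get_homologues.pl >/dev/null 2>&1 && [ -d "$input_dir" ]; then
    get_homologues.pl "${args[@]}"
fi
```

According to the help text, results would be written to an output folder named after the input directory with the suffix `_homologues`.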


Installation

Source code was obtained from GET_HOMOLOGUES (https://github.com/eead-csic-compbio/get_homologues).

System

64-bit Linux