GET HOMOLOGUES-Teaching

From Research Computing Center Wiki
cd $SLURM_SUBMIT_DIR<br>
ml GETHOMOLOGUES/1.7.6<br>
perl /usr/local/apps/gb/GETHOMOLOGUES/1.7.6/get_homologues.pl <u>[options]</u><br>
</div>
In a real submission script, review at least all of the underlined values above and replace them with values appropriate for your job.
<pre  class="gcommand">
ml GETHOMOLOGUES/1.7.6
perl /usr/local/apps/gb/GETHOMOLOGUES/1.7.6/get_homologues.pl -h
usage: get_homologues.pl [options]

-h this message
-v print version, credits and checks installation
-d directory with input FASTA files ( .faa / .fna ),          (overrides -i,
  GenBank files ( .gbk ), 1 per genome, or a subdirectory      use of pre-clustered sequences
  ( subdir.clusters / subdir_ ) with pre-clustered sequences  ignores -c, -g)
  ( .faa / .fna ); allows for new files to be added later;
  creates output folder named 'directory_homologues'
-i input amino acid FASTA file with [taxon names] in headers,  (required unless -d is set)
  creates output folder named 'file_homologues'

Optional parameters:
-o only run BLAST/Pfam searches and exit                      (useful to pre-compute searches)
-c report genome composition analysis                          (follows order in -I file if enforced,
                                                                ignores -r,-t,-e)
-R set random seed for genome composition analysis            (optional, requires -c, example -R 1234,
                                                                required for mixing -c with -c -a runs)
-s save memory by using BerkeleyDB; default parsing stores
  sequence hits in RAM
-m runmode [local|cluster]                                    (default local)
-n nb of threads for BLAST/HMMER/MCL in 'local' runmode        (default=2)
-I file with .faa/.gbk files in -d to be included              (takes all by default, requires -d)

Algorithms instead of default bidirectional best-hits (BDBH):
-G use COGtriangle algorithm (COGS, PubMed=20439257)          (requires 3+ genomes|taxa)
-M use orthoMCL algorithm (OMCL, PubMed=12952885)

Options that control sequence similarity searches:
-X use diamond instead of blastp                              (optional, set threads with -n)
-C min %coverage in BLAST pairwise alignments                  (range [1-100],default=75)
-E max E-value                                                (default=1e-05,max=0.01)
-D require equal Pfam domain composition                      (best with -m cluster or -n threads)
  when defining similarity-based orthology
-S min %sequence identity in BLAST query/subj pairs            (range [1-100],default=1 [BDBH|OMCL])
-N min BLAST neighborhood correlation PubMed=18475320          (range [0,1],default=0 [BDBH|OMCL])
-b compile core-genome with minimum BLAST searches            (ignores -c [BDBH])

Options that control clustering:
-t report sequence clusters including at least t taxa          (default t=numberOfTaxa,
                                                                t=0 reports all clusters [OMCL|COGS])
-a report clusters of sequence features in GenBank files      (requires -d and .gbk files,
  instead of default 'CDS' GenBank features                    example -a 'tRNA,rRNA',
                                                                NOTE: uses blastn instead of blastp,
                                                                ignores -g,-D)
-g report clusters of intergenic sequences flanked by ORFs    (requires -d and .gbk files)
  in addition to default 'CDS' clusters
-f filter by %length difference within clusters                (range [1-100], by default sequence
                                                                length is not checked)
-r reference proteome .faa/.gbk file                          (by default takes file with
                                                                least sequences; with BDBH sets
                                                                first taxa to start adding genes)
-e exclude clusters with inparalogues                          (by default inparalogues are
                                                                included)
-x allow sequences in multiple COG clusters                    (by default sequences are allocated
                                                                to single clusters [COGS])
-F orthoMCL inflation value                                    (range [1-5], default=1.5 [OMCL])
-A calculate average identity of clustered sequences,          (optional, creates tab-separated matrix,
  by default uses blastp results but can use blastn with -a    recommended with -t 0 [OMCL|COGS])
-P calculate percentage of conserved proteins (POCP),          (optional, creates tab-separated matrix,
  by default uses blastp results but can use blastn with -a    recommended with -t 0 [OMCL|COGS])
-z add soft-core to genome composition analysis                (optional, requires -c [OMCL|COGS])

This program uses BLAST (and optionally HMMER/Pfam) to define clusters of 'orthologous'
genomic sequences and pan/core-genome gene sets. Several algorithms are available
and search parameters are customizable. It is designed to process (in a SGE computer
cluster) files contained in a directory (-d), so that new .faa/.gbk files can be added
while conserving previous BLAST results. In general the program will try to re-use
previous results when run with the same input directory.
</pre>
[[#top|Back to Top]]

Revision as of 11:18, 14 November 2018

Category

Bioinformatics

Program On

Teaching

Version

1.7.6

Author / Distributor

GET_HOMOLOGUES

Description

"a versatile software package for pan-genome analysis". More details are at GET_HOMOLOGUES (https://github.com/eead-csic-compbio/get_homologues).

Running Program

The latest installed version of this application is at /usr/local/apps/gb/GETHOMOLOGUES/1.7.6

To use this version, load the module with:

ml GETHOMOLOGUES/1.7.6 

Here is an example of a shell script, sub.sh, to run on the batch queue:

#!/bin/bash
#SBATCH --job-name=j_GET_HOMOLOGUES
#SBATCH --partition=batch
#SBATCH --mail-type=ALL
#SBATCH --mail-user=username@uga.edu
#SBATCH --ntasks=1
#SBATCH --mem=10gb
#SBATCH --time=08:00:00
#SBATCH --output=GET_HOMOLOGUES.%j.out
#SBATCH --error=GET_HOMOLOGUES.%j.err

cd $SLURM_SUBMIT_DIR
ml GETHOMOLOGUES/1.7.6
perl /usr/local/apps/gb/GETHOMOLOGUES/1.7.6/get_homologues.pl [options]

In a real submission script, review at least all of the underlined values above and replace them with values appropriate for your job.
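
As a sketch of how the [options] placeholder might be filled in, the lines below build an orthoMCL run (-M) over a directory of input files (-d), report all clusters (-t 0), and set the thread count (-n). These flags are the ones listed by get_homologues.pl -h. The directory name my_genomes and the thread count of 4 are assumptions for illustration only. The command is printed rather than executed so that it can be checked before it is pasted into sub.sh.

```shell
#!/bin/bash
# Sketch only: INPUT_DIR and THREADS are hypothetical values, not site defaults.
INPUT_DIR="my_genomes"   # assumed directory holding one .faa or .gbk file per genome
THREADS=4                # assumed; should match the CPUs requested from Slurm

GH=/usr/local/apps/gb/GETHOMOLOGUES/1.7.6/get_homologues.pl

# -d input directory, -M orthoMCL algorithm, -t 0 report all clusters,
# -n number of threads in the default 'local' runmode
CMD=(perl "$GH" -d "$INPUT_DIR" -M -t 0 -n "$THREADS")

# Print the command for review; replace this line with "${CMD[@]}" to run it.
printf '%s\n' "${CMD[*]}"
```

If more than one thread is requested with -n, the job would presumably also need a matching #SBATCH --cpus-per-task line, since the example script requests a single task and no extra CPUs.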

Please refer to Running_Jobs_on_the_teaching_cluster, Run X window Jobs and Run interactive Jobs for more details on running jobs on the teaching cluster.


Here is an example job submission command:

sbatch ./sub.sh 
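
The --output and --error lines in sub.sh use Slurm's %j pattern, which expands to the job ID. A minimal sketch of working out the resulting log file names is shown below; the ID 12345 is a made-up placeholder, and on the cluster the real ID could instead be captured with JOB_ID=$(sbatch --parsable ./sub.sh).

```shell
#!/bin/bash
# Sketch: derive the log file names that sub.sh writes for a given job ID.
# 12345 is a placeholder; set JOB_ID in the environment to use a real ID.
JOB_ID="${JOB_ID:-12345}"

# These names mirror the #SBATCH --output / --error patterns in sub.sh.
OUT_FILE="GET_HOMOLOGUES.${JOB_ID}.out"
ERR_FILE="GET_HOMOLOGUES.${JOB_ID}.err"

echo "stdout: $OUT_FILE"
echo "stderr: $ERR_FILE"
```

Both files are written to the directory from which the job was submitted, because the patterns in sub.sh are relative paths.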

Documentation

ml GETHOMOLOGUES/1.7.6 
perl /usr/local/apps/gb/GETHOMOLOGUES/1.7.6/get_homologues.pl -h
GET_HOMOLOGUES: https://github.com/eead-csic-compbio/get_homologues

Back to Top

Installation

Source code is obtained from GET_HOMOLOGUES

System

64-bit Linux