GeneMarkES-Teaching

From Research Computing Center Wiki

Latest revision as of 11:20, 15 August 2018

Category

Bioinformatics

Program On

Teaching

Version

4.33

Author / Distributor

GeneMarkES

Description

"Gene Prediction in Eukaryotes. Novel genomes can be analyzed by the program GeneMark-ES utilizing unsupervised training." More details are available from GeneMarkES (http://exon.gatech.edu/GeneMark/).

Running Program

The latest version of this application is installed at /usr/local/apps/gb/genemarkes/4.33

To use this version, please load the module with

ml genemarkes/4.33 

Here is an example of a shell script, sub.sh, to run on the batch queue:

#!/bin/bash
#SBATCH --job-name=j_GeneMarkES
#SBATCH --partition=batch
#SBATCH --mail-type=ALL
#SBATCH --mail-user=username@uga.edu
#SBATCH --ntasks=1
#SBATCH --mem=10gb
#SBATCH --time=08:00:00
#SBATCH --output=GeneMarkES.%j.out
#SBATCH --error=GeneMarkES.%j.err

cd $SLURM_SUBMIT_DIR
ml genemarkes/4.33
perl /usr/local/apps/gb/genemarkes/4.33/gmes_petap.pl [options] --sequence [filename]

In a real submission script, review at least the placeholder values shown above, such as username@uga.edu and [options], and replace them with values appropriate for your job.

Please refer to Running_Jobs_on_the_teaching_cluster, Run X window Jobs and Run interactive Jobs for more details on running jobs on the teaching cluster.
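
As a minimal sketch of how the template above might be filled in, the snippet below writes a sub.sh for a self-training run. The input file name genome.fasta and the request for 4 CPUs are illustrative assumptions, not site recommendations; --ES, --cores and --sequence are taken from the usage text in the Documentation section, and SLURM_CPUS_PER_TASK is the variable SLURM sets when --cpus-per-task is given.

```shell
#!/bin/bash
# Sketch: write a filled-in sub.sh for a GeneMark-ES self-training run.
# genome.fasta and the 4-CPU request are illustrative assumptions.
cat > sub.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=j_GeneMarkES
#SBATCH --partition=batch
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=10gb
#SBATCH --time=08:00:00
#SBATCH --output=GeneMarkES.%j.out
#SBATCH --error=GeneMarkES.%j.err

cd "$SLURM_SUBMIT_DIR"
ml genemarkes/4.33
# --ES runs self-training; --cores is set to the CPUs requested from SLURM
perl /usr/local/apps/gb/genemarkes/4.33/gmes_petap.pl \
    --ES --cores "$SLURM_CPUS_PER_TASK" --sequence genome.fasta
EOF
```

Keeping --cores tied to SLURM_CPUS_PER_TASK means the thread count only has to be changed in one place, the #SBATCH --cpus-per-task line.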


Here is an example of a job submission command:

sbatch ./sub.sh 
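
If it is useful to know right away which log files a job will write, the submission can be wrapped in a small function. The helper name submit_job is hypothetical; the sketch assumes the --output and --error patterns from the example sub.sh and relies on sbatch's --parsable option, which prints only the job ID (optionally followed by ";" and a cluster name).

```shell
#!/bin/bash
# Submit a script and print the job ID plus the log files SLURM will write.
# submit_job is a hypothetical helper; the log names follow the
# --output/--error patterns used in the example sub.sh.
submit_job() {
    local script=$1 jobid
    if ! command -v sbatch >/dev/null 2>&1; then
        echo "sbatch not found; run this on the cluster" >&2
        return 1
    fi
    # --parsable prints "jobid" or "jobid;cluster" instead of a sentence
    jobid=$(sbatch --parsable "$script") || return 1
    jobid=${jobid%%;*}
    echo "job $jobid: GeneMarkES.$jobid.out GeneMarkES.$jobid.err"
}

# Usage on the cluster:
#   submit_job ./sub.sh
```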

Documentation

ml genemarkes/4.33 
perl /usr/local/apps/gb/genemarkes/4.33/gmes_petap.pl 
# -------------------
Usage:  /usr/local/apps/gb/genemarkes/4.33/gmes_petap.pl  [options]  --sequence [filename]

GeneMark-ES Suite version 4.35
   includes transcript (GeneMark-ET) and protein (GeneMark-EP) based training and prediction

Input sequence/s should be in FASTA format

Algorithm options
  --ES           to run self-training
  --fungus       to run algorithm with branch point model (most useful for fungal genomes)
  --ET           [filename]; to run training with introns coordinates from RNA-Seq read alignments (GFF format)
  --EP           [filename]; to run training with introns coordinates from protein splice alighnmnet (GFF format)
  --et_score     [number]; 10 (default) minimum score of intron in initiation of the ET algorithm
  --ep_score     [number]; 4 (default) minimum score of intron in initiation of the EP algorithm
  --evidence     [filename]; to use in prediction external evidence (RNA or protein) mapped to genome
  --training     to run only training step
  --prediction   to run only prediction step
  --predict_with [filename]; predict genes using this file species specific parameters (bypass regular training and prediction steps)

Sequence pre-processing options
  --max_contig   [number]; 5000000 (default) will split input genomic sequence into contigs shorter then max_contig
  --min_contig   [number]; 50000 (default); will ignore contigs shorter then min_contig in training 
  --max_gap      [number]; 5000 (default); will split sequence at gaps longer than max_gap
                 Letters 'n' and 'N' are interpreted as standing within gaps 
  --max_mask     [number]; 5000 (default); will split sequence at repeats longer then max_mask
                 Letters 'x' and 'X' are interpreted as results of hard masking of repeats
  --soft_mask    [number] to indicate that lowercase letters stand for repeats; utilize only lowercase repeats longer than specified length

Run options
  --cores        [number]; 1 (default) to run program with multiple threads 
  --pbs          to run on cluster with PBS support
  --v            verbose

Customizing parameters:
  --max_intron          [number]; default 10000 (3000 fungi), maximum length of intron
  --max_intergenic      [number]; default 10000, maximum length of intergenic regions
  --min_gene_prediction [number]; default 300 (120 fungi) minimum allowed gene length in prediction step

Developer options:
  --usr_cfg      [filename]; to customize configuration file
  --ini_mod      [filename]; use this file with parameters for algorithm initiation
  --test_set     [filename]; to evaluate prediction accuracy on the given test set
  --key_bin
  --debug
# -------------------
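
The algorithm options above can be combined in a few common ways. The sketch below only prints candidate command lines and does not run the program; the helper gmes_cmd and the file names genome.fasta and introns.gff are hypothetical, and the flag combinations follow the usage text rather than any site-specific recommendation.

```shell
#!/bin/bash
GMES=/usr/local/apps/gb/genemarkes/4.33/gmes_petap.pl

# Print the gmes_petap.pl command line for a given mode (hypothetical helper).
#   es     : self-training
#   fungus : self-training with the branch point model
#   et, ep : training with intron coordinates from a GFF file
gmes_cmd() {
    local mode=$1 genome=$2 hints=${3:-}
    local -a args
    case "$mode" in
        es)     args=(--ES) ;;
        fungus) args=(--ES --fungus) ;;
        et)     args=(--ET "$hints") ;;
        ep)     args=(--EP "$hints") ;;
        *)      echo "unknown mode: $mode" >&2; return 1 ;;
    esac
    if [[ ( $mode == et || $mode == ep ) && -z $hints ]]; then
        echo "mode $mode needs an intron GFF file" >&2
        return 1
    fi
    echo "perl $GMES ${args[*]} --sequence $genome"
}

gmes_cmd es genome.fasta
gmes_cmd fungus genome.fasta
gmes_cmd et genome.fasta introns.gff
```

A printed command can be pasted into the last line of a submission script once the file names have been replaced with real ones.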


Installation

Source code is obtained from GeneMarkES

System

64-bit Linux