GeneMarkES-Teaching

From Research Computing Center Wiki

Latest revision as of 12:20, 15 August 2018

Category

Bioinformatics

Program On

Teaching

Version

4.33

Author / Distributor

GeneMarkES

Description

"Gene Prediction in Eukaryotes. Novel genomes can be analyzed by the program GeneMark-ES utilizing unsupervised training." More details are at GeneMarkES (http://exon.gatech.edu/GeneMark/).

Running Program

The latest version of this application is installed at /usr/local/apps/gb/genemarkes/4.33

To use this version, please load the module with

ml genemarkes/4.33 

Here is an example of a shell script, sub.sh, to run on the batch queue:

#!/bin/bash
#SBATCH --job-name=j_GeneMarkES
#SBATCH --partition=batch
#SBATCH --mail-type=ALL
#SBATCH --mail-user=username@uga.edu
#SBATCH --ntasks=1
#SBATCH --mem=10gb
#SBATCH --time=08:00:00
#SBATCH --output=GeneMarkES.%j.out
#SBATCH --error=GeneMarkES.%j.err

cd $SLURM_SUBMIT_DIR
ml genemarkes/4.33
perl /usr/local/apps/gb/genemarkes/4.33/gmes_petap.pl [options] --sequence [filename]

In a real submission script, the job-specific values above (such as the e-mail address, the time limit, and the program options) must be reviewed and replaced with values appropriate for your job.
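
As a minimal sketch of what a filled-in script might look like, the following writes sub.sh for a self-training run (--ES) on four cores. The input file name genome.fasta, the core count, and the use of --cpus-per-task to reserve those cores are illustrative assumptions, not values from this page; the memory and time limits are carried over unchanged from the example above and may need adjusting.

```shell
#!/bin/bash
# Hypothetical inputs: replace with your own genome file and core count.
GENOME="genome.fasta"
CORES=4

# Write the submission script. \$ keeps SLURM variables unexpanded until
# the job runs; ${CORES} and ${GENOME} are filled in now.
cat > sub.sh <<EOF
#!/bin/bash
#SBATCH --job-name=j_GeneMarkES
#SBATCH --partition=batch
#SBATCH --mail-type=ALL
#SBATCH --mail-user=username@uga.edu
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=${CORES}
#SBATCH --mem=10gb
#SBATCH --time=08:00:00
#SBATCH --output=GeneMarkES.%j.out
#SBATCH --error=GeneMarkES.%j.err

cd \$SLURM_SUBMIT_DIR
ml genemarkes/4.33
perl /usr/local/apps/gb/genemarkes/4.33/gmes_petap.pl --ES --cores ${CORES} --sequence ${GENOME}
EOF

# Syntax-check the generated script without running it.
bash -n sub.sh && echo "sub.sh syntax OK"
```

Keeping the --cores value equal to the number of CPUs requested from SLURM avoids running more threads than the job was allocated.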

Please refer to Running_Jobs_on_the_teaching_cluster, Run X window Jobs, and Run interactive Jobs for more details on running jobs on the teaching cluster.


Here is an example of a job submission command:

sbatch ./sub.sh 
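
The submission step can also be wrapped so that the job ID is captured and the matching log file names are printed. This is a sketch only: the function name submit_and_report is hypothetical, and it assumes the --output and --error patterns from the example script above.

```shell
# Submit a script with sbatch, then show its queue entry and log file names.
# Usage: submit_and_report ./sub.sh
submit_and_report() {
    local script="$1"
    if ! command -v sbatch >/dev/null 2>&1; then
        echo "sbatch not found; run this on a cluster login node" >&2
        return 1
    fi
    local job_id
    # --parsable makes sbatch print only the job ID (optionally "id;cluster").
    job_id=$(sbatch --parsable "$script") || return 1
    job_id=${job_id%%;*}   # drop the ";cluster" suffix if present
    echo "Submitted job ${job_id}"
    squeue -j "$job_id"
    echo "Output: GeneMarkES.${job_id}.out  Errors: GeneMarkES.${job_id}.err"
}
```

The %j in the #SBATCH --output and --error lines is replaced by the job ID, which is why the log names can be derived from it.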

Documentation

ml genemarkes/4.33 
perl /usr/local/apps/gb/genemarkes/4.33/gmes_petap.pl 
# -------------------
Usage:  /usr/local/apps/gb/genemarkes/4.33/gmes_petap.pl  [options]  --sequence [filename]

GeneMark-ES Suite version 4.35
   includes transcript (GeneMark-ET) and protein (GeneMark-EP) based training and prediction

Input sequence/s should be in FASTA format

Algorithm options
  --ES           to run self-training
  --fungus       to run algorithm with branch point model (most useful for fungal genomes)
  --ET           [filename]; to run training with introns coordinates from RNA-Seq read alignments (GFF format)
  --EP           [filename]; to run training with introns coordinates from protein splice alighnmnet (GFF format)
  --et_score     [number]; 10 (default) minimum score of intron in initiation of the ET algorithm
  --ep_score     [number]; 4 (default) minimum score of intron in initiation of the EP algorithm
  --evidence     [filename]; to use in prediction external evidence (RNA or protein) mapped to genome
  --training     to run only training step
  --prediction   to run only prediction step
  --predict_with [filename]; predict genes using this file species specific parameters (bypass regular training and prediction steps)

Sequence pre-processing options
  --max_contig   [number]; 5000000 (default) will split input genomic sequence into contigs shorter then max_contig
  --min_contig   [number]; 50000 (default); will ignore contigs shorter then min_contig in training 
  --max_gap      [number]; 5000 (default); will split sequence at gaps longer than max_gap
                 Letters 'n' and 'N' are interpreted as standing within gaps 
  --max_mask     [number]; 5000 (default); will split sequence at repeats longer then max_mask
                 Letters 'x' and 'X' are interpreted as results of hard masking of repeats
  --soft_mask    [number] to indicate that lowercase letters stand for repeats; utilize only lowercase repeats longer than specified length

Run options
  --cores        [number]; 1 (default) to run program with multiple threads 
  --pbs          to run on cluster with PBS support
  --v            verbose

Customizing parameters:
  --max_intron          [number]; default 10000 (3000 fungi), maximum length of intron
  --max_intergenic      [number]; default 10000, maximum length of intergenic regions
  --min_gene_prediction [number]; default 300 (120 fungi) minimum allowed gene length in prediction step

Developer options:
  --usr_cfg      [filename]; to customize configuration file
  --ini_mod      [filename]; use this file with parameters for algorithm initiation
  --test_set     [filename]; to evaluate prediction accuracy on the given test set
  --key_bin
  --debug
# -------------------
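
To illustrate how the algorithm and run options listed above combine, here is a small dry-run sketch that prints (but does not execute) a command line for three common modes. The function name gmes_cmd, the mode names, and the file names genome.fasta, fungal_genome.fasta, and introns.gff are hypothetical; only the flags themselves come from the usage text.

```shell
GMES="/usr/local/apps/gb/genemarkes/4.33/gmes_petap.pl"

# Print a gmes_petap.pl command line for a given mode.
#   mode:   es (self-training) | fungus (self-training, branch point model)
#           | et (training with RNA-Seq intron hints)
#   genome: FASTA file; cores: thread count (default 1)
#   hints:  intron coordinates in GFF format (required for mode "et")
gmes_cmd() {
    local mode="$1" genome="$2" cores="${3:-1}" hints="${4:-}"
    local opts
    case "$mode" in
        es)     opts="--ES" ;;
        fungus) opts="--ES --fungus" ;;
        et)
            if [ -z "$hints" ]; then
                echo "mode 'et' needs an intron GFF file" >&2
                return 1
            fi
            opts="--ET $hints" ;;
        *)
            echo "unknown mode: $mode" >&2
            return 1 ;;
    esac
    echo "perl $GMES $opts --cores $cores --sequence $genome"
}

gmes_cmd es genome.fasta 4
gmes_cmd fungus fungal_genome.fasta 4
gmes_cmd et genome.fasta 4 introns.gff
```

Printing the command first makes it easy to review the options before pasting the line into a submission script.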

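Because contigs shorter than --min_contig (50000 by default) are ignored in training, it can help to check sequence lengths before submitting a job. The sketch below counts the FASTA records in a file and how many fall below a threshold. The function name contig_summary and the file demo.fasta are hypothetical, and the check is only approximate, since the program may further split sequences at gaps and masked repeats.

```shell
# Report the number of FASTA records and how many are shorter than min_len.
# Usage: contig_summary genome.fasta [min_len]   (min_len defaults to 50000)
contig_summary() {
    local fasta="$1" min_len="${2:-50000}"
    awk -v min="$min_len" '
        /^>/ {                       # header line: close the previous record
            if (seen) { n++; if (len < min) short++ }
            seen = 1; len = 0; next
        }
        {                            # sequence line: strip whitespace, add length
            gsub(/[ \t\r]/, "")
            len += length($0)
        }
        END {                        # close the final record
            if (seen) { n++; if (len < min) short++ }
            printf "contigs=%d shorter_than_%d=%d\n", n, min, short
        }' "$fasta"
}

# Tiny demonstration file: one 10-base record and one 4-base record.
printf '>a\nACGTA\nCGTAC\n>b\nACGT\n' > demo.fasta
contig_summary demo.fasta 5
```

If most sequences fall below the threshold, lowering --min_contig may be worth considering, at the cost of training on shorter fragments.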

Installation

Source code was obtained from GeneMarkES

System

64-bit Linux