GeneMarkES-Sapelo2: Difference between revisions

Latest revision as of 12:23, 24 February 2025

Program On

Sapelo2

Version

4.71

Author / Distributor

GeneMarkES (http://exon.gatech.edu/GeneMark/)

Description

"Gene Prediction in Eukaryotes. Novel genomes can be analyzed by the program GeneMark-ES utilizing unsupervised training." More details are at GeneMarkES

Running Program

Also refer to Running Jobs on Sapelo2

To use GeneMark, you need to download a key file and place it in your home directory. Please follow the instructions below:

1) Go to http://topaz.gatech.edu/GeneMark/license_download.cgi, fill in the requested fields, and read the license text. Note: you can select any tool and platform, e.g. GeneMark-ES/ET/EP+ ver 4.72_lic and LINUX64 kernel 3.10-5.

2) After pressing the "I agree ..." button, you will be redirected to a download page. You only need to download the 64-bit key file. The key file gm_key_64.gz will be downloaded to your local drive. Then transfer it to the cluster using a transfer node or Globus. Please refer to Transferring Files.

3) The downloaded key file is in gzip format. On the cluster, unpack it with gunzip and move it to ~/.gm_key (the name must be exactly this, with a leading dot), for example:

gunzip gm_key_64.gz
mv gm_key_64 ~/.gm_key

where ~ is your home directory, i.e., /home/MyID. Once the .gm_key file is in your home directory, you should be able to run GeneMark.
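
Before submitting a job, it can help to confirm that the key file is actually in place. The following is a minimal sketch, not part of GeneMark: check_gm_key is a hypothetical helper name, and the check only confirms that the file exists and is not empty. It does not validate the license itself.

```shell
#!/bin/bash
# Hypothetical helper (not part of GeneMark): report whether a key file
# exists and is non-empty. Defaults to ~/.gm_key when no path is given.
check_gm_key() {
    key_file="${1:-$HOME/.gm_key}"
    if [ -s "$key_file" ]; then
        echo "key found: $key_file"
    else
        echo "key missing or empty: $key_file" >&2
        return 1
    fi
}

# Check the default location and print a reminder if the key is absent.
check_gm_key || echo "Download the key and move it to ~/.gm_key first."
```

If the check fails, repeat steps 1) to 3) above.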

Version 4.71

Version 4.71 is installed at:

/apps/eb/GeneMark-ET/4.71-GCCcore-11.3.0/
/apps/eb/GeneMark-ET/4.71-GCCcore-11.2.0/

To use it, please load the module with:

module load GeneMark-ET/4.71-GCCcore-11.3.0

or

module load GeneMark-ET/4.71-GCCcore-11.2.0

Here is an example of a shell script, sub.sh, to run it on the batch queue:

#!/bin/bash
#SBATCH --job-name=geneMarkJob
#SBATCH --partition=batch
#SBATCH --mail-type=ALL
#SBATCH --mail-user=username@uga.edu
#SBATCH --ntasks=1
#SBATCH --mem=10gb
#SBATCH --time=08:00:00
#SBATCH --output=log.%j.out
#SBATCH --error=log.%j.err

cd $SLURM_SUBMIT_DIR

module load GeneMark-ET/4.71-GCCcore-11.3.0

gmes_petap.pl [options]

In a real submission script, review at least the values for --mail-user, --ntasks, --mem, and --time, and replace them with values appropriate for your job.

Here is an example of a job submission command:

sbatch ./sub.sh
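
The [options] placeholder in sub.sh has to be replaced by a real command line. As a rough sketch, a GeneMark-ES self-training run could be assembled as shown below, using only the --ES, --sequence, and --cores options from the usage text. The helper name build_gmes_cmd and the file name genome.fasta are hypothetical. The example also assumes that, to use more than one thread, the job requests CPUs with #SBATCH --cpus-per-task, in which case Slurm sets SLURM_CPUS_PER_TASK inside the job.

```shell
#!/bin/bash
# Hypothetical helper: print a GeneMark-ES command line for review.
# $1 = genome FASTA file, $2 = number of threads (defaults to 1).
build_gmes_cmd() {
    genome="$1"
    cores="${2:-1}"
    printf 'gmes_petap.pl --ES --sequence %s --cores %s\n' "$genome" "$cores"
}

# genome.fasta is a placeholder name; fall back to 1 thread outside a Slurm job.
build_gmes_cmd genome.fasta "${SLURM_CPUS_PER_TASK:-1}"
```

The printed line can be pasted into sub.sh in place of gmes_petap.pl [options] once the file name and thread count look right.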

Documentation

module load GeneMark-ET/4.71-GCCcore-11.3.0
gmes_petap.pl 

# -------------------
Usage:  /apps/eb/GeneMark-ET/4.71-GCCcore-11.3.0/gmes_petap.pl  [options]  --sequence [filename]

GeneMark-ES Suite version 4.71_lic
Suite includes GeneMark.hmm, GeneMark-ES, GeneMark-ET and GeneMark-EP algorithms.

Input sequence/s should be in FASTA format.

Select one of the gene prediction algorithms:

  To run GeneMark-ES self-training algorithm
    --ES

  To run GeneMark-ET with hints from transcriptome splice alignments
    --ET           [filename]; file with intron coordinates from RNA-Seq read splice alignment in GFF format
    --et_score     [number]; default 10; minimum score of intron in initiation of the ET algorithm

  To run GeneMark-EP with hints from protein splice alignments
    --EP           
    --dbep         [filename]; file with protein database in FASTA format
    --ep_score     [number,number]; default 4,0.25; minimum score of intron in initiation of the EP algorithm
    or
    --EP           [filename]; file with intron coordinates from protein splice alignment in GFF format

  To run GeneMark.hmm predictions using previously derived model
    --predict_with [filename]; file with species specific gene prediction parameters

  To run ES, ET or EP with branch point model. This option is most useful for fungal genomes
    --fungus

  To run hmm, ES, ET or EP in PLUS mode (prediction with hints)
    --evidence     [filename]; file with hints in GFF format

  To run algorithms with alternative genetic codes
    --gcode      [number]; default 1; supported 1 and 6/26

Output formatting options:
  --format       [label]; default GTF; output gene prediction in GTF of GFF3 format
  --work_dir     [folder name]; default current working directory .;

Masking option
  --soft_mask    [number] or [auto]; default auto; to indicate that lowercase letters stand for repeats;
                 algorithm hard masks only lowercase repeats longer than specified length
                 In 'auto' mode hard masking threshold is selected by algorithm based on the size of the input genome
  --mask_penalty [number] or [auto]; default 0.03;

Run options
  --cores        [number]; default 1; to run program with multiple threads
  --pbs          to run on cluster with PBS support
  --v            verbose

Optional sequence pre-processing parameters
  --max_contig   [number]; default 5000000; will split input genomic sequence into contigs shorter then max_contig
  --min_contig   [number]; default 50000 (10000 fungi); 
                 will ignore contigs shorter than min_contig in training 
  --max_gap      [number]; default 5000; will split sequence at gaps longer than max_gap
                 Letters 'n' and 'N' are interpreted as standing within gaps 
  --max_mask     [number]; default 5000; will split sequence at repeats longer then max_mask
                 Letters 'x' and 'X' are interpreted as results of hard masking of repeats

Optional parameters
  --max_intron            [number]; default 10000 (3000 fungi); maximum length of intron
  --max_intergenic        [number]; default 50000 (10000 fungi); maximum length of intergenic regions
  --min_contig_in_predict [number]; default 500; minimum allowed length of contig in prediction step
  --min_gene_in_predict   [number]; default 300 (120 fungi); minimum allowed gene length in prediction step
  --gc_donor              [value];  default 0.001; transition probability to GC donor in the range 0..1; 
                          'off' switches GC donor model OFF
  --gc3          [number]; GC3 cutoff in training for grasses

Developer options
  --training     to run only training step of algorithm; applicable to ES, ET or EP
  --prediction   to run only prediction step of algorithms using species parameters from previously executed training; applicable to ES, ET or EP
  --usr_cfg      [filename]; use custom configuration from this file
  --ini_mod      [filename]; use this file with parameters for algorithm initiation
  --key_bin
  --debug
# -------------------
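
The usage text above lists three training algorithms with different required inputs. The sketch below maps an algorithm name to the matching flags, following the usage text: --ES takes no file, --ET takes an intron GFF file, and --EP is paired with --dbep and a protein FASTA file. The helper name gmes_mode_flags and the file names are hypothetical, and this is only one way to organize the choice.

```shell
#!/bin/bash
# Hypothetical helper: print the algorithm flags from the usage text.
# $1 = ES | ET | EP ; $2 = intron GFF (ET) or protein FASTA (EP).
gmes_mode_flags() {
    case "$1" in
        ES) echo "--ES" ;;
        ET) [ -n "$2" ] || { echo "ET needs an intron GFF file" >&2; return 1; }
            echo "--ET $2" ;;
        EP) [ -n "$2" ] || { echo "EP needs a protein FASTA file" >&2; return 1; }
            echo "--EP --dbep $2" ;;
        *)  echo "unknown mode: $1" >&2; return 1 ;;
    esac
}

# Placeholder file names; replace with your own inputs.
gmes_mode_flags ES
gmes_mode_flags ET introns.gff
gmes_mode_flags EP proteins.fasta
```

The printed flags go on the gmes_petap.pl command line together with --sequence and, for fungal genomes, --fungus.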


Installation

Source code from GeneMarkES

System

64-bit Linux

