InterProScan-Sapelo2: Difference between revisions

Revision as of 12:40, 6 September 2023

Program On

Sapelo2

Version

5.63-95.0

Author / Distributor

To cite:

Sarah Hunter, Rolf Apweiler, Teresa K. Attwood, Amos Bairoch, Alex Bateman, David Binns, Peer Bork, Ujjwal Das, Louise Daugherty, Lauranne Duquenne, Robert D. Finn, Julian Gough, Daniel Haft, Nicolas Hulo, Daniel Kahn, Elizabeth Kelly, Aurélie Laugraud, Ivica Letunic, David Lonsdale, Rodrigo Lopez, Martin Madera, John Maslen, Craig McAnulla, Jennifer McDowall, Jaina Mistry, Alex Mitchell, Nicola Mulder, Darren Natale, Christine Orengo, Antony F. Quinn, Jeremy D. Selengut, Christian J. A. Sigrist, Manjula Thimma, Paul D. Thomas, Franck Valentin, Derek Wilson, Cathy H. Wu, and Corin Yeats (2009) InterPro: the integrative protein signature database

Nucleic Acids Res. 37 (Database Issue):D211-D215.

Description

"InterProScan is a tool that combines different protein signature recognition methods into one resource. The number of signature databases and their associated scanning tools, as well as the further refinement procedures, increases the complexity of the problem." More details are at EMBL-EBI.

Running Program

Also refer to Running Jobs on Sapelo2.

For more information on Environment Modules on Sapelo2 please see the Lmod page.

Version 5.63-95.0, installed in /apps/eb/InterProScan/5.63-95.0-foss-2022a
PANTHER data 17.0, installed in /apps/eb/InterProScan/5.63-95.0-foss-2022a/data/panther/17.0

To use version 5.63-95.0, please first load the module with

module load InterProScan/5.63-95.0-foss-2022a

Please note: Use full paths to the files in your working folder. Otherwise, InterProScan will try to access and write to /apps/eb, which will cause an I/O permission error.
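As a minimal sketch of the full-path advice, the snippet below builds absolute paths for the input, output, and temporary folders before calling interproscan.sh. The names proteins.fasta and ipr_out are hypothetical, and passing -T to relocate the temporary folder is an assumption based on the help text (which gives temp/ as the default location), not a documented Sapelo2 requirement. The run_ipr helper is illustrative only: it prints the command instead of running it when interproscan.sh is not on the PATH (that is, when the module is not loaded).

```shell
#!/bin/bash
# Sketch only: proteins.fasta and ipr_out are hypothetical names.
WORKDIR="$(pwd -P)"                 # absolute path of the current working folder
INPUT="$WORKDIR/proteins.fasta"     # hypothetical input FASTA file
OUTDIR="$WORKDIR/ipr_out"           # hypothetical output folder
IPR_TMP="$OUTDIR/temp"              # assumption: keep temporary files in your own folder

mkdir -p "$OUTDIR" "$IPR_TMP"

# Illustrative helper: run InterProScan if the module is loaded,
# otherwise only print the command that would be executed.
run_ipr() {
    if command -v interproscan.sh >/dev/null 2>&1; then
        sh interproscan.sh "$@"
    else
        echo "Would run: sh interproscan.sh $*"
    fi
}

# -i, -d, -T and -f are all listed in the interproscan.sh help output
run_ipr -i "$INPUT" -d "$OUTDIR" -T "$IPR_TMP" -f tsv
```

Every path handed to InterProScan here starts with /, so nothing is resolved relative to the installation directory.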

Sample job submission script (sub.sh) to run interproscan.sh of version 5.63-95.0 in a batch job:

#!/bin/bash
#SBATCH --job-name=job_InterProScan
#SBATCH --partition=batch            
#SBATCH --ntasks=1                  	
#SBATCH --cpus-per-task=4
#SBATCH --mem=20gb                    
#SBATCH --time=120:00:00           
#SBATCH --output=log.%j.out     
#SBATCH --error=log.%j.err          
#SBATCH --mail-user=username@uga.edu  
#SBATCH --mail-type=ALL   

cd $SLURM_SUBMIT_DIR

module load InterProScan/5.63-95.0-foss-2022a

sh interproscan.sh [options]

where [options] needs to be replaced by the options (command and arguments) you want to use. Other parameters of the job, such as the maximum wall clock time, maximum memory, the number of cores per node, and the job name, need to be modified appropriately as well.

Please omit the -cpu option. Use the queueing system option --cpus-per-task to request more than one core within a single node; InterProScan will be able to use those cores. You can also use the -dp option to disable the use of the precalculated match lookup service.
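One way the command section of sub.sh might look once [options] is filled in is sketched below; it is meant to follow the module load line of the sample script. The file names proteins.fasta and proteins_ipr are hypothetical, and run_ipr is an illustrative helper that only prints the command when interproscan.sh is not on the PATH. No -cpu option is passed; the core count comes from the --cpus-per-task request and is echoed for the log only.

```shell
#!/bin/bash
# Sketch of the command section of sub.sh (the #SBATCH header is unchanged).
# proteins.fasta and proteins_ipr are hypothetical names.

# Fall back to the current folder when run outside of a Slurm job
SUBMIT_DIR="${SLURM_SUBMIT_DIR:-$(pwd -P)}"
cd "$SUBMIT_DIR" || exit 1

# Cores granted through --cpus-per-task; recorded in the job log only
CORES="${SLURM_CPUS_PER_TASK:-1}"
echo "Cores allocated by Slurm: $CORES"

INPUT="$SUBMIT_DIR/proteins.fasta"      # hypothetical input FASTA file
OUTBASE="$SUBMIT_DIR/proteins_ipr"      # hypothetical output file base (-b)

# Illustrative helper: run InterProScan if the module is loaded,
# otherwise only print the command that would be executed.
run_ipr() {
    if command -v interproscan.sh >/dev/null 2>&1; then
        sh interproscan.sh "$@"
    else
        echo "Would run: sh interproscan.sh $*"
    fi
}

# -dp disables the precalculated match lookup service, so all
# match calculations run locally on the allocated cores
run_ipr -i "$INPUT" -b "$OUTBASE" -f tsv,gff3 -dp
```

The -b option is used on its own here because the help text states that -b, -d and -o are mutually exclusive.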

Example of job submission

sbatch sub.sh

Documentation

The user guide is available at the InterProScan Wiki.

To get help info:

[cft07037@d2-13 all]$ ml InterProScan/5.63-95.0-foss-2022a
[cft07037@d2-13 all]$ sh interproscan.sh
06/09/2023 13:38:59:316 Welcome to InterProScan-5.63-95.0
06/09/2023 13:38:59:317 Running InterProScan v5 in STANDALONE mode... on Linux
usage: java -XX:+UseParallelGC -XX:ParallelGCThreads=2 -XX:+AggressiveOpts -XX:+UseFastAccessorMethods -Xms128M
-Xmx2048M -jar interproscan-5.jar

Please give us your feedback by sending an email to

interhelp@ebi.ac.uk

-appl,--applications <ANALYSES> Optional, comma separated list of analyses. If this option
    is not set, ALL analyses will be run.
-b,--output-file-base <OUTPUT-FILE-BASE> Optional, base output filename (relative or absolute path).
    Note that this option, the --output-dir (-d) option and the
    --outfile (-o) option are mutually exclusive. The
    appropriate file extension for the output format(s) will be
    appended automatically. By default the input file path/name
    will be used.
-cpu,--cpu <CPU> Optional, number of cores for inteproscan.
-d,--output-dir <OUTPUT-DIR> Optional, output directory. Note that this option, the
    --outfile (-o) option and the --output-file-base (-b) option
    are mutually exclusive. The output filename(s) are the same
    as the input filename, with the appropriate file extension(s)
    for the output format(s) appended automatically .
-dp,--disable-precalc Optional. Disables use of the precalculated match lookup
    service. All match calculations will be run locally.
-dra,--disable-residue-annot Optional, excludes sites from the XML, JSON output
-etra,--enable-tsv-residue-annot Optional, includes sites in TSV output
-exclappl,--excl-applications <EXC-ANALYSES> Optional, comma separated list of analyses you want to
    exclude.
-f,--formats <OUTPUT-FORMATS> Optional, case-insensitive, comma separated list of output
    formats. Supported formats are TSV, XML, JSON, and GFF3.
    Default for protein sequences are TSV, XML and GFF3, or for
    nucleotide sequences GFF3 and XML.
-goterms,--goterms Optional, switch on lookup of corresponding Gene Ontology
    annotation (IMPLIES -iprlookup option)
-help,--help Optional, display help information
-i,--input <INPUT-FILE-PATH> Optional, path to fasta file that should be loaded on Master
    startup. Alternatively, in CONVERT mode, the InterProScan 5
    XML file to convert.
-incldepappl,--incl-dep-applications <INC-DEP-ANALYSES> Optional, comma separated list of deprecated analyses that
    you want included. If this option is not set, deprecated
    analyses will not run.
-iprlookup,--iprlookup Also include lookup of corresponding InterPro annotation in
    the TSV and GFF3 output formats.
-ms,--minsize <MINIMUM-SIZE> Optional, minimum nucleotide size of ORF to report. Will only
    be considered if n is specified as a sequence type. Please be
    aware of the fact that if you specify a too short value it
    might be that the analysis takes a very long time!
-o,--outfile <EXPLICIT_OUTPUT_FILENAME> Optional explicit output file name (relative or absolute
    path). Note that this option, the --output-dir (-d) option
    and the --output-file-base (-b) option are mutually
    exclusive. If this option is given, you MUST specify a single
    output format using the -f option. The output file name will
    not be modified. Note that specifying an output file name
    using this option OVERWRITES ANY EXISTING FILE.
-pa,--pathways Optional, switch on lookup of corresponding Pathway
    annotation (IMPLIES -iprlookup option)
-t,--seqtype <SEQUENCE-TYPE> Optional, the type of the input sequences (dna/rna (n) or
    protein (p)). The default sequence type is protein.
-T,--tempdir <TEMP-DIR> Optional, specify temporary file directory (relative or
    absolute path). The default location is temp/.
-verbose,--verbose Optional, display more verbose log output
-version,--version Optional, display version number
-vl,--verbose-level <VERBOSE-LEVEL> Optional, display verbose log output at level specified.
-vtsv,--output-tsv-version Optional, includes a TSV version file along with any TSV
    output (when TSV output requested)
Copyright © EMBL European Bioinformatics Institute, Hinxton, Cambridge, UK. (http://www.ebi.ac.uk) The InterProScan
software itself is provided under the Apache License, Version 2.0 (http://www.apache.org/licenses/LICENSE-2.0.html).
Third party components (e.g. member database binaries and models) are subject to separate licensing - please see the
individual member database websites for details.

Available analyses:
FunFam (4.3.0) : Prediction of functional annotations for novel, uncharacterized sequences.
SFLD (4) : SFLD is a database of protein families based on hidden Markov models (HMMs).
Phobius (1.01) : A combined transmembrane topology and signal peptide predictor.
SignalP_GRAM_NEGATIVE (4.1) : SignalP (gram-negative) predicts the presence and location of signal peptide cleavage sites in amino acid sequences for gram-negative prokaryotes.
PANTHER (17.0) : The PANTHER (Protein ANalysis THrough Evolutionary Relationships) Classification System is a unique resource that classifies genes by their functions, using published scientific experimental evidence and evolutionary relationships to predict function even in the absence of direct experimental evidence.
Gene3D (4.3.0) : Structural assignment for whole genes and genomes using the CATH domain structure database.
Hamap (2023_01) : High-quality Automated and Manual Annotation of Microbial Proteomes.
PRINTS (42.0) : A compendium of protein fingerprints - a fingerprint is a group of conserved motifs used to characterise a protein family.
ProSiteProfiles (2022_05) : PROSITE consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them.
Coils (2.2.1) : Prediction of coiled coil regions in proteins.
SUPERFAMILY (1.75) : SUPERFAMILY is a database of structural and functional annotations for all proteins and genomes.
SMART (9.0) : SMART allows the identification and analysis of domain architectures based on hidden Markov models (HMMs).
CDD (3.20) : CDD predicts protein domains and families based on a collection of well-annotated multiple sequence alignment models.
PIRSR (2021_05) : PIRSR is a database of protein families based on hidden Markov models (HMMs) and Site Rules.
ProSitePatterns (2022_05) : PROSITE consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them.
AntiFam (7.0) : AntiFam is a resource of profile-HMMs designed to identify spurious protein predictions.
SignalP_EUK (4.1) : SignalP (eukaryotes) predicts the presence and location of signal peptide cleavage sites in amino acid sequences for eukaryotes.
Pfam (35.0) : A large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs).
MobiDBLite (2.0) : Prediction of intrinsically disordered regions in proteins.
SignalP_GRAM_POSITIVE (4.1) : SignalP (gram-positive) predicts the presence and location of signal peptide cleavage sites in amino acid sequences for gram-positive prokaryotes.
PIRSF (3.10) : The PIRSF concept is used as a guiding principle to provide comprehensive and non-overlapping clustering of UniProtKB sequences into a hierarchical order to reflect their evolutionary relationships.
TMHMM (2.0c) : Prediction of transmembrane helices in proteins.
NCBIfam (12.0) : NCBIfam is a collection of protein families based on Hidden Markov Models (HMMs).
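The options and analysis names listed above can be combined to run only a subset of analyses. The sketch below is illustrative: proteins.fasta and the output file name are hypothetical, the three analyses are an arbitrary choice taken from the "Available analyses" list, and run_ipr is a hypothetical helper that only prints the command when interproscan.sh is not on the PATH. Analysis names are written exactly as printed in the list, since the help text does not say whether they are case-sensitive.

```shell
#!/bin/bash
# Sketch only: file names and the chosen analyses are illustrative.
WORKDIR="$(pwd -P)"                            # absolute path of the working folder
INPUT="$WORKDIR/proteins.fasta"                # hypothetical input FASTA file
OUTFILE="$WORKDIR/proteins.selected.tsv"       # hypothetical explicit output file

# Comma separated analysis names, as printed under "Available analyses"
APPLS="Pfam,PANTHER,SMART"

# Illustrative helper: run InterProScan if the module is loaded,
# otherwise only print the command that would be executed.
run_ipr() {
    if command -v interproscan.sh >/dev/null 2>&1; then
        sh interproscan.sh "$@"
    else
        echo "Would run: sh interproscan.sh $*"
    fi
}

# -o needs exactly one format given with -f and overwrites an existing OUTFILE;
# -goterms and -pa add Gene Ontology and pathway lookups (both imply -iprlookup)
run_ipr -i "$INPUT" -appl "$APPLS" -f tsv -o "$OUTFILE" -goterms -pa
```

Because -o, -d and -b are mutually exclusive, only -o appears here; to get several output formats at once, -b or -d would be used instead.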

Installation

The tarball was downloaded from http://www.ebi.ac.uk/interpro/download/

System

64-bit Linux

@@ Line 10: / Line 10: @@
 === Version ===
-5.44-90.0, 5.51-85.0
+5.63-95.0
 === Author / Distributor ===
@@ Line 33: / Line 33: @@
 For more information on Environment Modules on Sapelo2 please see the [[Lmod]] page.
-* Version 5.44-90.0, installed in /apps/eb/InterProScan/5.44-79.0-foss-2019b
+* Version 5.63-95.0, installed in /apps/eb/InterProScan/5.63-95.0-foss-2022a
-* panther data 14.1, installed in /apps/eb/InterProScan/5.44-79.0-foss-2019b/data/panther/14.1
+* panther data 17.0, installed in /apps/eb/InterProScan/5.63-95.0-foss-2022a/data/panther/17.0
-* Version 5.51-85.0, installed in /apps/eb/InterProScan/5.51-85.0-foss-2019b
-* panther data 15.0, installed in /apps/eb/InterProScan/5.51-85.0-foss-2019b/data/panther/15.0
+To use version 5.63-95.0, please first load the module with
-To use version 5.44-90.0, please first load the module with
 <pre class="gscript">
-module load InterProScan/5.44-79.0-foss-2019b
+module load InterProScan/5.63-95.0-foss-2022a
-</pre>
-To use version 5.51-85.0, please first load the module with
-<pre class="gscript">
-module load InterProScan/5.51-85.0-foss-2019b
 </pre>
@@ Line 53: / Line 45: @@
-Sample job submission script (sub.sh) to run interproscan.sh of version 5.51-85.0 in a batch job:
+Sample job submission script (sub.sh) to run interproscan.sh of version 5.63-95.0 in a batch job:
 <pre class="gscript">
@@ Line 70: / Line 62: @@
 cd $SLURM_SUBMIT_DIR
-module load InterProScan/5.51-85.0-foss-2019b
+module load InterProScan/5.63-95.0-foss-2022a
 sh interproscan.sh [options]
@@ Line 81: / Line 73: @@
 Example of job submission
-<pre  class="gcommand">
+<pre class="gcommand">
-sbacth sub.sh
+sbatch sub.sh
 </pre>
@@ Line 89: / Line 81: @@
 To get help info:
-<pre class="gcommand">
+<pre class="gcommand">[cft07037@d2-13 all]$ ml InterProScan/5.63-95.0-foss-2022a
-ml InterProScan/5.51-85.0-foss-2019b
+[cft07037@d2-13 all]$ sh interproscan.sh
-sh interproscan.sh
+06/09/2023 13:38:59:316 Welcome to InterProScan-5.63-95.0
+06/09/2023 13:38:59:317 Running InterProScan v5 in STANDALONE mode... on Linux
-/04/2021 10:38:27:539 Welcome to InterProScan-5.51-85.0
-/04/2021 10:38:27:542 Running InterProScan v5 in STANDALONE mode... on Linux
 usage: java -XX:+UseParallelGC -XX:ParallelGCThreads=2 -XX:+AggressiveOpts -XX:+UseFastAccessorMethods -Xms128M
              -Xmx2048M -jar interproscan-5.jar
@@ Line 124: / Line 114: @@
                                                             exclude.
   -f,--formats <OUTPUT-FORMATS>                             Optional, case-insensitive, comma separated list of output
-                                                            formats. Supported formats are TSV, XML, JSON, GFF3, HTML and
+                                                            formats. Supported formats are TSV, XML, JSON, and GFF3.
-                                                            SVG. Default for protein sequences are TSV, XML and GFF3, or
+                                                            Default for protein sequences are TSV, XML and GFF3, or for
-                                                            for nucleotide sequences GFF3 and XML.
+                                                            nucleotide sequences GFF3 and XML.
   -goterms,--goterms                                        Optional, switch on lookup of corresponding Gene Ontology
                                                             annotation (IMPLIES -iprlookup option)
@@ Line 166: / Line 156: @@
 Available analyses:
-                      TIGRFAM (15.0) : TIGRFAMs are protein families based on hidden Markov models (HMMs).
+                       FunFam (4.3.0) : Prediction of functional annotations for novel, uncharacterized sequences.
                           SFLD (4) : SFLD is a database of protein families based on hidden Markov models (HMMs).
                        Phobius (1.01) : A combined transmembrane topology and signal peptide predictor.
          SignalP_GRAM_NEGATIVE (4.1) : SignalP (gram-negative) predicts the presence and location of signal peptide cleavage sites in amino acid sequences for gram-negative prokaryotes.
-                  SUPERFAMILY (1.75) : SUPERFAMILY is a database of structural and functional annotations for all proteins and genomes.
+                       PANTHER (17.0) : The PANTHER (Protein ANalysis THrough Evolutionary Relationships) Classification System is a unique resource that classifies genes by their functions, using published scientific experimental evidence and evolutionary relationships to predict function even in the absence of direct experimental evidence.
-                       PANTHER (15.0) : The PANTHER (Protein ANalysis THrough Evolutionary Relationships) Classification System is a unique resource that classifies genes by their functions, using published scientific experimental evidence and evolutionary relationships to predict function even in the absence of direct experimental evidence.
                         Gene3D (4.3.0) : Structural assignment for whole genes and genomes using the CATH domain structure database.
-                         Hamap (2020_05) : High-quality Automated and Manual Annotation of Microbial Proteomes.
+                         Hamap (2023_01) : High-quality Automated and Manual Annotation of Microbial Proteomes.
-               ProSiteProfiles (2019_11) : PROSITE consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them.
+                       PRINTS (42.0) : A compendium of protein fingerprints - a fingerprint is a group of conserved motifs used to characterise a protein family.
+               ProSiteProfiles (2022_05) : PROSITE consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them.
                          Coils (2.2.1) : Prediction of coiled coil regions in proteins.
-                         SMART (7.1) : SMART allows the identification and analysis of domain architectures based on hidden Markov models (HMMs).
+                  SUPERFAMILY (1.75) : SUPERFAMILY is a database of structural and functional annotations for all proteins and genomes.
-                           CDD (3.18) : CDD predicts protein domains and families based on a collection of well-annotated multiple sequence alignment models.
+                         SMART (9.0) : SMART allows the identification and analysis of domain architectures based on hidden Markov models (HMMs).
-                       PRINTS (42.0) : A compendium of protein fingerprints - a fingerprint is a group of conserved motifs used to characterise a protein family.
+                           CDD (3.20) : CDD predicts protein domains and families based on a collection of well-annotated multiple sequence alignment models.
-                         PIRSR (2021_02) : PIRSR is a database of protein families based on hidden Markov models (HMMs) and Site Rules.
+                         PIRSR (2021_05) : PIRSR is a database of protein families based on hidden Markov models (HMMs) and Site Rules.
-               ProSitePatterns (2019_11) : PROSITE consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them.
+               ProSitePatterns (2022_05) : PROSITE consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them.
+                      AntiFam (7.0) : AntiFam is a resource of profile-HMMs designed to identify spurious protein predictions.
                    SignalP_EUK (4.1) : SignalP (eukaryotes) predicts the presence and location of signal peptide cleavage sites in amino acid sequences for eukaryotes.
-                          Pfam (33.1) : A large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs).
+                          Pfam (35.0) : A large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs).
                     MobiDBLite (2.0) : Prediction of intrinsically disordered regions in proteins.
          SignalP_GRAM_POSITIVE (4.1) : SignalP (gram-positive) predicts the presence and location of signal peptide cleavage sites in amino acid sequences for gram-positive prokaryotes.
                          PIRSF (3.10) : The PIRSF concept is used as a guiding principle to provide comprehensive and non-overlapping clustering of UniProtKB sequences into a hierarchical order to reflect their evolutionary relationships.
                          TMHMM (2.0c) : Prediction of transmembrane helices in proteins.
+                      NCBIfam (12.0) : NCBIfam is a collection of protein families based on Hidden Markov Models (HMMs).
 </pre>
 [[#top|Back to Top]]
