Bioinformatics Databases: Difference between revisions

Revision as of 14:22, 13 November 2024

As part of our services, the GACRC builds and hosts local copies of frequently cited application data and helps GACRC members share data. Datasets are located in the commonly shared "/db" filesystem. NCBI BLAST datasets are pre-formatted to work with BLAST and BLAST+; they are located in "/db/ncbiblast/" and organized by date.

NCBI BLAST datasets can be loaded in the same way as software modules. This lets users replicate results by always using a time-stamped version of a database. The time stamp corresponds to the download date of the dataset. To search for available NCBI BLAST dataset modules, run the command:

ml spider ncbiblastdb

You can then load a database with the module load command. For example, to load the databases time-stamped 02/02/2024:

module load ncbiblastdb/20240202

Loading this module sets the environment variable BLASTDB to /db/ncbiblast/20240202, which tells BLAST+ to search that directory for databases. You can then refer to a database by name alone (nt, nr, etc.). Here is an example of a shell script that runs BLAST+ on the batch queue using the nt database:

#!/bin/bash
#SBATCH --job-name=j_BLAST+
#SBATCH --partition=batch
#SBATCH --mail-type=ALL
#SBATCH --mail-user=username@uga.edu
#SBATCH --ntasks=1
#SBATCH --mem=10gb
#SBATCH --time=08:00:00
#SBATCH --output=BLAST+.%j.out
#SBATCH --error=BLAST+.%j.err

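# Run from the directory the job was submitted from
cd "$SLURM_SUBMIT_DIR"
# Load BLAST+ and a time-stamped database module (sets BLASTDB)
module load BLAST+/2.13.0-gompi-2022a
module load ncbiblastdb/20240202
# BLAST+ finds nt through BLASTDB, so the name alone is enough
blastn -query example.fasta -out results.out -db nt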

In your actual submission script, choose Slurm header values that suit your job.
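Because the database name is resolved through BLASTDB, a job that forgets to load the database module can fail in a way that is not obvious from the BLAST error alone. The following is a minimal sketch of a guard that could be placed in a job script; the function name `check_blastdb` is hypothetical and not part of BLAST+ or the module system.

```shell
# Hypothetical helper (not provided by BLAST+ or the GACRC modules):
# fail early if BLASTDB is unset or does not point to a directory.
check_blastdb() {
    if [ -z "${BLASTDB:-}" ]; then
        echo "BLASTDB is not set; load an ncbiblastdb module first" >&2
        return 1
    fi
    if [ ! -d "$BLASTDB" ]; then
        echo "BLASTDB points to a missing directory: $BLASTDB" >&2
        return 1
    fi
    # Printing the directory records the database version in the job output.
    echo "Using BLAST databases in $BLASTDB"
}
```

Calling `check_blastdb || exit 1` just before the blastn line would stop the job early on a missing module and otherwise leave the database directory in the job's output file.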

NCBI's nr and nt datasets are the most commonly used at the GACRC. GACRC staff update these datasets every 3 months, along with the other NCBI BLAST datasets: cdd_delta, human_genome, mouse_genome, nrte, refseq_protein, refseq_rna, swissprot, and taxdb.
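Since a new snapshot arrives roughly every 3 months, it can be useful to see which dated snapshots exist on disk. The sketch below assumes that snapshots are stored as directories named YYYYMMDD directly under /db/ncbiblast, which is what the BLASTDB value /db/ncbiblast/20240202 suggests but is not stated explicitly. The function names are hypothetical.

```shell
# Hypothetical helpers; assume snapshot directories are named YYYYMMDD.

# List snapshot stamps under a root directory (default: /db/ncbiblast).
list_snapshots() {
    root=${1:-/db/ncbiblast}
    for d in "$root"/[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]; do
        # Skip the unexpanded pattern and anything that is not a directory.
        [ -d "$d" ] || continue
        basename "$d"
    done | sort
}

# Print the most recent stamp (YYYYMMDD sorts chronologically as text).
latest_snapshot() {
    list_snapshots "$1" | tail -n 1
}
```

With no argument, `latest_snapshot` looks in /db/ncbiblast. For reproducible results it is usually safer to write the stamp explicitly in the job script and use a helper like this only to discover what is available; `ml spider ncbiblastdb` remains the way to confirm that a matching module exists.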

Various subject datasets, e.g. pfam, bowtie indexes of human and mouse, and NCBI bacterial datasets, are hosted as well. However, these datasets are not accessed with a module. They either don’t need updating, are updated infrequently by their source, or are not used frequently by our users, and they are updated only on user request. To use one of these databases, give its path in your script instead of loading a module. Below is an example using BLAST+ with a dataset not available as a module.

#!/bin/bash
#SBATCH --job-name=j_BLAST+
#SBATCH --partition=batch
#SBATCH --mail-type=ALL
#SBATCH --mail-user=username@uga.edu
#SBATCH --ntasks=1
#SBATCH --mem=10gb
#SBATCH --time=08:00:00
#SBATCH --output=BLAST+.%j.out
#SBATCH --error=BLAST+.%j.err

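# Run from the directory the job was submitted from
cd "$SLURM_SUBMIT_DIR"
module load BLAST+/2.13.0-gompi-2022a
# No database module is loaded; the database is given by its path
blastn -query example.fasta -out results.out -db /db/ncbiblast/refseq_microbial/fasta/09162020/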
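When a database is referenced by path, BLAST+ generally expects -db to name the database itself, i.e. the file prefix shared by its index files without any extension, so it can help to check what a directory actually contains before submitting a job. The sketch below lists candidate prefixes by looking for the standard index files (.nin for nucleotide, .pin for protein) and alias files (.nal, .pal). The function name `list_blast_dbs` is hypothetical.

```shell
# Hypothetical helper: print the database prefixes found in a directory,
# based on BLAST index files (.nin/.pin) and alias files (.nal/.pal).
list_blast_dbs() {
    dir=${1%/}   # drop one trailing slash, if present
    for f in "$dir"/*.nin "$dir"/*.pin "$dir"/*.nal "$dir"/*.pal; do
        # An unmatched pattern stays literal and does not exist; skip it.
        [ -e "$f" ] || continue
        # Strip the final extension to get the name usable with -db.
        printf '%s\n' "${f%.*}"
    done | sort -u
}
```

Each line printed by `list_blast_dbs /some/directory` is a value that could be passed to -db. If nothing is printed, the directory holds no formatted BLAST database at its top level.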

For datasets requested by individual lab groups, the GACRC encourages users to maintain their own copies, which can be installed in their group's shared "/work" area. Similarly, for datasets that are frequently updated by their source, the GACRC encourages users to maintain their own copies.
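A group that maintains its own copy may want to mirror the date-stamped layout used for the hosted BLAST datasets, so that older results can still be traced to the snapshot they used. The following is a minimal sketch under that assumption; the function name `new_db_dir` and the paths in the usage note are hypothetical.

```shell
# Hypothetical helper: create a date-stamped directory for a
# group-maintained database and print its path.
new_db_dir() {
    root=$1
    # Use the given stamp, or today's date as YYYYMMDD by default.
    stamp=${2:-$(date +%Y%m%d)}
    mkdir -p "$root/$stamp" && printf '%s\n' "$root/$stamp"
}
```

A BLAST database could then be built into the new directory with something like `makeblastdb -in my_sequences.fasta -dbtype nucl -out "$(new_db_dir /work/mylab/blastdb)/my_db"`, where /work/mylab/blastdb, my_sequences.fasta, and my_db are placeholder names, and searched later by passing the printed path plus `/my_db` to -db.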

The set of datasets that are regularly updated is open to review and can be expanded based on available GACRC resources.

Installed Bioinformatics Databases

Name	Version	Module Available
akblab	01/28/2018	no
Bacteria NCBI	12/21/2017	no
hg19, mm10 & mm9	07/29/2016	no
bowtie2	09/12/2024	yes
cellranger (GRCh38 and MM10)	12/19/2021	yes
CheckV	01/10/2023 (v1.5)	yes
conspred_ressources	02/02/2017	yes
dammit	03/17/2021 (v1.2)	yes
deconseq	12/19/2021 ("04242013")	yes
decontaMiner	08/29/2019	no
eggnog-mapper	04/16/2019	no
foldseek	04/11/2023	yes
funannotate	12/19/2021	yes
GTDB-Tk	01/15/2021	no
hg	09/20/2016	no
kegg	09/09/2016	no
Kraken2 (PlusPF, EuPathDB)	05/26/2024	yes
maker	05/22/2018 (v2.31.9)	yes
MetaCLADE	08/29/2019	no
NCBI BLAST Databases	beginning of every 3 months	yes
NCBI-FCS	12/01/2023 (v0.5.0)	yes
NCBI Fasta	04/06/2022	yes
ngs (canine and mouse)	11/03/2016	yes
nndb	05/03/2008	yes
PB	04/24/2019	no
pfam	10/07/2021	no
phylosift	07/20/2020	no
Refseq_genomic		no
repbase	01/09/2017	no
rfam	02/01/2017 (v12.2)	yes
seqdb	05/22/2012	yes
sortmerna	03/04/2016	no
taxdb	02/23/2019	yes
topcons2	02/18/2021	no
Uniprot	06/15/2023	yes
Uniref	02/25/2022	yes
virulencefinder	05/06/2024	yes
wublast	04/01/2020	no

@@ Line 1: / Line 1: @@
-As part of our services, the GACRC builds and hosts local copies of frequently cited application data, and provides assistance for sharing data among GACRC members, in the commonly shared "/db" filesystem.
+As part of our services, the GACRC builds and hosts local copies of frequently cited application data, and provides assistance for sharing data among GACRC members.  Datasets are located in the commonly shared "/db" filesystem. [[BLAST Databases-Sapelo2 | NCBI BLAST datasets]] are pre-formatted to work with BLAST and BLAST+ and are located in "/db/ncbiblast/" and are organized by date.
-Of the public local databases in /db, NCBI's nr and nt datasets are the most commonly used at the GACRC. GACRC staff will update these datasets every other month in fasta format. From these updated datasets, we will also build NCBI Blast and  WUBlast databases in both nucleotide and protein formats.
+NCBI BLAST Datasets can be loaded in the same way as software modules. This allows users to replicate results by always being able to use a time stamped version of a database. The time stamp corresponds to the download date of the dataset. To search for available NCBI BLAST dataset modules run the command:
-Various subject datasets, e.g. pfam, bowtie indexes of human and mouse, and NCBI bacterial datasets, are hosted as well. These datasets either don’t need updating, the source does not update frequently, or the datasets are not used frequently by our users. These datasets will only be updated by user request.
+<code>ml spider ncbiblastdb</code>
-For datasets requested by individual lab groups, GACRC staff will assist in setting up a group-shared environment and request that group members maintain their database files there.
+You can then load a database by running the command module load. For example to load the databases timestamped at 02/02/2024:
-For datasets frequently updated by their source, the GACRC encourages users to maintain their own copies of these public databases.
+<code>module load ncbiblastdb/20240202</code>
+Loading this module sets the environment variable BLASTDB to /db/ncbiblast/20240202. This allows BLAST+ to search that directory for databases. You can then use the name of the database you would like to use.(nt, nr, etc.) Here is an example of a shell script to run BLAST+ on the batch queue using the database nt:
+<div class="gscript2">
+<nowiki>#</nowiki>!/bin/bash<br>
+<nowiki>#</nowiki>SBATCH --job-name=j_BLAST+<br>
+<nowiki>#</nowiki>SBATCH --partition=batch<br>
+<nowiki>#</nowiki>SBATCH --mail-type=ALL<br>
+<nowiki>#</nowiki>SBATCH --mail-user=<u>username@uga.edu</u><br>
+<nowiki>#</nowiki>SBATCH --ntasks=<u>1</u><br>
+<nowiki>#</nowiki>SBATCH --mem=<u>10gb</u><br>
+<nowiki>#</nowiki>SBATCH --time=<u>08:00:00</u><br>
+<nowiki>#</nowiki>SBATCH --output=BLAST+.%j.out<br>
+<nowiki>#</nowiki>SBATCH --error=BLAST+.%j.err<br>
+cd $SLURM_SUBMIT_DIR<br>
+module load BLAST+/2.13.0-gompi-2022a<br>
+module load ncbiblastdb/20240202<br>
+blastn -query example.fasta -out results.out -db nt<br>
+</div>
+In your actual submission script, use your own discretion for the Slurm header values.
+NCBI's nr and nt datasets are the most commonly used at the GACRC. GACRC staff will update these datasets '''every 3 months''' as well as other NCBI BLAST datasets: cdd_delta, human_genome, mouse_genome, nrte, refseq_protein, refseq_rna, swissprot, and taxdb datasets.
+Various subject datasets, e.g. pfam, bowtie indexes of human and mouse, and NCBI bacterial datasets, are hosted as well. However, these datasets are not accessed with a module. These datasets either don’t need updating, the source does not update frequently, or the datasets are not used frequently by our users. These datasets will only be updated by user request.
+Instead of using a module, to use these databases, use the path to the databases in your script. Below is an example using BLAST+ with a dataset not available as a module.
+<div class="gscript2">
+<nowiki>#</nowiki>!/bin/bash<br>
+<nowiki>#</nowiki>SBATCH --job-name=j_BLAST+<br>
+<nowiki>#</nowiki>SBATCH --partition=batch<br>
+<nowiki>#</nowiki>SBATCH --mail-type=ALL<br>
+<nowiki>#</nowiki>SBATCH --mail-user=<u>username@uga.edu</u><br>
+<nowiki>#</nowiki>SBATCH --ntasks=<u>1</u><br>
+<nowiki>#</nowiki>SBATCH --mem=<u>10gb</u><br>
+<nowiki>#</nowiki>SBATCH --time=<u>08:00:00</u><br>
+<nowiki>#</nowiki>SBATCH --output=BLAST+.%j.out<br>
+<nowiki>#</nowiki>SBATCH --error=BLAST+.%j.err<br>
+cd $SLURM_SUBMIT_DIR<br>
+module load BLAST+/2.13.0-gompi-2022a<br>
+blastn -query example.fasta -out results.out -db /db/ncbiblast/refseq_microbial/fasta/09162020/ <br>
+</div>
+For datasets requested by individual lab groups, GACRC encourages users to maintain their own copies of databases, whose files can be installed in their group's shared "/work" area.
+Similarly for datasets frequently updated by their source, the GACRC encourages users to maintain their own copies of these public databases.
 The set of data which are regularly updated is open to review, and can be expanded based on available GACRC resources.
@@ Line 17: / Line 66: @@
 ! scope="col" | Name
 ! scope="col" class="unsortable" | Version
-! scope="col" | Cluster
+! scope="col" | Module Available
+|-
+| akblab ||  01/28/2018 || no
+|-
+| [[Bacteria NCBI-sapelo2|Bacteria NCBI]] ||  12/21/2017 || no
+|-
+| [[bowtie2|hg19,mm10 & mm9]] ||  07/29/2016 || no
+|-
+|bowtie2
+|09/12/2024
+|yes
+|-
+| cellranger (GRCh38 and MM10) ||  12/19/2021 || yes
+|-
+|CheckV
+|01/10/2023 (v1.5)
+|yes
+|-
+| conspred_ressources ||  02/02/2017 || yes
+|-
+| dammit || 03/17/2021 (v1.2)  || yes
+|-
+| deconseq || 12/19/2021 ("04242013") || yes
+|-
+| decontaMiner-Sapelo2|decontaMiner || 08/29/2019  || no
+|-
+| eggnog-mapper || 04/16/2019  || no
+|-
+|foldseek
+|04/11/2023
+|yes
+|-
+| funannotate || 12/19/2021  || yes
+|-
+| DTDB-Tk || 01/15/2021  || no
+|-
+| hg || 09/20/2016  || no
+|-
+| kegg || 09/09/2016  || no
+|-
+|Kraken2 (PlusPF, EuPathDB)
+|05/26/2024
+|yes
+|-
+| maker || 05/22/2018 (v2.31.9)  || yes
+|-
+| MetaCLADE || 08/29/2019  || no
+|-
+| [[BLAST Databases-Sapelo2 |NCBI BLAST Databases]] || beginning of every 3 months || yes
+|-
+|NCBI-FCS
+|12/01/2023 (v0.5.0)
+|yes
+|-
+| [[NCBI Fasta-Sapelo2|NCBI Fasta]] || 04/06/2022 || yes
+|-
+| ngs (canine and mouse) || 11/03/2016 || yes
+|-
+| nndb || 05/03/2008  || yes
+|-
+| PB || 04/24/2019  || no
+|-
+| [[PfamDB-Sapelo2|pfam]] || 10/07/2021 || no
 |-
-| [[Bacteria NCBI-sapelo2|Bacteria NCBI]] ||  12/21/2017 || [[:Category:Sapelo2|Sapelo2]]
+| phylosift || 07/20/2020  || no
 |-
-| [[gss-Sapelo2|gss]] ||   || [[:Category:Sapelo2|Sapelo2]]
+| [[Refseq-Sapelo2|Refseq_genomic]] ||   || no
 |-
-| [[htgs-Sapelo2|htgs]] ||  || [[:Category:Sapelo2|Sapelo2]]
+| repbase || 01/09/2017  || no
 |-
-| [[hg19-Sapelo2 | hg]] ||  || [[:Category:Sapelo2|Sapelo2]]
+| rfam || 02/01/2017 (v12.2)  || yes
 |-
-| [[BLAST Databases-Sapelo2 | NCBI BLAST Database]] || every other month || [[:Category:Sapelo2|Sapelo2]]
+| seqdb || 05/22/2012  || yes
 |-
-| [[NCBI Fasta-Sapelo2|NCBI Fasta]] || every other month || [[:Category:Sapelo2|Sapelo2]]
+| sortmerna || 03/04/2016  || no
 |-
-| [[PfamDB-Sapelo2|pfam]] || 27.0  || [[:Category:Sapelo2|Sapelo2]]
+| [[TaxDB-Sapelo2|taxdb]] || 02/23/2019   || yes
 |-
-| [[Refseq-Sapelo2|Refseq]] ||   || [[:Category:Sapelo2|Sapelo2]]
+| topcons2 || 02/18/2021  || no
 |-
-| [[TaxDB-Sapelo2|TaxDB]] ||    || [[:Category:Sapelo2|Sapelo2]]
+| [[Uniprot-Sapelo2|Uniprot]] || 06/15/2023   || yes
 |-
-| [[Uniprot-Sapelo2|Uniprot]] || 06/28/2018   || [[:Category:Sapelo2|Sapelo2]]
+| [[Uniref-Sapelo2|Uniref]] || 02/25/2022 || yes
 |-
-| [[Uniref-Sapelo2|Uniref]] || 06/28/2018  || [[:Category:Sapelo2|Sapelo2]]
+|virulencefinder
+|05/06/2024
+|yes
 |-
-| [[wublast-Sapelo2|wublast]] ||    || [[:Category:Sapelo2|Sapelo2]]
+| [[wublast-Sapelo2|wublast]] ||  04/01/2020  || no
 |-
 |}
