Bioinformatics Databases
Latest revision as of 14:22, 13 November 2024
As part of our services, the GACRC builds and hosts local copies of frequently cited application data and helps GACRC members share data. Datasets are located in the commonly shared "/db" filesystem. NCBI BLAST datasets are pre-formatted to work with BLAST and BLAST+; they are located in "/db/ncbiblast/" and organized by date.
NCBI BLAST datasets can be loaded in the same way as software modules. This lets users replicate results by always using the same timestamped version of a database. The timestamp corresponds to the download date of the dataset. To search for available NCBI BLAST dataset modules, run the command:
ml spider ncbiblastdb
You can then load a database with the module load command. For example, to load the databases timestamped 02/02/2024:
module load ncbiblastdb/20240202
Loading this module sets the environment variable BLASTDB to /db/ncbiblast/20240202, which allows BLAST+ to search that directory for databases. You can then refer to a database by its name alone (nt, nr, etc.). Here is an example of a shell script to run BLAST+ on the batch queue using the database nt:
#!/bin/bash
#SBATCH --job-name=j_BLAST+
#SBATCH --partition=batch
#SBATCH --mail-type=ALL
#SBATCH --mail-user=username@uga.edu
#SBATCH --ntasks=1
#SBATCH --mem=10gb
#SBATCH --time=08:00:00
#SBATCH --output=BLAST+.%j.out
#SBATCH --error=BLAST+.%j.err
cd $SLURM_SUBMIT_DIR
module load BLAST+/2.13.0-gompi-2022a
module load ncbiblastdb/20240202
blastn -query example.fasta -out results.out -db nt
In your actual submission script, use your own discretion for the Slurm header values.
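The relationship between a module timestamp and the BLASTDB directory described above can be sketched as a small shell helper. This is only an illustration: the function name `blastdb_dir_for` is hypothetical, and it assumes nothing beyond the "/db/ncbiblast/<timestamp>" layout stated on this page.

```shell
# Hypothetical helper (not part of any GACRC module): map a module
# timestamp in YYYYMMDD form to the directory that the matching
# ncbiblastdb module is described as exporting in BLASTDB.
blastdb_dir_for() {
    case "$1" in
        # Accept exactly eight digits, e.g. 20240202.
        [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9])
            printf '/db/ncbiblast/%s\n' "$1"
            ;;
        *)
            echo "expected a YYYYMMDD timestamp, got: $1" >&2
            return 1
            ;;
    esac
}
```

After running `module load ncbiblastdb/20240202` in a job, comparing `$BLASTDB` with the output of `blastdb_dir_for 20240202` is one way to confirm that the expected dataset version is in use.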
NCBI's nr and nt datasets are the most commonly used at the GACRC. GACRC staff update these datasets every 3 months, along with the other NCBI BLAST datasets: cdd_delta, human_genome, mouse_genome, nrte, refseq_protein, refseq_rna, swissprot, and taxdb.
Various subject datasets, e.g. pfam, bowtie indexes of human and mouse, and NCBI bacterial datasets, are hosted as well. However, these datasets are not accessed with a module: they either do not need updating, come from a source that does not update frequently, or are not used frequently by our users. They are updated only by user request. To use one of these databases, give the path to the database in your script instead of loading a module. Below is an example using BLAST+ with a dataset not available as a module.
#!/bin/bash
#SBATCH --job-name=j_BLAST+
#SBATCH --partition=batch
#SBATCH --mail-type=ALL
#SBATCH --mail-user=username@uga.edu
#SBATCH --ntasks=1
#SBATCH --mem=10gb
#SBATCH --time=08:00:00
#SBATCH --output=BLAST+.%j.out
#SBATCH --error=BLAST+.%j.err
cd $SLURM_SUBMIT_DIR
module load BLAST+/2.13.0-gompi-2022a
blastn -query example.fasta -out results.out -db /db/ncbiblast/refseq_microbial/fasta/09162020/
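The directory name in the path above (09162020) appears to be written month-first, unlike the YYYYMMDD timestamps used by the ncbiblastdb modules. Assuming it is an MMDDYYYY date, the sketch below converts it to the module-style form so that dataset versions can be compared or sorted. The function name `mmddyyyy_to_yyyymmdd` is hypothetical.

```shell
# Hypothetical helper: convert an MMDDYYYY directory name (assumed format)
# into the YYYYMMDD form used by the ncbiblastdb module timestamps.
mmddyyyy_to_yyyymmdd() {
    case "$1" in
        # Accept exactly eight digits, e.g. 09162020.
        [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9])
            # Reorder the three groups: MM DD YYYY -> YYYY MM DD.
            printf '%s\n' "$1" | sed 's/^\(..\)\(..\)\(....\)$/\3\1\2/'
            ;;
        *)
            echo "expected an MMDDYYYY date, got: $1" >&2
            return 1
            ;;
    esac
}
```

The helper only reorders digits; it does not check that the result is a valid calendar date.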
For datasets requested by individual lab groups, the GACRC encourages users to maintain their own copies of databases, whose files can be installed in their group's shared "/work" area. Similarly, for datasets frequently updated by their source, the GACRC encourages users to maintain their own copies of these public databases.
The set of data which are regularly updated is open to review, and can be expanded based on available GACRC resources.
Installed Bioinformatics Databases
Name | Version | Module Available |
---|---|---|
akblab | 01/28/2018 | no |
Bacteria NCBI | 12/21/2017 | no |
hg19,mm10 & mm9 | 07/29/2016 | no |
bowtie2 | 09/12/2024 | yes |
cellranger (GRCh38 and MM10) | 12/19/2021 | yes |
CheckV | 01/10/2023 (v1.5) | yes |
conspred_ressources | 02/02/2017 | yes |
dammit | 03/17/2021 (v1.2) | yes |
deconseq | 12/19/2021 ("04242013") | yes |
decontaMiner | 08/29/2019 | no |
eggnog-mapper | 04/16/2019 | no |
foldseek | 04/11/2023 | yes |
funannotate | 12/19/2021 | yes |
GTDB-Tk | 01/15/2021 | no |
hg | 09/20/2016 | no |
kegg | 09/09/2016 | no |
Kraken2 (PlusPF, EuPathDB) | 05/26/2024 | yes |
maker | 05/22/2018 (v2.31.9) | yes |
MetaCLADE | 08/29/2019 | no |
NCBI BLAST Databases | beginning of every 3 months | yes |
NCBI-FCS | 12/01/2023 (v0.5.0) | yes |
NCBI Fasta | 04/06/2022 | yes |
ngs (canine and mouse) | 11/03/2016 | yes |
nndb | 05/03/2008 | yes |
PB | 04/24/2019 | no |
pfam | 10/07/2021 | no |
phylosift | 07/20/2020 | no |
Refseq_genomic | | no |
repbase | 01/09/2017 | no |
rfam | 02/01/2017 (v12.2) | yes |
seqdb | 05/22/2012 | yes |
sortmerna | 03/04/2016 | no |
taxdb | 02/23/2019 | yes |
topcons2 | 02/18/2021 | no |
Uniprot | 06/15/2023 | yes |
Uniref | 02/25/2022 | yes |
virulencefinder | 05/06/2024 | yes |
wublast | 04/01/2020 | no |
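
If the table above is saved as pipe-delimited text, the rows that have a module can be pulled out with a short awk filter. This is a minimal sketch: the function name `modules_available` is hypothetical, and it assumes the three-column "Name | Version | Module Available |" layout shown here.

```shell
# Hypothetical helper: read the pipe-delimited table on stdin and print
# the name of every dataset whose "Module Available" column is "yes".
modules_available() {
    awk -F'|' '
        {
            name = $1
            flag = $3
            # Trim surrounding whitespace from both fields.
            gsub(/^[ \t]+|[ \t]+$/, "", name)
            gsub(/^[ \t]+|[ \t]+$/, "", flag)
            # Header and separator rows never have "yes" in column 3,
            # so they are skipped without special handling.
            if (flag == "yes") print name
        }'
}
```

For example, `modules_available < databases.txt` (where `databases.txt` is a placeholder for a saved copy of the table) lists the datasets that can be loaded with `module load`; the others must be referenced by path.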