BLAST Databases-Sapelo2: Difference between revisions

From Research Computing Center Wiki
No edit summary
(Updated to reflect the new version of the databases created on 11/01/2024)
 
(9 intermediate revisions by 5 users not shown)
Line 1: Line 1:
[[Category:Sapelo2]][[Category:Software]][[Category:Bioinformatics]][[Category:Bioinformatics Database]]     
[[Category:Sapelo2]][[Category:Software]][[Category:Bioinformatics]][[Category:Bioinformatics Database]]     


'''Blast database - version 4'''
'''Blast databases'''


Each database is located at a directory with name pattern as /db/ncbiblast/''dbname''/''month-date-year'', where ''dbname'' is the name of the database, such as nr, nt, taxdb, refseq_rna, refseq_protein, mouse_genome, human_genome, swissprot, etc.  
Blast databases are pre-formatted to work with blast commands from BLAST and BLAST+. These databases are similar to what you could build using the makeblastdb command with FASTA files. Each database is located in a directory with the name pattern /db/ncbiblast/YYYYMMDD, where YYYYMMDD is the download date. For example, databases downloaded on February 01, 2022 are available in /db/ncbiblast/20220201. Starting with the download on August 2nd, 2024 (i.e., /db/ncbiblast/20240802), the following databases are available, each stored together with the other databases downloaded at the same time:
<pre class="gscript">
nr
nt
taxdb
refseq_rna
refseq_protein
mouse_genome
human_genome
swissprot
cdd_delta
env_nr
env_nt
tsa_nr
tsa_nt
ref_prok_rep_genomes
ref_euk_rep_genomes
ref_viroids_rep_genomes
ref_viruses_rep_genomes
16S_ribosomal_RNA
18S_fungal_sequences
28S_fungal_sequences
LSU_eukaryote_rRNA
LSU_prokaryote_rRNA
SSU_eukaryote_rRNA
Betacoronavirus
ITS_RefSeq_Fungi
ITS_eukaryote_sequences
mito
</pre>
 
Note that the database refseq_genomic is no longer available to download as a pre-formatted database. 
 
 
 
NCBI BLAST datasets can be loaded just like software modules. This lets users replicate results, because a time-stamped version of a database is always available. The time stamp corresponds to the download date of the dataset. To search for available NCBI BLAST dataset modules, run the command:
 
<code>ml spider ncbiblastdb</code>
 
You can then load a database with the module load command. For example, to load the databases time-stamped 11/02/2024:
 
<code>module load ncbiblastdb/20241102</code>
 
Loading this module sets the environment variable BLASTDB to /db/ncbiblast/20241102, so BLAST+ searches that directory for databases. You can then refer to a database by name alone (nt, nr, etc.). Here is an example of a shell script to run BLAST+ on the batch queue using the database nt:
<div class="gscript2">
<nowiki>#</nowiki>!/bin/bash<br>
<nowiki>#</nowiki>SBATCH --job-name=j_BLAST+<br>
<nowiki>#</nowiki>SBATCH --partition=batch<br>       
<nowiki>#</nowiki>SBATCH --mail-type=ALL<br>
<nowiki>#</nowiki>SBATCH --mail-user=<u>username@uga.edu</u><br> 
<nowiki>#</nowiki>SBATCH --ntasks=<u>1</u><br> 
<nowiki>#</nowiki>SBATCH --mem=<u>10gb</u><br>   
<nowiki>#</nowiki>SBATCH --time=<u>08:00:00</u><br> 
<nowiki>#</nowiki>SBATCH --output=BLAST+.%j.out<br>
<nowiki>#</nowiki>SBATCH --error=BLAST+.%j.err<br>
cd "$SLURM_SUBMIT_DIR"<br>
module load BLAST+/2.14.1-gompi-2023a<br> 
module load ncbiblastdb/20241102<br> 
blastn -query example.fasta -out results.out -db nt<br> 
</div>


For example, an nr database downloaded on June 04, 2020 will be available in /db/ncbiblast/nr/06042020.
In your actual submission script, adjust the Slurm header values (memory, time, tasks) to suit your job.


Available databases:
NCBI's nr and nt datasets are the most commonly used at the GACRC. GACRC staff will update these datasets '''every 3 months''' along with other potentially helpful NCBI BLAST datasets.
<pre class="gscript">
/db/ncbiblast/human_genome/06042020
/db/ncbiblast/mouse_genome/06042020
/db/ncbiblast/nr/06042020
/db/ncbiblast/nt/06042020
/db/ncbiblast/refseq_rna/06042020
/db/ncbiblast/refseq_protein/06042020
/db/ncbiblast/swissprot/06042020
/db/ncbiblast/taxdb/06042020
</pre>




'''Blast database - version 5'''
'''Blast database - old version 5'''


Each database is located at a directory with name pattern as /db/ncbiblast.v5/''dbname'', where ''dbname'' is the name of the database, such as nr, nt, taxdb, refseq_rna, swissprot, etc.  
According to https://ncbiinsights.ncbi.nlm.nih.gov/tag/blastdbv5/, around Feb. 04, 2020 NCBI dropped the _v5 suffix for the v5 Blast databases.


For example, an nr database downloaded on June 04, 2020 will be available in /db/ncbiblast.v5/nr
An old version of the v5 databases (downloaded in June 2019) is located in directories with the name pattern /db/ncbiblast.v5/''dbname'', where ''dbname'' is the name of the database, such as nr, nt, taxdb, refseq_rna, swissprot, etc.


These old versions of the nr and nt databases still have the _v5 suffix, i.e. they are called nr_v5 and nt_v5, respectively.


[[#top|Back to Top]]
[[#top|Back to Top]]

Latest revision as of 13:43, 1 November 2024


Blast databases

Blast databases are pre-formatted to work with blast commands from BLAST and BLAST+. These databases are similar to what you could build using the makeblastdb command with FASTA files. Each database is located in a directory with the name pattern /db/ncbiblast/YYYYMMDD, where YYYYMMDD is the download date. For example, databases downloaded on February 01, 2022 are available in /db/ncbiblast/20220201. Starting with the download on August 2nd, 2024 (i.e., /db/ncbiblast/20240802), the following databases are available, each stored together with the other databases downloaded at the same time:

nr
nt
taxdb
refseq_rna
refseq_protein
mouse_genome
human_genome
swissprot
cdd_delta
env_nr
env_nt
tsa_nr
tsa_nt
ref_prok_rep_genomes
ref_euk_rep_genomes
ref_viroids_rep_genomes
ref_viruses_rep_genomes
16S_ribosomal_RNA
18S_fungal_sequences
28S_fungal_sequences
LSU_eukaryote_rRNA
LSU_prokaryote_rRNA
SSU_eukaryote_rRNA
Betacoronavirus
ITS_RefSeq_Fungi
ITS_eukaryote_sequences
mito

Note that the database refseq_genomic is no longer available to download as a pre-formatted database.
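
The directory naming described above can be sketched as two small shell helpers. This is only an illustration: the function names snapshot_dir and snapshot_has_db are hypothetical and not part of any GACRC tooling, and the existence check assumes that the files of a BLAST database share the database name as a prefix (for example nt.nal), which is a heuristic rather than a full validation.

```shell
# Hypothetical helper: print the snapshot directory for a download date
# given as YYYY-MM-DD. The default root /db/ncbiblast comes from the text.
snapshot_dir() (
    date_in=$1
    root=${2:-/db/ncbiblast}
    case $date_in in
        [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]) ;;
        *) echo "expected YYYY-MM-DD, got: $date_in" >&2; exit 1 ;;
    esac
    # Drop the dashes: 2022-02-01 becomes 20220201.
    printf '%s/%s\n' "$root" "$(printf '%s' "$date_in" | tr -d '-')"
)

# Hypothetical helper: succeed if at least one file named <dbname>.<something>
# exists in the given snapshot directory (a heuristic, not a full check).
snapshot_has_db() (
    dir=$1
    name=$2
    for f in "$dir/$name".*; do
        [ -e "$f" ] && exit 0
    done
    exit 1
)

snapshot_dir 2022-02-01    # prints /db/ncbiblast/20220201
```

On the cluster, something like snapshot_has_db "$(snapshot_dir 2024-08-02)" nt could then be used to confirm that a database is present in a given snapshot before a job is submitted.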


NCBI BLAST datasets can be loaded just like software modules. This lets users replicate results, because a time-stamped version of a database is always available. The time stamp corresponds to the download date of the dataset. To search for available NCBI BLAST dataset modules, run the command:

ml spider ncbiblastdb

You can then load a database with the module load command. For example, to load the databases time-stamped 11/02/2024:

module load ncbiblastdb/20241102

Loading this module sets the environment variable BLASTDB to /db/ncbiblast/20241102, so BLAST+ searches that directory for databases. You can then refer to a database by name alone (nt, nr, etc.). Here is an example of a shell script to run BLAST+ on the batch queue using the database nt:

#!/bin/bash
#SBATCH --job-name=j_BLAST+
#SBATCH --partition=batch
#SBATCH --mail-type=ALL
#SBATCH --mail-user=username@uga.edu
#SBATCH --ntasks=1
#SBATCH --mem=10gb
#SBATCH --time=08:00:00
#SBATCH --output=BLAST+.%j.out
#SBATCH --error=BLAST+.%j.err

cd "$SLURM_SUBMIT_DIR"
module load BLAST+/2.14.1-gompi-2023a
module load ncbiblastdb/20241102
blastn -query example.fasta -out results.out -db nt

In your actual submission script, adjust the Slurm header values (memory, time, tasks) to suit your job.
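
A job that starts without the database module loaded, or with a missing query file, typically fails only after it has waited in the queue. The sketch below shows one way to check for this early in a script like the one above. The function name check_blast_inputs is hypothetical, and the check assumes that BLASTDB holds a single directory, as set by the ncbiblastdb module.

```shell
# Hypothetical pre-flight check; call it after the two "module load" lines.
check_blast_inputs() (
    query=$1
    if [ -z "${BLASTDB:-}" ]; then
        echo "BLASTDB is not set; load an ncbiblastdb module first" >&2
        exit 1
    fi
    # Assumption: BLASTDB is one directory, not a colon-separated list.
    if [ ! -d "$BLASTDB" ]; then
        echo "BLASTDB directory not found: $BLASTDB" >&2
        exit 1
    fi
    # -s is true only for a file that exists and is not empty.
    if [ ! -s "$query" ]; then
        echo "query file missing or empty: $query" >&2
        exit 1
    fi
    echo "using databases in $BLASTDB"
)

# Possible use inside the job script:
#   check_blast_inputs example.fasta || exit 1
#   blastn -query example.fasta -out results.out -db nt
```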

NCBI's nr and nt datasets are the most commonly used at the GACRC. GACRC staff will update these datasets every 3 months along with other potentially helpful NCBI BLAST datasets.
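
Since new snapshots appear over time, it can be useful to see which one is the newest before pinning a time stamp in a script. The sketch below relies on the fact that YYYYMMDD names sort chronologically when sorted as text. The function name latest_snapshot is hypothetical; ml spider ncbiblastdb remains the supported way to list the available modules.

```shell
# Hypothetical helper: print the newest YYYYMMDD directory name under a root.
latest_snapshot() (
    root=${1:-/db/ncbiblast}
    latest=
    # The pattern matches only names made of exactly eight digits.
    for d in "$root"/[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]; do
        [ -d "$d" ] || continue
        # Globs expand in sorted order, so the last match is the newest date.
        latest=$d
    done
    [ -n "$latest" ] || exit 1
    basename "$latest"
)

# Possible use on the cluster:
#   module load "ncbiblastdb/$(latest_snapshot)"
```

For reproducible results, record the time stamp that was actually used rather than resolving the newest snapshot again on each run.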


Blast database - old version 5

According to https://ncbiinsights.ncbi.nlm.nih.gov/tag/blastdbv5/, around Feb. 04, 2020 NCBI dropped the _v5 suffix for the v5 Blast databases.

An old version of the v5 databases (downloaded in June 2019) is located in directories with the name pattern /db/ncbiblast.v5/dbname, where dbname is the name of the database, such as nr, nt, taxdb, refseq_rna, swissprot, etc.

These old versions of the nr and nt databases still have the _v5 suffix, i.e. they are called nr_v5 and nt_v5, respectively.
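
The naming rule for the old v5 copies can be sketched as a small helper that builds the value to pass to -db. The function name legacy_v5_db is hypothetical. The sketch also assumes that the database files sit directly inside the dbname directory and that databases other than nr and nt use their plain name; neither is confirmed above, so check the directory contents before relying on it.

```shell
# Hypothetical helper: build a -db argument for the old v5 databases.
legacy_v5_db() (
    name=$1
    root=${2:-/db/ncbiblast.v5}
    case $name in
        nr|nt) base=${name}_v5 ;;   # per the text, nr and nt keep the _v5 suffix
        *)     base=$name ;;        # assumption: other databases use the plain name
    esac
    printf '%s/%s/%s\n' "$root" "$name" "$base"
)

legacy_v5_db nt    # prints /db/ncbiblast.v5/nt/nt_v5

# Possible use:
#   blastn -query example.fasta -out results.out -db "$(legacy_v5_db nt)"
```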

Back to Top