BLAST Databases-Sapelo2: Difference between revisions
(Updated to reflect the new version of the databases created on 11/01/2024)
Latest revision as of 13:43, 1 November 2024
Blast databases
Blast databases are pre-formatted to work with the blast commands from BLAST and BLAST+. They are similar to what you could build from FASTA files with the makeblastdb command. Each set of databases is stored in a directory named /db/ncbiblast/YYYYMMDD, where YYYYMMDD is the download date. For example, databases downloaded on February 1, 2022 are available in /db/ncbiblast/20220201. Starting with the download of August 2, 2024 (i.e., /db/ncbiblast/20240802), each dated directory contains the following databases, all downloaded at the same time:
nr
nt
taxdb
refseq_rna
refseq_protein
mouse_genome
human_genome
swissprot
cdd_delta
env_nr
env_nt
tsa_nr
tsa_nt
ref_prok_rep_genomes
ref_euk_rep_genomes
ref_viroids_rep_genomes
ref_viruses_rep_genomes
16S_ribosomal_RNA
18S_fungal_sequences
28S_fungal_sequences
LSU_eukaryote_rRNA
LSU_prokaryote_rRNA
SSU_eukaryote_rRNA
Betacoronavirus
ITS_RefSeq_Fungi
ITS_eukaryote_sequences
mito
Note that the database refseq_genomic is no longer available to download as a pre-formatted database.
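The directory naming pattern described above can be sketched as a small shell helper. The function name blastdb_dir_for_date is hypothetical (it is not a GACRC tool), and the sketch assumes GNU date, whose -d option parses a date string:

```shell
#!/bin/bash
# Hypothetical helper: map a download date to the dated database directory.
# Assumes GNU date (the -d option is not available in every date implementation).
blastdb_dir_for_date() {
    local stamp
    # Convert the given date to the YYYYMMDD form used in the directory names.
    stamp=$(date -d "$1" +%Y%m%d) || return 1
    printf '/db/ncbiblast/%s\n' "$stamp"
}

blastdb_dir_for_date "2022-02-01"   # prints /db/ncbiblast/20220201
```

The helper only builds the path; it does not check that a download actually exists for that date.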
NCBI BLAST datasets can be loaded just like software modules. Because each module is time-stamped with the download date of its dataset, you can replicate earlier results by loading the same version of a database. To search for available NCBI BLAST dataset modules, run the command:
ml spider ncbiblastdb
You can then load a database version with the module load command. For example, to load the databases time-stamped 11/02/2024:
module load ncbiblastdb/20241102
Loading this module sets the environment variable BLASTDB to /db/ncbiblast/20241102, which tells BLAST+ to search that directory for databases. You can then refer to a database by name alone (nt, nr, etc.). Here is an example of a shell script to run BLAST+ on the batch queue using the database nt:
#!/bin/bash
#SBATCH --job-name=j_BLAST+
#SBATCH --partition=batch
#SBATCH --mail-type=ALL
#SBATCH --mail-user=username@uga.edu
#SBATCH --ntasks=1
#SBATCH --mem=10gb
#SBATCH --time=08:00:00
#SBATCH --output=BLAST+.%j.out
#SBATCH --error=BLAST+.%j.err
cd "$SLURM_SUBMIT_DIR"
module load BLAST+/2.14.1-gompi-2023a
module load ncbiblastdb/20241102
blastn -query example.fasta -out results.out -db nt
In your actual submission script, adjust the Slurm header values (such as memory, time limit, number of tasks, and e-mail address) to suit your own job.
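Because BLAST+ finds the databases through the BLASTDB variable, a job can fail late if the ncbiblastdb module was not loaded. The following is a minimal sketch of a pre-flight check; the function name check_blastdb is hypothetical and is not provided by the module:

```shell
#!/bin/bash
# Hypothetical pre-flight check: confirm that BLASTDB is set and points at
# an existing directory before starting a long BLAST+ run.
check_blastdb() {
    if [ -z "${BLASTDB:-}" ]; then
        echo "BLASTDB is not set; load an ncbiblastdb module first" >&2
        return 1
    fi
    if [ ! -d "$BLASTDB" ]; then
        echo "BLASTDB directory not found: $BLASTDB" >&2
        return 1
    fi
    echo "Using BLAST databases in $BLASTDB"
}
```

In a submission script, this could be called as check_blastdb || exit 1 right after the module load lines. For a closer look at a single database, the BLAST+ command blastdbcmd -db nt -info prints a summary of it.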
NCBI's nr and nt datasets are the most commonly used at the GACRC. GACRC staff will update these datasets every 3 months along with other potentially helpful NCBI BLAST datasets.
Blast database - old version 5
According to https://ncbiinsights.ncbi.nlm.nih.gov/tag/blastdbv5/, around Feb. 04, 2020, NCBI dropped the _v5 suffix for the v5 Blast databases.
An old version of the v5 databases (downloaded in June 2019) is located in directories named /db/ncbiblast.v5/dbname, where dbname is the name of the database, such as nr, nt, taxdb, refseq_rna, swissprot, etc.
These old versions of the nr and nt databases still have the _v5 suffix, i.e. they are called nr_v5 and nt_v5, respectively.
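The naming of the old v5 databases can be summarized in a short sketch. The function name v5_db_location is hypothetical, and the code assumes that only nr and nt carry the _v5 suffix while other database names are used unchanged; the text above confirms the suffix only for nr and nt, so check the directory contents before relying on this:

```shell
#!/bin/bash
# Hypothetical helper: print the directory and the -db name for an old
# v5 database (June 2019 download).
# Assumption: only nr and nt carry the _v5 suffix; other names are unchanged.
v5_db_location() {
    local dbname=$1
    local name=$dbname
    case "$dbname" in
        nr|nt) name="${dbname}_v5" ;;
    esac
    printf '%s %s\n' "/db/ncbiblast.v5/$dbname" "$name"
}

v5_db_location nt   # prints /db/ncbiblast.v5/nt nt_v5
```

Under the same assumption, setting BLASTDB to the printed directory and passing the printed name to -db (for example, -db nt_v5) would select the old database.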