BLAST Databases-Sapelo2: Difference between revisions
Revision as of 09:01, 17 February 2022
Blast databases
Blast databases are pre-formatted to work with blast commands from BLAST and BLAST+. These databases are similar to what you could build using the makeblastdb command with FASTA files. Each set of databases is located in a directory with the name pattern /db/ncbiblast/year-month-date, where the date is the download date. For example, databases downloaded on February 01, 2022 are available in /db/ncbiblast/20220201. The following databases are currently available, and each is stored together with the other databases downloaded at the same time: nr, nt, taxdb, refseq_rna, refseq_protein, mouse_genome, human_genome and swissprot. The database refseq_genomic is no longer available to download as a pre-formatted database.
NCBI BLAST datasets can be loaded just like software modules. This lets users replicate results, because a timestamped version of a database always remains available. The timestamp corresponds to the download date of the dataset. To search for available NCBI BLAST dataset modules, run the command:
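As a minimal sketch of the naming pattern described above, the helper below builds the directory for a given download date. The function name blastdb_dir is hypothetical and not part of the GACRC setup; it only assumes that the date is written as eight digits, as in the 20220201 example.

```shell
# Hypothetical helper (illustration only): print the database directory
# for a download date written as eight digits, e.g. 20220201.
blastdb_dir() {
    local stamp="$1"
    # Reject anything that is not exactly eight digits.
    case "$stamp" in
        [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]) ;;
        *) echo "expected a date as eight digits, got: $stamp" >&2; return 1 ;;
    esac
    printf '/db/ncbiblast/%s\n' "$stamp"
}

# Databases downloaded on February 01, 2022:
blastdb_dir 20220201
```

The helper only constructs the path; it does not check that the directory exists on the cluster.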
ml spider ncbiblastdb
You can then load a set of databases with the module load command. For example, to load the databases timestamped June 16, 2021:
module load ncbiblastdb/20210616
Loading this module sets the environment variable BLASTDB to /db/ncbiblast/20210616, which tells BLAST+ to search that directory for databases. You can then refer to a database simply by its name (nt, nr, etc.). Here is an example of a shell script to run BLAST+ on the batch queue using the database nt:
#!/bin/bash
#SBATCH --job-name=j_BLAST+
#SBATCH --partition=batch
#SBATCH --mail-type=ALL
#SBATCH --mail-user=username@uga.edu
#SBATCH --ntasks=1
#SBATCH --mem=10gb
#SBATCH --time=08:00:00
#SBATCH --output=BLAST+.%j.out
#SBATCH --error=BLAST+.%j.err
cd $SLURM_SUBMIT_DIR
module load BLAST+/2.9.0-gompi-2019b
module load ncbiblastdb/20220201
blastn -query example.fasta -out results.out -db nt
In your actual submission script, adjust the Slurm header values (such as the memory, time limit, and email address) to suit your own job.
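Because BLAST+ finds the database through the BLASTDB variable, a job can fail late if the ncbiblastdb module was not loaded. The function below is a minimal sketch of a pre-flight check that could be placed before the blastn line. The name check_blastdb is hypothetical, and the check assumes that a pre-formatted database is stored as files whose names begin with the database name followed by a dot (for example nt.nal); confirm this against the actual directory contents.

```shell
# Hypothetical pre-flight check (illustration only, not part of the GACRC setup).
# Usage: check_blastdb nt
check_blastdb() {
    local db="$1"
    # BLASTDB is set when an ncbiblastdb module is loaded.
    if [ -z "${BLASTDB:-}" ]; then
        echo "BLASTDB is not set; load an ncbiblastdb module first" >&2
        return 1
    fi
    if [ ! -d "$BLASTDB" ]; then
        echo "BLASTDB directory not found: $BLASTDB" >&2
        return 1
    fi
    # Assumption: database files are named <db>.<extension>.
    if ! ls "$BLASTDB"/"$db".* >/dev/null 2>&1; then
        echo "no files for database '$db' in $BLASTDB" >&2
        return 1
    fi
    echo "Using database $db from $BLASTDB"
}

# Example use inside a submission script:
# check_blastdb nt && blastn -query example.fasta -out results.out -db nt
```

Running the check first makes a missing module or a mistyped database name show up immediately in the job's error file.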
NCBI's nr and nt datasets are the most commonly used at the GACRC. GACRC staff update them every month, along with the other NCBI BLAST datasets: cdd_delta, human_genome, mouse_genome, nrte, refseq_protein, refseq_rna, swissprot, and taxdb.
Blast database - old version 5
According to https://ncbiinsights.ncbi.nlm.nih.gov/tag/blastdbv5/, around Feb. 04, 2020 NCBI dropped the _v5 suffix for the v5 Blast databases.
An old version of the v5 databases (downloaded in June 2019) is located in directories with the name pattern /db/ncbiblast.v5/dbname, where dbname is the name of the database, such as nr, nt, taxdb, refseq_rna, swissprot, etc.
These old versions of the nr and nt databases still have the _v5 suffix, i.e. they are called nr_v5 and nt_v5, respectively.
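The sketch below shows one way the old nr and nt databases might be addressed by full path. It assumes that the database files sit directly inside /db/ncbiblast.v5/dbname and carry the _v5 name, which this page implies but does not state outright. The function name v5_db_path is hypothetical, and it deliberately handles only nr and nt, since the naming of the other old databases is not given here.

```shell
# Hypothetical helper (illustration only): print the assumed path of an old
# v5 database. Only nr and nt are covered; other names return an error.
v5_db_path() {
    local name="$1"
    case "$name" in
        nr|nt)
            # Assumed layout: /db/ncbiblast.v5/<dbname>/<dbname>_v5
            printf '/db/ncbiblast.v5/%s/%s_v5\n' "$name" "$name"
            ;;
        *)
            echo "naming for '$name' is not covered here; list /db/ncbiblast.v5/$name to check" >&2
            return 1
            ;;
    esac
}

# Example use in a job script:
# blastn -query example.fasta -out results.out -db "$(v5_db_path nt)"
```

Listing the directory on the cluster before relying on this path is the safest way to confirm the layout.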