BLAST Databases-Sapelo2
Latest revision as of 13:43, 1 November 2024
Blast databases
Blast databases are pre-formatted to work with the commands from BLAST and BLAST+. They are similar to what you could build from FASTA files with the makeblastdb command. Each set of databases is located in a directory named with the pattern /db/ncbiblast/YYYYMMDD, where YYYYMMDD is the download date. For example, databases downloaded on February 1, 2022 are available in /db/ncbiblast/20220201. Starting with the download on August 2, 2024 (i.e., /db/ncbiblast/20240802), each download directory contains the following databases:
nr nt taxdb refseq_rna refseq_protein mouse_genome human_genome swissprot cdd_delta env_nr env_nt tsa_nr tsa_nt ref_prok_rep_genomes ref_euk_rep_genomes ref_viroids_rep_genomes ref_viruses_rep_genomes 16S_ribosomal_RNA 18S_fungal_sequences 28S_fungal_sequences LSU_eukaryote_rRNA LSU_prokaryote_rRNA SSU_eukaryote_rRNA Betacoronavirus ITS_RefSeq_Fungi ITS_eukaryote_sequences mito
Note that the database refseq_genomic is no longer available to download as a pre-formatted database.
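The directory naming convention described above can be sketched as a small shell helper. The function name blastdb_dir is hypothetical and is not part of any GACRC tooling; it only checks that the stamp consists of eight digits, not that the directory actually exists on the cluster.

```shell
# Hypothetical helper: map a download date (YYYYMMDD) to its database directory.
# It validates the format of the stamp only; it does not check the filesystem.
blastdb_dir() {
    case "$1" in
        [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9])
            printf '/db/ncbiblast/%s\n' "$1"
            ;;
        *)
            echo "expected a YYYYMMDD stamp, got: $1" >&2
            return 1
            ;;
    esac
}

blastdb_dir 20240802   # prints /db/ncbiblast/20240802
```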
NCBI BLAST datasets can be loaded just like software modules. Because each module is tied to a timestamped version of the databases, users can replicate earlier results by loading the same version again. The timestamp corresponds to the download date of the dataset. To search for available NCBI BLAST dataset modules, run the command:
ml spider ncbiblastdb
You can then load a set of databases with the module load command. For example, to load the databases timestamped 11/02/2024:
module load ncbiblastdb/20241102
Loading this module sets the environment variable BLASTDB to /db/ncbiblast/20241102, which allows BLAST+ to search that directory for databases. You can then refer to a database by name alone (nt, nr, etc.). Here is an example of a shell script to run BLAST+ on the batch queue using the database nt:
#!/bin/bash
#SBATCH --job-name=j_BLAST+
#SBATCH --partition=batch
#SBATCH --mail-type=ALL
#SBATCH --mail-user=username@uga.edu
#SBATCH --ntasks=1
#SBATCH --mem=10gb
#SBATCH --time=08:00:00
#SBATCH --output=BLAST+.%j.out
#SBATCH --error=BLAST+.%j.err
cd $SLURM_SUBMIT_DIR
module load BLAST+/2.14.1-gompi-2023a
module load ncbiblastdb/20241102
blastn -query example.fasta -out results.out -db nt
In your actual submission script, adjust the Slurm header values (partition, memory, time limit, etc.) to suit your own job.
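Before the blastn line, a job script could confirm that the module really set BLASTDB and that the requested database is present. The following is a minimal sketch under stated assumptions: db_present is a hypothetical helper, and it assumes that a database named nt is stored as files whose names start with "nt." (for example an alias or volume file) in the database directory.

```shell
# Hypothetical helper: succeed if any file named <db>.* exists in <dir>.
# Assumption: database files share the database name as their prefix.
db_present() {
    for f in "$1"/"$2".*; do
        [ -e "$f" ] && return 0
    done
    return 1
}

# Guarded usage: only runs once the ncbiblastdb module has set BLASTDB.
if [ -n "${BLASTDB:-}" ]; then
    if db_present "$BLASTDB" nt; then
        echo "nt found in $BLASTDB"
    else
        echo "nt not found in $BLASTDB" >&2
    fi
fi
```

Because the pattern includes the dot, a check for nt does not match files belonging to a differently named database such as nt_v5.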
NCBI's nr and nt datasets are the most commonly used at the GACRC. GACRC staff will update these datasets every 3 months along with other potentially helpful NCBI BLAST datasets.
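Since the datasets are refreshed periodically, it may help to note which version a given result came from. The sketch below writes the value of BLASTDB to a sidecar file next to the results. The helper name record_blastdb and the .dbinfo suffix are assumptions made for illustration, not a GACRC convention.

```shell
# Hypothetical helper: record the database directory used for a results file.
# Writes a one-line sidecar file named <results>.dbinfo.
record_blastdb() {
    printf 'BLASTDB=%s\n' "${BLASTDB:-unset}" > "$1.dbinfo"
}

# Example use in a job script, after blastn has finished:
#   record_blastdb results.out
```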
Blast database - old version 5
According to https://ncbiinsights.ncbi.nlm.nih.gov/tag/blastdbv5/, NCBI dropped the _v5 suffix from the v5 Blast databases around February 4, 2020.
An old version of the v5 databases (downloaded in June 2019) is located in directories named with the pattern /db/ncbiblast.v5/dbname, where dbname is the name of the database, such as nr, nt, taxdb, refseq_rna, swissprot, etc.
In this old version, the nr and nt databases still carry the _v5 suffix, i.e., they are called nr_v5 and nt_v5, respectively.
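The naming difference for the old v5 databases can be sketched as a helper that builds a value for the -db option. This is an illustration only: v5_db_path is a hypothetical name, and the sketch assumes that the database files sit directly inside /db/ncbiblast.v5/dbname and that databases other than nr and nt keep their plain names. Please check the actual directory contents before relying on it.

```shell
# Hypothetical helper: build a -db path for the June 2019 v5 databases.
# Assumption: files live directly in /db/ncbiblast.v5/<dbname>, and only
# nr and nt carry the _v5 suffix.
v5_db_path() {
    case "$1" in
        nr|nt) printf '/db/ncbiblast.v5/%s/%s_v5\n' "$1" "$1" ;;
        *)     printf '/db/ncbiblast.v5/%s/%s\n' "$1" "$1" ;;
    esac
}

# Example use (not executed here):
#   blastn -query example.fasta -out results.out -db "$(v5_db_path nt)"
```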