Difference between revisions of "BLAST Databases-Sapelo2"

From Research Computing Center Wiki
(added more clarity on what a blast db is and that refseq_genomic is unavailable.)
(removed reference to redundant /db/ncbiblast databases organized by name. Added more info about loading databases with modules)
Line 1: Line 1:
 
[[Category:Sapelo2]][[Category:Software]][[Category:Bioinformatics]][[Category:Bioinformatics Database]]     
 
[[Category:Sapelo2]][[Category:Software]][[Category:Bioinformatics]][[Category:Bioinformatics Database]]     
  
'''Blast database'''
+
'''Blast databases'''
  
Blast databases are pre-formatted to work with blast commands from BLAST and BLAST+. These databases are similar to what you could build using the makeblastdb command with FASTA files. Each database is located at a directory with name pattern as /db/ncbiblast/''dbname''/''month-day-year'', where ''dbname'' is the name of the database, such as nr, nt, taxdb, refseq_rna, refseq_protein, mouse_genome, human_genome, swissprot, etc. The database refseq_genomic is no longer available to download as a pre-formatted database.   
+
Blast databases are pre-formatted to work with the blast commands from BLAST and BLAST+. These databases are similar to what you could build from FASTA files with the makeblastdb command. Each set of databases is stored in a directory named /db/ncbiblast/YYYYMMDD (year, month, day) after its download date. For example, databases downloaded on February 01, 2022 are available in /db/ncbiblast/20220201. The following databases are currently available, each stored together with the other databases downloaded at the same time: nr, nt, taxdb, refseq_rna, refseq_protein, mouse_genome, human_genome and swissprot. The database refseq_genomic is no longer available to download as a pre-formatted database.
  
For example, an nr database downloaded on June 04, 2020 will be available in /db/ncbiblast/nr/06042020.
 
  
Available databases:
<pre class="gscript">
/db/ncbiblast/human_genome/03012021
/db/ncbiblast/mouse_genome/03012021
/db/ncbiblast/nr/03012021
/db/ncbiblast/nt/03012021
/db/ncbiblast/refseq_rna/03012021
/db/ncbiblast/refseq_protein/03012021
/db/ncbiblast/swissprot/03012021
/db/ncbiblast/taxdb/03012021
</pre>
+
NCBI BLAST datasets can be loaded in a similar way to software modules. Each module is time-stamped with the download date of the dataset, so users can replicate results by always using the same version of a database. To search for available NCBI BLAST dataset modules, run the command:
+
<code>ml spider ncbiblastdb</code>
+
You can then load a set of databases with the module load command. For example, to load the databases time-stamped June 16, 2021:
+
<code>module load ncbiblastdb/20210616</code>
+
Loading this module sets the environment variable BLASTDB to /db/ncbiblast/20210616, which tells BLAST+ to search that directory for databases. You can then refer to a database by name alone (nt, nr, etc.). Here is an example of a shell script to run BLAST+ on the batch queue using the database nt:
 +
NCBI's nr and nt datasets are the most commonly used at the GACRC. GACRC staff update them every month, along with the other NCBI BLAST datasets: cdd_delta, human_genome, mouse_genome, nrte, refseq_protein, refseq_rna, swissprot, and taxdb.
  
  

Revision as of 01:17, 17 February 2022


Blast databases

Blast databases are pre-formatted to work with the blast commands from BLAST and BLAST+. These databases are similar to what you could build from FASTA files with the makeblastdb command. Each set of databases is stored in a directory named /db/ncbiblast/YYYYMMDD (year, month, day) after its download date. For example, databases downloaded on February 01, 2022 are available in /db/ncbiblast/20220201. The following databases are currently available, each stored together with the other databases downloaded at the same time: nr, nt, taxdb, refseq_rna, refseq_protein, mouse_genome, human_genome and swissprot. The database refseq_genomic is no longer available to download as a pre-formatted database.


NCBI BLAST datasets can be loaded in a similar way to software modules. Each module is time-stamped with the download date of the dataset, so users can replicate results by always using the same version of a database. To search for available NCBI BLAST dataset modules, run the command:

ml spider ncbiblastdb

You can then load a set of databases with the module load command. For example, to load the databases time-stamped June 16, 2021:

module load ncbiblastdb/20210616

Loading this module sets the environment variable BLASTDB to /db/ncbiblast/20210616, which tells BLAST+ to search that directory for databases. You can then refer to a database by name alone (nt, nr, etc.). Here is an example of a shell script to run BLAST+ on the batch queue using the database nt:
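
A minimal sketch of such a job script, assuming the batch queue is a Slurm partition named batch and that BLAST+ is provided by a module named BLAST+ (check ml spider BLAST+ for the actual name and version). The resource requests and the file names query.fasta and query_vs_nt.tsv are placeholders, not recommended values.

```shell
#!/bin/bash
#SBATCH --job-name=blastn_nt        # job name shown in the queue
#SBATCH --partition=batch           # assumed partition name
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4           # threads made available to blastn
#SBATCH --mem=20G                   # placeholder; searches against nt may need more
#SBATCH --time=12:00:00             # placeholder wall-clock limit
#SBATCH --output=%x.%j.out          # log file named jobname.jobid.out

# Run from the directory the job was submitted from.
cd "${SLURM_SUBMIT_DIR:-.}"

# Assumed module name for BLAST+; adjust to what "ml spider BLAST+" reports.
ml BLAST+
# Sets BLASTDB to /db/ncbiblast/20210616.
ml ncbiblastdb/20210616

QUERY=query.fasta                   # hypothetical nucleotide query file
DB=nt                               # database name only; located through BLASTDB
OUT=query_vs_nt.tsv                 # hypothetical output file
THREADS="${SLURM_CPUS_PER_TASK:-4}" # match the CPUs requested above

# Tabular output (-outfmt 6), one line per hit.
blastn -query "$QUERY" -db "$DB" -out "$OUT" -outfmt 6 -num_threads "$THREADS"
```

Because the module sets BLASTDB, the script passes only the name nt to -db rather than a full path. Loading a different ncbiblastdb module version reruns the same search against a different download of the databases.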


NCBI's nr and nt datasets are the most commonly used at the GACRC. GACRC staff update them every month, along with the other NCBI BLAST datasets: cdd_delta, human_genome, mouse_genome, nrte, refseq_protein, refseq_rna, swissprot, and taxdb.


Blast database - old version 5

According to https://ncbiinsights.ncbi.nlm.nih.gov/tag/blastdbv5/, around Feb. 04, 2020 NCBI dropped the _v5 suffix for the v5 Blast databases.

An old version of the v5 database (downloaded in June 2019) is located in a directory named /db/ncbiblast.v5/dbname, where dbname is the name of the database, such as nr, nt, taxdb, refseq_rna, swissprot, etc.

These old versions of the nr and nt databases still have the _v5 suffix, i.e. they are called nr_v5 and nt_v5, respectively.
