Bioinformatics Databases

From Research Computing Center Wiki
Revision as of 11:34, 29 June 2021 by Keeko (talk | contribs)

As part of its services, the GACRC builds and hosts local copies of frequently cited application data and provides assistance for sharing data among GACRC members. Datasets are located in the commonly shared "/db" filesystem. NCBI BLAST datasets are located in "/db/ncbiblast/" and are organized by date.

NCBI BLAST datasets can be loaded in a similar way to software modules. This lets users replicate results, because a timestamped version of a database always remains available. The timestamp corresponds to the download date of the dataset. To search for available NCBI BLAST+ dataset modules, run the command:

ml spider ncbiblastdb
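
The dataset directories can also be inspected directly. The following is a minimal sketch, assuming that each dataset version under /db/ncbiblast/ is a subdirectory named by its download date in YYYYMMDD form (as in /db/ncbiblast/20210616). The function names and the NCBIBLAST_ROOT variable are illustrative and not part of any GACRC tool:

```shell
#!/bin/bash
# Assumption: dataset versions live in date-named subdirectories (YYYYMMDD).
NCBIBLAST_ROOT="${NCBIBLAST_ROOT:-/db/ncbiblast}"

# Print the subdirectory names that look like YYYYMMDD, oldest first.
list_blastdb_dates() {
    ls -1 "$NCBIBLAST_ROOT" 2>/dev/null | grep -E '^[0-9]{8}$' | sort
}

# Print only the most recent date-stamped directory name.
latest_blastdb_date() {
    list_blastdb_dates | tail -n 1
}
```

Calling list_blastdb_dates prints one date per line. For reproducible results it is usually better to note the date you used and load that exact version, rather than always loading whatever latest_blastdb_date returns.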

You can then load a database with the module load command. For example, to load the databases timestamped 06/16/2021:

module load ncbiblastdb/20210616

Loading this module sets the environment variable BLASTDB to /db/ncbiblast/20210616, which tells BLAST+ to search that directory for databases. You can then refer to a database by name alone (nt, nr, etc.). Here is an example of a shell script to run BLAST on the batch queue using the database nt:

#!/bin/bash
#SBATCH --job-name=j_BLAST+
#SBATCH --partition=batch
#SBATCH --mail-type=ALL
#SBATCH --mail-user=username@uga.edu
#SBATCH --ntasks=1
#SBATCH --mem=10gb
#SBATCH --time=08:00:00
#SBATCH --output=BLAST+.%j.out
#SBATCH --error=BLAST+.%j.err

cd $SLURM_SUBMIT_DIR
module load BLAST+/2.9.0-gompi-2019b
module load ncbiblastdb/20210616
blastn [options] -db nt

In a real submission script, the placeholder values above (for example the job name, email address, resource requests, module versions, and the blastn options) need to be reviewed and replaced with values appropriate for your job.
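
Before calling blastn in a job like the one above, it can help to confirm that BLASTDB is set and that the requested database is present in that directory. The sketch below is one possible check, not a GACRC-provided tool. The function name check_blastdb is hypothetical, and the check relies on the assumption that the files of a BLAST database share the database name as a prefix (for example nt.nal or nt.00.nsq):

```shell
#!/bin/bash
# Sketch: verify that BLASTDB is usable before starting a long BLAST run.
check_blastdb() {
    local db_name="$1"
    if [ -z "${BLASTDB:-}" ]; then
        echo "BLASTDB is not set; load an ncbiblastdb module first" >&2
        return 1
    fi
    if [ ! -d "$BLASTDB" ]; then
        echo "BLASTDB directory not found: $BLASTDB" >&2
        return 1
    fi
    # Assumption: database files are named <db_name>.<extension>.
    if ! ls "$BLASTDB"/"$db_name".* >/dev/null 2>&1; then
        echo "No files for database '$db_name' in $BLASTDB" >&2
        return 1
    fi
    echo "Found $db_name in $BLASTDB"
}

# Example use in a job script, after the module load lines:
# check_blastdb nt || exit 1
```

Failing early this way avoids waiting in the queue only to have the job stop on a missing or misspelled database name.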

NCBI's nr and nt datasets are the most commonly used at the GACRC. GACRC staff update these datasets every month, along with the other NCBI BLAST datasets: cdd, human_genome, mouse_genome, nrte, refseq_protein, refseq_rna, swissprot, and taxdb.

Various subject datasets, e.g. pfam, bowtie indexes of human and mouse, and NCBI bacterial datasets, are hosted as well. However, these datasets are not accessed with a module, because they do not need updating, their source does not update them frequently, or they are not used frequently by our users. They are updated only on user request. To use these databases, give the path to the database files directly in your script instead of loading a module.
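
As an illustration of path-based access, the sketch below resolves a dataset directory under /db and stops with a message if it does not exist. The helper name dataset_path, the DB_ROOT variable, and the directory name pfam used in the example are assumptions for illustration. The actual directory names should be checked with ls /db:

```shell
#!/bin/bash
# Assumption: module-less datasets live in subdirectories of /db.
DB_ROOT="${DB_ROOT:-/db}"

# Print the full path of a dataset directory, or fail if it is missing.
dataset_path() {
    local name="$1"
    local dir="$DB_ROOT/$name"
    if [ ! -d "$dir" ]; then
        echo "Dataset directory not found: $dir" >&2
        return 1
    fi
    printf '%s\n' "$dir"
}

# Example use in a job script (directory name is hypothetical):
# PFAM_DIR=$(dataset_path pfam) || exit 1
# ls "$PFAM_DIR"
```

Keeping the path in one variable near the top of the script makes it easy to see which copy of the data a job used.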


For datasets requested by individual lab groups, the GACRC encourages users to maintain their own copies of the databases, whose files can be installed in their group's shared "/work" area. Similarly, for datasets that are frequently updated by their source, the GACRC encourages users to maintain their own copies of these public databases.
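
One way to keep such a personal copy organized is to mirror the date-stamped layout used for /db/ncbiblast/. The following is only a sketch under stated assumptions: WORK_DB_ROOT and the lab name in it are hypothetical placeholders, and the commented download step uses update_blastdb.pl, a script distributed with NCBI BLAST+. Whether that script is on your PATH after loading the BLAST+ module, and whether the node you run it on has outbound network access, should be verified first:

```shell
#!/bin/bash
# Hypothetical location; replace with your own group's /work directory.
WORK_DB_ROOT="${WORK_DB_ROOT:-/work/mylab/blastdb}"

# Create a directory named after today's date (YYYYMMDD) and print its path.
make_dated_dir() {
    local stamp
    stamp=$(date +%Y%m%d)
    local dir="$WORK_DB_ROOT/$stamp"
    mkdir -p "$dir" || return 1
    printf '%s\n' "$dir"
}

# Example (not run here): download swissprot into the dated directory,
# then point BLAST+ at the private copy instead of a module.
# dir=$(make_dated_dir) || exit 1
# (cd "$dir" && update_blastdb.pl --decompress swissprot)
# export BLASTDB="$dir"
```

Keeping each download in its own dated directory preserves the ability to rerun an analysis against the same database version later.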

The set of data which are regularly updated is open to review, and can be expanded based on available GACRC resources.



Installed Bioinformatics Databases

Name                    Version                     Module Available
akblab                  01/28/2018                  no
Bacteria NCBI           12/21/2017                  no
hg19,mm10 & mm9         07/29/2016                  no
cellranger              10/29/2019                  no
conspred_ressources     02/02/2017                  no
dammit                  03/18/2021                  no
deconseq                12/01/2016                  no
decontaMiner            08/29/2019                  no
eggnog-mapper           04/16/2019                  no
funannotate             08/15/2019                  no
GTDB-Tk                 01/15/2021                  no
hg                      09/20/2016                  no
kegg                    09/09/2016                  no
maker                   05/22/2018                  no
MetaCLADE               08/29/2019                  no
NCBI BLAST Databases    beginning of every month    yes
NCBI Fasta              beginning of every month    yes
ngs                     11/03/2016                  no
nndb                    04/29/2020                  no
PB                      04/24/2019                  no
pfam                    03/02/2020                  no
phylosift               07/20/2020                  no
Refseq_genomic                                      no
repbase                 01/09/2017                  no
rfam                    02/09/2017                  no
seqdb                   05/23/2019                  no
sortmerna               02/02/2021                  no
TaxDB                   02/26/2019                  no
topcons2                02/18/2021                  no
Uniprot                 11/19/2020                  no
Uniref                  06/28/2018                  no
wublast                 03/01/2020                  no