Difference between revisions of "Bioinformatics Databases"

Revision as of 12:05, 28 June 2021

As part of our services, the GACRC builds and hosts local copies of frequently cited application data, and provides assistance for sharing data among GACRC members. Datasets are organized by date and are uploaded at the beginning of each month. They are located in the commonly shared "/work/db" filesystem.

Datasets can be loaded in a similar way to software modules. This lets users replicate results by always using a timestamped version of a database. To search for available dataset modules, run the command:

ml spider ncbiblastdb

You can then load a database with the module load command. For example, to load the databases timestamped 06/16/2021:

module load ncbiblastdb/20210616

Loading this module sets the environment variable BLASTDB to /db/ncbiblast/20210616. You can then refer to the database you want by its name alone. Here is an example of a shell script that runs BLAST on the batch queue using the nt database:

#!/bin/bash
#SBATCH --job-name=j_BLAST+
#SBATCH --partition=batch
#SBATCH --mail-type=ALL
#SBATCH --mail-user=username@uga.edu
#SBATCH --ntasks=1
#SBATCH --mem=10gb
#SBATCH --time=08:00:00
#SBATCH --output=BLAST+.%j.out
#SBATCH --error=BLAST+.%j.err

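# Run from the directory the job was submitted from
cd "$SLURM_SUBMIT_DIR"
# Load BLAST+ and the timestamped database module (sets BLASTDB)
module load BLAST+/2.9.0-gompi-2019b
module load ncbiblastdb/20210616
# Replace [options] with your own blastn settings
blastn [options] -db nt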

In a real submission script, at least the following values need to be reviewed or replaced with proper values: the --mail-user address, the --ntasks, --mem, and --time settings, and the blastn [options].
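Before the blastn step runs, it can help to confirm that BLASTDB is set and that the chosen database is actually present. The following is a minimal sketch, not part of any GACRC tooling: the function name check_blastdb is hypothetical, and it assumes that the files of a database start with the database name followed by a dot (for example nt.00.nhr).

```shell
#!/bin/bash
# Hypothetical helper (an illustration, not a GACRC-provided command):
# confirm that a database name resolves to files under $BLASTDB.
check_blastdb() {
    local db_name="$1"
    # BLASTDB is expected to be set by loading an ncbiblastdb module.
    if [ -z "${BLASTDB:-}" ]; then
        echo "BLASTDB is not set; load an ncbiblastdb module first" >&2
        return 1
    fi
    if [ ! -d "$BLASTDB" ]; then
        echo "BLASTDB directory not found: $BLASTDB" >&2
        return 1
    fi
    # Assumption: database files are named <db_name>.<something>.
    if ! ls "$BLASTDB"/"$db_name".* >/dev/null 2>&1; then
        echo "no files for database '$db_name' in $BLASTDB" >&2
        return 1
    fi
    echo "found database '$db_name' in $BLASTDB"
}
```

In a job script, a line such as `check_blastdb nt || exit 1` placed after the module load lines would stop the job early instead of letting blastn fail later.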

Of the public local databases in /work/db, NCBI's nr and nt datasets are the most commonly used at the GACRC. GACRC staff update these datasets every month, along with the cdd, human_genome, mouse_genome, nrte, refseq_protein, refseq_rna, swissprot, and taxdb datasets.

Various subject datasets, e.g. pfam, bowtie indexes of human and mouse, and NCBI bacterial datasets, are hosted as well. These datasets either do not need updating, are updated infrequently by their source, or are not used frequently by our users. They will only be updated by user request, at which point a module will be created so that the user can replicate results in the future.
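One way to make a run easier to replicate later is to record which database location was in use when it ran. The sketch below is only an illustration: the function name record_blastdb and the default log file name blastdb_version.txt are hypothetical, and it assumes the loaded module has set BLASTDB as described above.

```shell
#!/bin/bash
# Hypothetical helper (an illustration, not a GACRC-provided command):
# append the database name and its location to a small log file.
record_blastdb() {
    local db_name="$1"
    local log_file="${2:-blastdb_version.txt}"
    # If no database module is loaded, record "unset" rather than an
    # empty field so the gap is visible in the log.
    printf '%s\t%s\n' "$db_name" "${BLASTDB:-unset}" >> "$log_file"
}
```

Calling `record_blastdb nt` in the job script after the module load lines appends one tab-separated line per run, which can be kept alongside the results.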

For datasets requested by individual lab groups, GACRC staff will assist in setting up a group-shared environment and request that group members maintain their database files there.

For datasets frequently updated by their source, the GACRC encourages users to maintain their own copies of these public databases.

The set of regularly updated datasets is open to review and can be expanded based on available GACRC resources.

Installed Bioinformatics Databases

Name	Version	Cluster
akblab	01/28/2018	Sapelo2
Bacteria NCBI	12/21/2017	Sapelo2
hg19,mm10 & mm9	07/29/2016	Sapelo2
cellranger	10/29/2019	Sapelo2
conspred_ressources	02/02/2017	Sapelo2
gss		Sapelo2
htgs		Sapelo2
hg		Sapelo2
NCBI BLAST Database	every other month	Sapelo2
NCBI Fasta	every other month	Sapelo2
pfam	27.0	Sapelo2
Refseq		Sapelo2
TaxDB		Sapelo2
Uniprot	06/28/2018	Sapelo2
Uniref	06/28/2018	Sapelo2
wublast		Sapelo2
decontaMiner	08/30/2019	Sapelo2

