UBCG-Sapelo2
Category
Bioinformatics
Program On
Sapelo2
Version
3.0
Author / Distributor
Please see https://www.ezbiocloud.net/tools/ubcg
Description
The 16S rRNA gene has played an essential role in bacterial taxonomy by providing a universally applicable phylogenetic framework. However, this single gene contains only about 1,500 bp, which inherently limits the resolution of the analysis. Here, we present the set of bacterial core genes that covers all phyla, which we named UBCG (up-to-date bacterial core gene). The current UBCG set, consisting of 92 genes, was calculated using complete genomes of 1,492 species covering 28 phyla.
The package provides the following features:
- Extraction of UBCGs from genome assemblies
- Multiple-alignment of 92 gene sequences
- Concatenation of 92 gene sequences
- Filtering positions of multiple-alignments
- Phylogenetic analysis using RAxML and FastTree
- Calculation of Gene Support Index (GSI) which indicates how many genes support the branch in the concatenated phylogenetic tree (named UBCG tree)
Running Program
- Version 3.0 is installed at /apps/eb/UBCG/3.0-foss-2019b-Java-1.8.0_144
To use version 3.0, please load the module with
ml UBCG/3.0-foss-2019b-Java-1.8.0_144
Here is an example of a shell script, sub.sh, to run on the batch queue:
#!/bin/bash
#SBATCH --job-name=ubcg
#SBATCH --partition=batch
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=10gb
#SBATCH --time=2:00:00
#SBATCH --output=log.%j.out
#SBATCH --error=log.%j.err
#SBATCH --mail-user=username@uga.edu
#SBATCH --mail-type=ALL
cd $SLURM_SUBMIT_DIR
ml UBCG/3.0-foss-2019b-Java-1.8.0_144
#copy these two files to your working directory for UBCG to run properly
cp $EBROOTUBCG/UBCG.hmm ./
cp $EBROOTUBCG/programPath ./
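#extract uses 8 threads by default; match the thread count to the CPUs requested above
java -jar $EBROOTUBCG/UBCG.jar extract -i <sequence_file>.fa -bcg_dir <output_directory> -label "<Genus species>" -t $SLURM_CPUS_PER_TASK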
In a real submission script, review all of the values above and replace the placeholders (such as username@uga.edu, <sequence_file>, <output_directory>, and <Genus species>) with values appropriate for your job.
Here is an example of a job submission command:
sbatch ./sub.sh
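A tree needs several genomes, so the extract step is usually repeated once per genome before the align step is run on the shared bcg directory. The following is a minimal sketch of that two-step workflow, not an official GACRC or UBCG script. The directory names (genomes, bcg), the run prefix, the rule that derives a label from the file name, and the helper functions run, label_from_file, extract_all, and align_all are all hypothetical; only the extract and align options listed in the help text under Documentation are used. As in sub.sh, UBCG.hmm and programPath still need to be copied to the working directory first.

```shell
#!/bin/bash
# Sketch only: helper names, directory layout, and label rule are assumptions.
# With DRY_RUN=1 (the default here) commands are printed instead of executed.
DRY_RUN=${DRY_RUN:-1}
UBCG_JAR="$EBROOTUBCG/UBCG.jar"   # EBROOTUBCG is set when the UBCG module is loaded

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Turn a file name such as Escherichia_coli.fa into the label "Escherichia coli".
label_from_file() {
    local base
    base=$(basename "$1" .fa)
    printf '%s\n' "${base//_/ }"
}

# Step 1: one extract call per genome; all bcg files go to the same directory.
extract_all() {
    local genome_dir=$1 bcg_dir=$2 fasta
    for fasta in "$genome_dir"/*.fa; do
        [ -e "$fasta" ] || continue   # skip when the glob matches nothing
        run java -jar "$UBCG_JAR" extract -i "$fasta" -bcg_dir "$bcg_dir" \
            -label "$(label_from_file "$fasta")" -t "${SLURM_CPUS_PER_TASK:-1}"
    done
}

# Step 2: align the genes and infer the tree from every bcg file in that directory.
align_all() {
    local bcg_dir=$1 prefix=$2
    run java -jar "$UBCG_JAR" align -bcg_dir "$bcg_dir" -prefix "$prefix" \
        -t "${SLURM_CPUS_PER_TASK:-1}"
}
```

In a job script these functions could be called as extract_all genomes bcg followed by align_all bcg myrun, first with DRY_RUN=1 to inspect the generated commands (the dry-run output does not show the quoting around labels that contain spaces) and then with DRY_RUN=0 to run them.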
Documentation
ml UBCG/3.0-foss-2019b-Java-1.8.0_144
java -jar $EBROOTUBCG/UBCG.jar --help

###############################
#        UBCG ver. 3.0        #
###############################

UBCG is a program for phylogenetic analysis using the single-copy bacterial core genes
The external programs that are used in the UBCG should be installed
The locations of programs should be written in 'programPath' file

Basic options
  -h : Show usage and options (--help)

---- Running phylogenetic analysis workflow through two-step command ----

<Step 1> Searching UBCGs from a genome sequence
java -jar UBCG.jar extract [-i genome_file] [-bcg_dir bcg_directory] [-label full_label]

Mandatory arguments
  -i <String> : Name of input fasta file containing contigs
  -bcg_dir <String> : Directory for saving a bcg file
  -label <String> : Full label of the strain/genome

Optional arguments
  -g <Integer> : A translation table to be used for translation (-g parameter used in Prodigal)
      Use this option when you enter a genome of species which uses other genetic code
      Most of the bacteria uses the 11 table (Default : 11, the bacterial and archaeal code)
  -t <Integer> : Number of threads to be used (Default : 8)

Metadata of input genome(Optional)
  -taxon <String> : Name of the species
  -strain <String> : Name of the strain
  -type : Add this if the strain is a type strain of species or subspecies
  -acc <String> : Accession of genome sequence. Usually NCBI's assembly accession is used for public data
  -uid <Integer> : This is a unique integer id. If you do not designate, one will be automatically generated
  -taxonomy <String> : Taxonomy of the species

<Step 2> Aligning each UBCG and concatenating them and inferring a phylogeny
The external program used for phylogeny reconstruction can be FastTree or RAxML
java -jar UBCG.jar align [-bcg_dir bcg_directory]

Mandatory arguments
  -bcg_dir <String> : Directory of bcg files that you want to include in the alignment

Optional arguments
  -out_dir <String> : Directory where all output files will be (Default : "output")
  -prefix <String> : A prefix is appended to all output files to recognize each different run
  -a <String> : Type of sequences to be aligned and used for the phylogenetic analysis (Default : codon)
      Enter one among the below options
      nt : Use nucleotide sequences
      codon : Use nucleotide sequences that are aligned based on amino acid alignments
      codon12 : Use only 1st & 2nd positions of codon
      aa : Use amino acid sequences
  -m <String> : A model used to infer trees (Default : JTT+CAT for a protein alignment / GTR+CAT for a nucleotide alignment)
      --Models (See RAxML or FastTree manual for the detailed information)
      For RAxML
        - NUCLEOTIDE sequences
          GTRCAT[X], GTRCATI[X], ASC_GTRCAT[X], GTRGAMMA[X], ASC_GTRGAMMA[X], GTRGAMMAI[X]
        - AMINO ACID sequences
          PROTCATmatrixName[F|X], PROTCATImatrixName[F|X], ASC_PROTCATmatrixName[F|X], PROTGAMMAmatrixName[F|X], ASC_PROTGAMMAmatrixName[F|X], PROTGAMMAImatrixName[F|X]
          Available aa matrixName: DAYHOFF, DCMUT, JTT, MTREV, WAG, RTREV, CPREV, VT, BLOSUM62, MTMAM, LG, MTART, MTZOA, PMB, HIVB, HIVW, JTTDCMUT, FLU, STMTREV, DUMMY, DUMMY2, AUTO, LG4M, LG4X, PROT_FILE, GTR_UNLINKED, GTR
          (optional appendix "F": Use empirical base frequencies)
          (optional appendix "X": Use a ML estimate of base frequencies)
      For FastTree
        - NUCLEOTIDE sequences
          JCcat, GTRcat, JCgamma, GTRgamma
        - AMINO ACID sequences
          JTTcat, LGcat, WAGcat, JTTgamma, LGgamma, WAGgamma
  -t <Integer> : Number of threads to be used (Default : 1)
  -f <Value> : Filter gap-containing positions
      Enter a value between 1~100 (Default: 50)
      ex) 30 : select the positions that have bases of 30% or more (remove the positions composed of more than 70% gap characters)
  -raxml : Use RAxML for phylogeny reconstruction (Default: FastTree)
  -zZ : Make zZ-formatted files
      This additionally creates fasta/nwk files with zZ'uid'zZ format for names of each genome
  -gsi_threshold: The threshold used for GSI calculations (1~100)
      Even if the exact bipartition doesn't exist in a gene tree, it is regarded as a supported bipartition based on their similiarity
      It is when the number of genomes more than specified threshold(percentage) of all genomes(leaves) support the topology of bipartition
      A value of 95 or higher is recommended (Default: 95)

---- Replacing labels -----
Replacing labels in a tree using metadata
java -jar UBCG.jar replace <run_id..trm> <gene>
  ex) java -jar UBCG.jar replace 'run_id'.trm UBCG -taxon -strain
      Making a UBCG tree file that uses both taxon_name and strain_name as labels
  ex) java -jar UBCG.jar replace 'run_id'.trm rpoB -acc -taxon -type
      Making a rpoB tree file that uses accession, taxon_name and strain_name as labels

Parameters
  <run_id.trm> : A file containing trees with metadata. This file is automatically generated in the step 2
  <gene> : The gene trees or UBCG tree you want to replace
  -uid : add uids
  -acc : add accessions
  -label : add labels
  -taxon : add taxon_names
  -strain : add strain_names
  -type : add type_info
  -taxonomy : add taxonomy

---- Showing the content of bcg files ----
The content of bcg files (for example, gene sequences) can be viewed (as csv format that is readable by Microsoft Excel) by using the following command:
java -jar UBCG.jar view -i <a bcg file>
java -jar UBCG.jar view -d <directory containing bcg files>
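The -f rule in the help text can be easier to follow with a worked example. The sketch below is an illustration only and is not taken from UBCG's code: it assumes that a column is kept when the percentage of non-gap characters is at least the threshold, as the "30% or more" example suggests. The function name columns_kept and the one-sequence-per-line input format with '-' as the gap character are hypothetical.

```shell
#!/bin/bash
# Illustration only (not UBCG code): list the alignment columns that would pass
# a given -f threshold under the rule quoted in the help text.
# Input on stdin: aligned sequences, one per line, with '-' as the gap character.
columns_kept() {
    local threshold=$1
    awk -v thr="$threshold" '
        {
            n++                                   # number of sequences
            for (i = 1; i <= length($0); i++)
                if (substr($0, i, 1) != "-") bases[i]++
            if (length($0) > width) width = length($0)
        }
        END {
            out = ""
            for (i = 1; i <= width; i++)
                # keep the column if non-gap characters make up >= thr percent
                if (bases[i] * 100 >= thr * n)
                    out = (out == "" ? i : out " " i)
            print out
        }
    '
}
```

For the toy alignment ACG-, A-G-, A---, A--T, the four columns contain 100%, 25%, 50%, and 25% bases. Running printf 'ACG-\nA-G-\nA---\nA--T\n' | columns_kept 50 prints 1 3, and lowering the threshold to 25 keeps all four columns.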
Installation
Source code was downloaded from https://www.ezbiocloud.net/tools/ubcg
System
64-bit Linux