UBCG-Sapelo2

From Research Computing Center Wiki

Category

Bioinformatics

Program On

Sapelo2

Version

3.0

Author / Distributor

Please see https://www.ezbiocloud.net/tools/ubcg

Description

The 16S rRNA gene has played an essential role in bacterial taxonomy by providing a universally applicable phylogenetic framework. However, this single gene contains only about 1,500 bp, which inherently limits the resolution of the analysis. Here, we present the set of bacterial core genes that covers all phyla, which we named UBCG (up-to-date bacterial core gene). The current UBCG set was calculated using complete genomes of 1,492 species covering 28 phyla, and consists of 92 genes.

The package provides the following features:

  • Extraction of UBCGs from genome assemblies
  • Multiple alignment of 92 gene sequences
  • Concatenation of 92 gene sequences
  • Filtering positions of multiple alignments
  • Phylogenetic analysis using RAxML and FastTree
  • Calculation of the Gene Support Index (GSI), which indicates how many genes support a branch in the concatenated phylogenetic tree (named the UBCG tree)

Running Program

  • Version 3.0 is installed at /apps/eb/UBCG/3-foss-2022a-Java-8.402

To use version 3.0, please load the module with

ml UBCG/3-foss-2022a-Java-8.402


Here is an example of a shell script, sub.sh, to run on the batch queue:

#!/bin/bash
#SBATCH --job-name=ubcg
#SBATCH --partition=batch
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=10gb
#SBATCH --time=2:00:00
#SBATCH --output=log.%j.out
#SBATCH --error=log.%j.err
#SBATCH --mail-user=username@uga.edu
#SBATCH --mail-type=ALL

cd $SLURM_SUBMIT_DIR
ml UBCG/3-foss-2022a-Java-8.402

#copy these two files to your working directory for UBCG to run properly

cp $EBROOTUBCG/UBCG.hmm ./
cp $EBROOTUBCG/programPath ./

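# -t defaults to 8 threads; match it to the CPUs requested with --cpus-per-task
java -jar $EBROOTUBCG/UBCG.jar extract -i <sequence_file>.fa -bcg_dir <output_directory> -label "<Genus species>" -t $SLURM_CPUS_PER_TASK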

In a real submission script, review at least the values shown above (for example the job name, resource requests, e-mail address, and the placeholders in angle brackets) and replace them with values appropriate for your job.
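
The example above extracts UBCGs from a single genome, while a tree needs one bcg file per genome. The following is a minimal sketch of how the extract step could be repeated for every FASTA file in a directory. It assumes the genomes are named Genus_species.fa, that UBCG.hmm and programPath have already been copied to the working directory as shown above, and that the helper names (label_from_filename, extract_all, UBCG_RUN) are illustrative only; none of these come from UBCG itself.

```shell
#!/bin/bash
# Sketch: run "UBCG extract" on every FASTA file in a directory.
# Assumptions (not from UBCG): genomes are named Genus_species.fa,
# and the helper names below are illustrative only.

# Command used to launch UBCG; set UBCG_RUN="echo" for a dry run.
UBCG_RUN=${UBCG_RUN:-"java -jar $EBROOTUBCG/UBCG.jar"}

# Turn "path/Genus_species.fa" into the label "Genus species".
label_from_filename() {
    local base
    base=$(basename "$1" .fa)
    echo "${base//_/ }"
}

# Run the extract step once per genome, writing all bcg files to one directory.
extract_all() {
    local genomes_dir=$1 bcg_dir=$2 fasta
    mkdir -p "$bcg_dir"
    for fasta in "$genomes_dir"/*.fa; do
        [ -e "$fasta" ] || continue   # skip if the glob matched nothing
        $UBCG_RUN extract -i "$fasta" -bcg_dir "$bcg_dir" \
            -label "$(label_from_filename "$fasta")" \
            -t "${SLURM_CPUS_PER_TASK:-1}"
    done
}
```

In sub.sh, these functions could replace the single java line, followed by a call such as extract_all genomes bcg, where genomes and bcg are placeholder directory names.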


Here is an example of a job submission command:

sbatch ./sub.sh 
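
Because UBCG works in two steps (extract, then align), it can be convenient to submit the second step so that it starts only after the first has finished successfully. The sketch below uses the standard Slurm options --parsable and --dependency=afterok for this. The script names extract.sh and align.sh and the function name submit_chain are hypothetical and are not provided on Sapelo2.

```shell
#!/bin/bash
# Sketch: submit the extract job, then an align job that waits for it.
# extract.sh and align.sh are hypothetical script names, not files
# shipped with UBCG or provided on the cluster.

# Submission command; kept in a variable so it can be stubbed for testing.
SBATCH=${SBATCH:-sbatch}

submit_chain() {
    local extract_script=$1 align_script=$2 extract_id
    # --parsable makes sbatch print only the job ID (plus ";cluster" if any).
    extract_id=$($SBATCH --parsable "$extract_script") || return 1
    extract_id=${extract_id%%;*}
    # afterok: start the align job only if the extract job ended successfully.
    $SBATCH --dependency=afterok:"$extract_id" "$align_script"
}
```

A call such as submit_chain extract.sh align.sh would then queue both jobs at once.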

Documentation

ml UBCG/3-foss-2022a-Java-8.402
java -jar $EBROOTUBCG/UBCG.jar --help


          ###############################
          #         UBCG ver. 3.0       #
          ###############################


UBCG is a program for phylogenetic analysis using the single-copy bacterial core genes

The external programs that are used in the UBCG should be installed
The locations of programs should be written in 'programPath' file

Basic options
-h : Show usage and options (--help)


---- Running phylogenetic analysis workflow through two-step command ----

<Step 1>
Searching UBCGs from a genome sequence

java -jar UBCG.jar extract [-i genome_file] [-bcg_dir bcg_directory] [-label full_label]

Mandatory arguments
-i       <String> : Name of input fasta file containing contigs
-bcg_dir <String> : Directory for saving a bcg file
-label   <String> : Full label of the strain/genome

Optional arguments
-g <Integer> : A translation table to be used for translation (-g parameter used in Prodigal)
               Use this option when the genome belongs to a species that uses a different genetic code
               Most bacteria use table 11
              (Default : 11, the bacterial and archaeal code)
-t <Integer> : Number of threads to be used
              (Default : 8)

Metadata of input genome(Optional)
-taxon    <String> : Name of the species
-strain   <String> : Name of the strain
-type              : Add this if the strain is a type strain of species or subspecies
-acc      <String> : Accession of genome sequence. Usually NCBI's assembly accession is used for public data
-uid     <Integer> : This is a unique integer id. If you do not designate, one will be automatically generated
-taxonomy <String> : Taxonomy of the species


<Step 2>
Aligning each UBCG and concatenating them and inferring a phylogeny
The external program used for phylogeny reconstruction can be FastTree or RAxML

java -jar UBCG.jar align [-bcg_dir bcg_directory]

Mandatory arguments
-bcg_dir <String> : Directory of bcg files that you want to include in the alignment

Optional arguments
-out_dir <String> : Directory where all output files will be
                   (Default : "output")
-prefix  <String> : A prefix is appended to all output files to recognize each different run
-a  <String>  : Type of sequences to be aligned and used for the phylogenetic analysis
               (Default : codon)
            Enter one among the below options
                 nt : Use nucleotide sequences
              codon : Use nucleotide sequences that are aligned based on amino acid alignments
            codon12 : Use only 1st & 2nd positions of codon
                 aa : Use amino acid sequences
-m  <String>  : A model used to infer trees
               (Default : JTT+CAT for a protein alignment / GTR+CAT for a nucleotide alignment)
                        --Models (See RAxML or FastTree manual for the detailed information)
                       For RAxML - NUCLEOTIDE sequences
                                   GTRCAT[X], GTRCATI[X], ASC_GTRCAT[X],
                                   GTRGAMMA[X], ASC_GTRGAMMA[X], GTRGAMMAI[X]
                                 - AMINO ACID sequences
                                   PROTCATmatrixName[F|X], PROTCATImatrixName[F|X],
                                   ASC_PROTCATmatrixName[F|X], PROTGAMMAmatrixName[F|X],
                                   ASC_PROTGAMMAmatrixName[F|X], PROTGAMMAImatrixName[F|X]
                         Available aa matrixName: DAYHOFF, DCMUT, JTT, MTREV, WAG, RTREV, CPREV, VT, 
                                                  BLOSUM62, MTMAM, LG, MTART, MTZOA, PMB, HIVB, HIVW, 
                                                  JTTDCMUT, FLU, STMTREV, DUMMY, DUMMY2, AUTO, LG4M, 
                                                  LG4X, PROT_FILE, GTR_UNLINKED, GTR
                         (optional appendix "F": Use empirical base frequencies)
                         (optional appendix "X": Use a ML estimate of base frequencies)
                    For FastTree - NUCLEOTIDE sequences
                                   JCcat, GTRcat, JCgamma, GTRgamma
                                 - AMINO ACID sequences
                                   JTTcat, LGcat, WAGcat, JTTgamma, LGgamma, WAGgamma
-t  <Integer> : Number of threads to be used
               (Default : 1)
-f  <Value>   : Filter gap-containing positions
                Enter a value between 1~100
               (Default: 50)
               ex) 30 : select the positions that have bases of 30% or more 
               (remove the positions composed of more than 70% gap characters)
-raxml        : Use RAxML for phylogeny reconstruction
               (Default: FastTree)
-zZ           : Make zZ-formatted files
                This additionally creates fasta/nwk files with zZ'uid'zZ format for names of each genome
-gsi_threshold: The threshold used for GSI calculations (1~100)
                Even if the exact bipartition doesn't exist in a gene tree, it is regarded as a supported bipartition based on their similarity
                This applies when more than the specified threshold (percentage) of all genomes (leaves) support the topology of the bipartition
                A value of 95 or higher is recommended
               (Default: 95)

---- Replacing labels -----
Replacing labels in a tree using metadata

java -jar UBCG.jar replace <run_id.trm> <gene>
ex) java -jar UBCG.jar replace 'run_id'.trm UBCG -taxon -strain
  Making a UBCG tree file that uses both taxon_name and strain_name as labels
ex) java -jar UBCG.jar replace 'run_id'.trm rpoB -acc -taxon -type
  Making a rpoB tree file that uses accession, taxon_name and type_info as labels

Parameters
<run_id.trm> : A file containing trees with metadata. This file is automatically generated in the step 2
      <gene> : The gene trees or UBCG tree you want to replace 
       -uid  : add uids
       -acc  : add accessions
     -label  : add labels
     -taxon  : add taxon_names
    -strain  : add strain_names
      -type  : add type_info
   -taxonomy : add taxonomy

---- Showing the content of bcg files ----
The content of bcg files (for example, gene sequences) can be viewed (as csv format that is readable by Microsoft Excel) by using the following command:

java -jar UBCG.jar view -i <a bcg file>
java -jar UBCG.jar view -d <directory containing bcg files>
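
Based on the options listed above, step 2 could be wrapped for a batch job roughly as follows. This is only a sketch: the function name run_align, the UBCG_RUN variable, and the directory and prefix arguments are placeholders. The -a codon and -f 50 values are the documented defaults, written out explicitly, and FastTree is used because -raxml is not given.

```shell
#!/bin/bash
# Sketch of step 2: align the extracted genes and infer a tree.
# The function name, directory names and prefix are placeholders.

# Command used to launch UBCG; set UBCG_RUN="echo" for a dry run.
UBCG_RUN=${UBCG_RUN:-"java -jar $EBROOTUBCG/UBCG.jar"}

run_align() {
    local bcg_dir=$1 out_dir=$2 prefix=$3
    # -a codon and -f 50 are the documented defaults, written out for clarity.
    # Without -raxml, UBCG uses FastTree for the phylogeny.
    $UBCG_RUN align -bcg_dir "$bcg_dir" -out_dir "$out_dir" \
        -prefix "$prefix" -a codon -f 50 \
        -t "${SLURM_CPUS_PER_TASK:-1}"
}
```

For example, run_align bcg output run1 would align every bcg file in the bcg directory and write the results to output. Adding -raxml to the align command would switch the tree inference to RAxML.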



Installation

Source code is downloaded from https://www.ezbiocloud.net/tools/ubcg

System

64-bit Linux