PASTA-Teaching

From Research Computing Center Wiki
Revision as of 14:31, 1 April 2019 by Pakala (talk | contribs) (Created page with "Category:TeachingCategory:SoftwareCategory:Bioinformatics === Category === Bioinformatics === Program On === Teaching === Version === 1.8.2 ===Author / D...")
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigation Jump to search


Category

Bioinformatics

Program On

Teaching

Version

1.8.2

Author / Distributor

[1]

Description

PASTA (Practical Alignment using SATe and Transitivity.

Running Program

Also refer to Running Jobs on the teaching cluster

For more information on Environment Modules please see the Lmod page.

  • Version 1.8.2, installed in /usr/local/apps/gb/PASTA/1.8.2

To use this version of PASTA, please first load the module with


module load PASTA/1.8.2-foss-2016b-Python-2.7.14

This module sets up the path to use PASTA with Python 2.7.14

Sample job submission script (sub.sh) to run run_pasta.py

#!/bin/bash
#SBATCH --job-name=j_Pasta
#SBATCH --partition=batch
#SBATCH --mail-type=ALL
#SBATCH --mail-user=username@uga.edu
#SBATCH --ntasks=1
#SBATCH --mem=10gb
#SBATCH --time=4:00:00
#SBATCH --output=Pasta.%j.out
#SBATCH --error=Pasta.%j.err

cd $SLURM_SUBMIT_DIR

module load PASTA/1.8.2-foss-2016b-Python-2.7.14

run_pasta.py [options]

In the real submission script, at least all the above underlined values need to be reviewed or to be replaced by the proper values.

Please refer to Running_Jobs_on_the_teaching_cluster, Run X window Jobs and Run interactive Jobs for more details of running jobs at Teaching cluster.

Here is an example of job submission command:

sbatch ./sub.sh 

Documentation

[pakala@teach ~]$ module load PASTA/1.8.2-foss-2016b-Python-2.7.14
[pakala@teach ~]$ run_pasta.py -h
Usage: run_pasta.py [options] <settings_file1> <settings_file2> ...


PASTA performs iterative realignment and tree inference, similar to SATe, but
uses a very different merge algorithm which improves running time, memory
usage, and accuracy. The current code is heavily based on SATe, with lots of
modifications, many related to algorithmic differences between PASTA and SATe,
but also many scalability improvements (parallelization, tree parsing,
defaults, etc.)

Minimally you must provide a sequence file (with the '--input' option); a
starting tree is optional. By default, important algorithmic parameters are
set based on automatic rules.

The command line allows you to alter the behavior of the algorithm
(termination criteria, when the algorithm switches to "Blind" acceptance of
new alignments, how the tree is decomposed to find subproblems to be used, and
the external tools to use).

Options can also be passed in as configuration files.

With the format:
####################################################
[commandline]
option-name = value

[sate]
option-name = value
####################################################

With every run, PASTA saves the configuration file for that run as a temporary
file called [jobname]_temp_pasta_config.txt in your output directory.

Configuration files are read in the order they occur as arguments (with values
in later files replacing previously read values). Options specified in the
command line are read last. Thus these values "overwrite" any settings from
the configuration files. Note that the use of --auto option can overwrite some
of the other options provided by commandline or through configuration files.


Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit

  commandline options:
    -a, --aligned       If used, then the input file be will treated as
                        aligned for the purposes of the first round of tree
                        inference (the algorithm will start with tree
                        searching on the input before re-aligning). This
                        option only applies if a starting tree is NOT given.
    --auto              This option is mostly for backward compatibility. If
                        used, then automatically identified default values for
                        the max_subproblem_size, number of cpus, tools,
                        breaking strategy, masking criteria, and stopping
                        criteria will be used. This is just like using the
                        default options. However, [WARNING] when auto option
                        is used PASTA overrides the value of these options
                        even if you have supplied them; we recommend that you
                        run this option with --exportconfig to see the exact
                        set of options that will be used in your analysis.
    -d DATATYPE, --datatype=DATATYPE
                        Specify DNA, RNA, or Protein to indicate what type of
                        data is specified. Note that this option is NOT
                        automatically determined [default: dna]
    --exportconfig=EXPORTCONFIG
                        Export the configuration to the specified file and
                        exit. This is useful if you want to combine several
                        configurations and command line settings into a single
                        configuration file to be used in other analyses.
    -i INPUT, --input=INPUT
                        input sequence file
    -j JOB, --job=JOB   job name [pastajob]
    --keepalignmenttemps
                        Keep even the realignment temporary running files
                        (this only has an effect if keeptemp is also
                        selected).
    -k, --keeptemp      Keep temporary running files? [default: disabled]
    --missing=MISSING   How to deal with missing data symbols. Specify either
                        "Ambiguous" or "Absent" if the input data contains
                        ?-symbols
    -m, --multilocus    Analyze multi-locus data? NOT SUPPORTED IN CURRENT
                        PASTA version.
    --raxml-search-after
                        If used, the completion of the PASTA algorithm will be
                        followed by a tree search using RAxML on the masked
                        alignment. This can be useful if a very fast and
                        approximate tree estimator is used during the PASTA
                        algorithm. [default: disabled]
    --temporaries=TEMPORARIES
                        directory that will be the parent for this job's
                        temporary file [default in PASTA home]
    --timesfile=TIMESFILE
                        optional file that will store the times of events
                        during the PASTA run. If the file exists, new lines
                        will be
    -t TREEFILE, --treefile=TREEFILE
                        starting tree file
    --two-phase         If used, then the program will not perform the PASTA
                        algorithm. Instead it will simply call the sequence
                        aligner to align the entire dataset then will call the
                        tree estimator to obtain the tree.
    --untrusted         If used, then the data in the input file will be
                        parsed using a more careful procedure. This will
                        generate more helpful error messages, but will use
                        more memory and be much slower for large inputs. If
                        this option is omitted, the error messages resulting
                        from invalid input data will be more cryptic.

  SATe acceptance options:
    --blind-after-iter-without-imp=#
                        Maximum number of iterations without an improvement in
                        likelihood score that PASTA will run before switching
                        to blind mode. [default: disabled]
    --blind-after-time-without-imp=#.#
                        Maximum time (in seconds) that PASTA will run without
                        an improvement in likelihood score before switching to
                        blind mode. [default: disabled]
    --blind-after-total-iter=#
                        Maximum number of iterations that PASTA will run
                        before switching to blind mode. [default: 0]
    --blind-after-total-time=#.#
                        Maximum time (in seconds) that PASTA will run before
                        switching to blind mode. [default: disabled]
    --no-blind-mode-is-final
                        When the blind mode is final, then PASTA will never
                        leave blind mode once it is has entered blind mode.
    --move-to-blind-on-worse-score
                        If True then PASTA will move to the blind mode as soon
                        it encounters a tree/alignment pair with a worse
                        score. This is essentially the same as running in
                        blind mode from the beginning, but it does allow one
                        to terminate a run at an interval from the first time
                        the algorithm fails to improve the score.

  SATe decomposition options:
    --break-strategy=BREAK_STRATEGY
                        The method for choosing an edge when bisecting the
                        tree during decomposition [default: mincluster]
    --max-subproblem-frac=#.#
                        The maximum size (number of leaves) of subproblems
                        specified in terms as a proportion of the total number
                        of leaves.  When a subproblem contains this number of
                        leaves (or fewer), then it will not be decomposed
                        further. [default: automatically picked based on
                        alignment size]
    --max-subproblem-size=#
                        The maximum size (number of leaves) of subproblems.
                        When a subproblem contains this number of leaves (or
                        fewer), then it will not be decomposed further.
                        [default: automatically picked based on alignment
                        size]
    --max-subtree-diameter=#.#
                        The maximum diameter of each subtree. [default: 2.5]
    --min-subproblem-size=#
                        The minimum size (number of leaves) of subproblems.
                        [default: 0]

  SATe output options:
    -o OUTPUT_DIRECTORY, --output-directory=OUTPUT_DIRECTORY
                        directory for output files (defaults to input file
                        directory)
    --no-return-final-tree-and-alignment
                        Return the best likelihood tree and alignment pair
                        instead of those from the last iteration; this is
                        discouraged with masking option enabled.

  SATe platform options:
    --max-mem-mb=#      The maximum memory available to OPAL (for the Java
                        heap size when running Java tools).
    --num-cpus=#        The number of processing cores that you would like to
                        assign to PASTA.  This number should not exceed the
                        number of cores on your machine. [default: number of
                        cores available on the machine]

  SATe searching options:
    --mask-gappy-sites=#
                        The minimum number of non-gap characters required in
                        each column passed to the tree estimation step.
                        Columns with fewer non-gap characters than the given
                        threshold will be masked out before passing the
                        alignment into the tree estimation module. These
                        columns will be present in the final alignment.
                        [default: 0.1% of alignment size]
    --start-tree-search-from-current
                        If selected that the tree from the previous iteration
                        will be given to the tree searching tool as a starting
                        tree.

  SATe spaning-tree options:
    --build-MST         Construct the spanning tree using minimum spanning
                        tree algorithm [default: False]

  SATe termination options:
    --after-blind-iter-term-limit=#
                        The maximum number of iteration that the PASTA
                        algorithm will run after PASTA has entered blind mode.
                        If the number is less than 1, then no iteration limit
                        will be used. [default: disabled]
    --after-blind-iter-without-imp-limit=#
                        The maximum number of iterations without an
                        improvement in score that the PASTA algorithm will run
                        after entering BLIND mode.  If the number is less than
                        1, then no iteration limit will be used. [default:
                        disabled]
    --after-blind-time-term-limit=#.#
                        Maximum time (in seconds) that PASTA will continue
                        starting new iterations of realigning and tree
                        searching after PASTA has entered blind mode. If the
                        number is less than 0, then no time limit will be
                        used. [default: disabled]
    --after-blind-time-without-imp-limit=#.#
                        Maximum time (in seconds) since the last improvement
                        in score that PASTA will continue starting new
                        iterations of realigning and tree searching after
                        entering BLIND mode. If the number is less than 0,
                        then no time limit will be used. [default: disabled]
    --iter-limit=#      The maximum number of iteration that the PASTA
                        algorithm will run.  If the number is less than 1,
                        then no iteration limit will be used. [default: 3]
    --iter-without-imp-limit=#
                        The maximum number of iterations without an
                        improvement in score that the PASTA algorithm will
                        run.  If the number is less than 1, then no iteration
                        limit will be used. [default: disabled]
    --time-limit=#.#    Maximum time (in seconds) that PASTA will continue
                        starting new iterations of realigning and tree
                        searching. If the number is less than 0, then no time
                        limit will be used. [default: disabled]
    --time-without-imp-limit=#.#
                        Maximum time (in seconds) since the last improvement
                        in score that PASTA will continue starting new
                        iterations of realigning and tree searching. If the
                        number is less than 0, then no time limit will be
                        used. [default: disabled]

  SATe tools options:
    --aligner=ALIGNER   The name of the alignment program to use for
                        subproblems. [default: mafft]
    --merger=MERGER     The name of the alignment program to use to merge
                        subproblems. [default: OPAL]
    --tree-estimator=TREE_ESTIMATOR
                        The name of the tree inference program to use to find
                        trees on fixed alignments. [default: fasttree]

Installation

  • Version 1.8.2, [2]

System

64-bit Linux