PASTA-Teaching
Category
Bioinformatics
Program On
Teaching
Version
1.8.2
Author / Distributor
Description
PASTA (Practical Alignment using SATe and Transitivity) performs iterative realignment and tree inference on sequence data, similar to SATe.
Running Program
Also refer to Running Jobs on the teaching cluster
For more information on Environment Modules please see the Lmod page.
- Version 1.8.2, installed in /usr/local/apps/gb/PASTA/1.8.2
To use this version of PASTA, first load the module with:
module load PASTA/1.8.2-foss-2016b-Python-2.7.14
This module sets up the path to use PASTA with Python 2.7.14.
Sample job submission script (sub.sh) to run run_pasta.py:
#!/bin/bash
#SBATCH --job-name=j_Pasta
#SBATCH --partition=batch
#SBATCH --mail-type=ALL
#SBATCH --mail-user=username@uga.edu
#SBATCH --ntasks=1
#SBATCH --mem=10gb
#SBATCH --time=4:00:00
#SBATCH --output=Pasta.%j.out
#SBATCH --error=Pasta.%j.err
cd $SLURM_SUBMIT_DIR
module load PASTA/1.8.2-foss-2016b-Python-2.7.14
run_pasta.py [options]
In a real submission script, review the values above, such as the job name, e-mail address, memory, time limit, and run_pasta.py options, and replace them with values appropriate for your job.
Please refer to Running Jobs on the teaching cluster, Run X window Jobs, and Run interactive Jobs for more details on running jobs on the teaching cluster.
Here is an example of a job submission command:
sbatch ./sub.sh
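The sample script above requests a single task, while the help text in the Documentation section states that --num-cpus defaults to the number of cores available on the machine. A minimal sketch of a multi-core variant follows, assuming you want PASTA to use exactly the cores that Slurm allocates. The file name sub_multicore.sh, the input file input.fasta, and the choice of 4 cores are placeholders for illustration, not site recommendations. The block writes the script with a here-document so it can be syntax-checked before submission.

```shell
# Write a hypothetical multi-core submission script (sub_multicore.sh).
# The quoted 'EOF' keeps $SLURM_* variables literal until the job runs.
cat > sub_multicore.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=j_Pasta
#SBATCH --partition=batch
#SBATCH --ntasks=1
# Placeholder core count; adjust to your job.
#SBATCH --cpus-per-task=4
#SBATCH --mem=10gb
#SBATCH --time=4:00:00
#SBATCH --output=Pasta.%j.out
#SBATCH --error=Pasta.%j.err

cd "$SLURM_SUBMIT_DIR"
module load PASTA/1.8.2-foss-2016b-Python-2.7.14

# Slurm sets SLURM_CPUS_PER_TASK when --cpus-per-task is requested;
# fall back to 1 core if it is unset.
run_pasta.py --input=input.fasta --datatype=dna \
    --num-cpus="${SLURM_CPUS_PER_TASK:-1}"
EOF

# Syntax-check the generated script without running it.
bash -n sub_multicore.sh && echo "sub_multicore.sh: syntax OK"
```

The generated file would then be submitted the same way, with sbatch ./sub_multicore.sh.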
Documentation
[pakala@teach ~]$ module load PASTA/1.8.2-foss-2016b-Python-2.7.14
[pakala@teach ~]$ run_pasta.py -h
Usage: run_pasta.py [options] <settings_file1> <settings_file2> ...
PASTA performs iterative realignment and tree inference, similar to SATe, but
uses a very different merge algorithm which improves running time, memory
usage, and accuracy. The current code is heavily based on SATe, with lots of
modifications, many related to algorithmic differences between PASTA and SATe,
but also many scalability improvements (parallelization, tree parsing,
defaults, etc.)
Minimally you must provide a sequence file (with the '--input' option); a
starting tree is optional. By default, important algorithmic parameters are
set based on automatic rules.
The command line allows you to alter the behavior of the algorithm
(termination criteria, when the algorithm switches to "Blind" acceptance of
new alignments, how the tree is decomposed to find subproblems to be used, and
the external tools to use).
Options can also be passed in as configuration files.
With the format:
####################################################
[commandline]
option-name = value
[sate]
option-name = value
####################################################
With every run, PASTA saves the configuration file for that run as a temporary
file called [jobname]_temp_pasta_config.txt in your output directory.
Configuration files are read in the order they occur as arguments (with values
in later files replacing previously read values). Options specified in the
command line are read last. Thus these values "overwrite" any settings from
the configuration files. Note that the use of --auto option can overwrite some
of the other options provided by commandline or through configuration files.
Options:
--version show program's version number and exit
-h, --help show this help message and exit
commandline options:
-a, --aligned If used, then the input file be will treated as
aligned for the purposes of the first round of tree
inference (the algorithm will start with tree
searching on the input before re-aligning). This
option only applies if a starting tree is NOT given.
--auto This option is mostly for backward compatibility. If
used, then automatically identified default values for
the max_subproblem_size, number of cpus, tools,
breaking strategy, masking criteria, and stopping
criteria will be used. This is just like using the
default options. However, [WARNING] when auto option
is used PASTA overrides the value of these options
even if you have supplied them; we recommend that you
run this option with --exportconfig to see the exact
set of options that will be used in your analysis.
-d DATATYPE, --datatype=DATATYPE
Specify DNA, RNA, or Protein to indicate what type of
data is specified. Note that this option is NOT
automatically determined [default: dna]
--exportconfig=EXPORTCONFIG
Export the configuration to the specified file and
exit. This is useful if you want to combine several
configurations and command line settings into a single
configuration file to be used in other analyses.
-i INPUT, --input=INPUT
input sequence file
-j JOB, --job=JOB job name [pastajob]
--keepalignmenttemps
Keep even the realignment temporary running files
(this only has an effect if keeptemp is also
selected).
-k, --keeptemp Keep temporary running files? [default: disabled]
--missing=MISSING How to deal with missing data symbols. Specify either
"Ambiguous" or "Absent" if the input data contains
?-symbols
-m, --multilocus Analyze multi-locus data? NOT SUPPORTED IN CURRENT
PASTA version.
--raxml-search-after
If used, the completion of the PASTA algorithm will be
followed by a tree search using RAxML on the masked
alignment. This can be useful if a very fast and
approximate tree estimator is used during the PASTA
algorithm. [default: disabled]
--temporaries=TEMPORARIES
directory that will be the parent for this job's
temporary file [default in PASTA home]
--timesfile=TIMESFILE
optional file that will store the times of events
during the PASTA run. If the file exists, new lines
will be appended to the existing file.
-t TREEFILE, --treefile=TREEFILE
starting tree file
--two-phase If used, then the program will not perform the PASTA
algorithm. Instead it will simply call the sequence
aligner to align the entire dataset then will call the
tree estimator to obtain the tree.
--untrusted If used, then the data in the input file will be
parsed using a more careful procedure. This will
generate more helpful error messages, but will use
more memory and be much slower for large inputs. If
this option is omitted, the error messages resulting
from invalid input data will be more cryptic.
SATe acceptance options:
--blind-after-iter-without-imp=#
Maximum number of iterations without an improvement in
likelihood score that PASTA will run before switching
to blind mode. [default: disabled]
--blind-after-time-without-imp=#.#
Maximum time (in seconds) that PASTA will run without
an improvement in likelihood score before switching to
blind mode. [default: disabled]
--blind-after-total-iter=#
Maximum number of iterations that PASTA will run
before switching to blind mode. [default: 0]
--blind-after-total-time=#.#
Maximum time (in seconds) that PASTA will run before
switching to blind mode. [default: disabled]
--no-blind-mode-is-final
When the blind mode is final, then PASTA will never
leave blind mode once it is has entered blind mode.
--move-to-blind-on-worse-score
If True then PASTA will move to the blind mode as soon
it encounters a tree/alignment pair with a worse
score. This is essentially the same as running in
blind mode from the beginning, but it does allow one
to terminate a run at an interval from the first time
the algorithm fails to improve the score.
SATe decomposition options:
--break-strategy=BREAK_STRATEGY
The method for choosing an edge when bisecting the
tree during decomposition [default: mincluster]
--max-subproblem-frac=#.#
The maximum size (number of leaves) of subproblems
specified in terms as a proportion of the total number
of leaves. When a subproblem contains this number of
leaves (or fewer), then it will not be decomposed
further. [default: automatically picked based on
alignment size]
--max-subproblem-size=#
The maximum size (number of leaves) of subproblems.
When a subproblem contains this number of leaves (or
fewer), then it will not be decomposed further.
[default: automatically picked based on alignment
size]
--max-subtree-diameter=#.#
The maximum diameter of each subtree. [default: 2.5]
--min-subproblem-size=#
The minimum size (number of leaves) of subproblems.
[default: 0]
SATe output options:
-o OUTPUT_DIRECTORY, --output-directory=OUTPUT_DIRECTORY
directory for output files (defaults to input file
directory)
--no-return-final-tree-and-alignment
Return the best likelihood tree and alignment pair
instead of those from the last iteration; this is
discouraged with masking option enabled.
SATe platform options:
--max-mem-mb=# The maximum memory available to OPAL (for the Java
heap size when running Java tools).
--num-cpus=# The number of processing cores that you would like to
assign to PASTA. This number should not exceed the
number of cores on your machine. [default: number of
cores available on the machine]
SATe searching options:
--mask-gappy-sites=#
The minimum number of non-gap characters required in
each column passed to the tree estimation step.
Columns with fewer non-gap characters than the given
threshold will be masked out before passing the
alignment into the tree estimation module. These
columns will be present in the final alignment.
[default: 0.1% of alignment size]
--start-tree-search-from-current
If selected that the tree from the previous iteration
will be given to the tree searching tool as a starting
tree.
SATe spaning-tree options:
--build-MST Construct the spanning tree using minimum spanning
tree algorithm [default: False]
SATe termination options:
--after-blind-iter-term-limit=#
The maximum number of iteration that the PASTA
algorithm will run after PASTA has entered blind mode.
If the number is less than 1, then no iteration limit
will be used. [default: disabled]
--after-blind-iter-without-imp-limit=#
The maximum number of iterations without an
improvement in score that the PASTA algorithm will run
after entering BLIND mode. If the number is less than
1, then no iteration limit will be used. [default:
disabled]
--after-blind-time-term-limit=#.#
Maximum time (in seconds) that PASTA will continue
starting new iterations of realigning and tree
searching after PASTA has entered blind mode. If the
number is less than 0, then no time limit will be
used. [default: disabled]
--after-blind-time-without-imp-limit=#.#
Maximum time (in seconds) since the last improvement
in score that PASTA will continue starting new
iterations of realigning and tree searching after
entering BLIND mode. If the number is less than 0,
then no time limit will be used. [default: disabled]
--iter-limit=# The maximum number of iteration that the PASTA
algorithm will run. If the number is less than 1,
then no iteration limit will be used. [default: 3]
--iter-without-imp-limit=#
The maximum number of iterations without an
improvement in score that the PASTA algorithm will
run. If the number is less than 1, then no iteration
limit will be used. [default: disabled]
--time-limit=#.# Maximum time (in seconds) that PASTA will continue
starting new iterations of realigning and tree
searching. If the number is less than 0, then no time
limit will be used. [default: disabled]
--time-without-imp-limit=#.#
Maximum time (in seconds) since the last improvement
in score that PASTA will continue starting new
iterations of realigning and tree searching. If the
number is less than 0, then no time limit will be
used. [default: disabled]
SATe tools options:
--aligner=ALIGNER The name of the alignment program to use for
subproblems. [default: mafft]
--merger=MERGER The name of the alignment program to use to merge
subproblems. [default: OPAL]
--tree-estimator=TREE_ESTIMATOR
The name of the tree inference program to use to find
trees on fixed alignments. [default: fasttree]
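The help text above describes configuration files with [commandline] and [sate] sections, read in argument order, with command-line options applied last. A minimal sketch of that workflow follows. The file names base.cfg and merged.cfg, the input file input.fasta, and the job name demo are hypothetical, and the option names inside the file are an assumption (taken to match the long command-line option names); running with --exportconfig is the documented way to see the exact names PASTA uses.

```shell
# Write a hypothetical base configuration file.
# Section names follow the format shown in the help text; the option
# names are assumed to match the long command-line option names.
cat > base.cfg <<'EOF'
[commandline]
datatype = dna
job = demo

[sate]
aligner = mafft
EOF

# Options given on the command line are read last, so --datatype=protein
# here is expected to replace "datatype = dna" from base.cfg.
# --exportconfig writes the combined settings to merged.cfg and exits,
# which lets you inspect them before starting a real run.
if command -v run_pasta.py >/dev/null 2>&1; then
    run_pasta.py --input=input.fasta --datatype=protein \
        --exportconfig=merged.cfg base.cfg && cat merged.cfg
else
    echo "run_pasta.py not found; load the PASTA module first" >&2
fi
```

For an actual run, drop --exportconfig and keep the rest. Note that, per the help text, --auto can override values supplied this way.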
Installation
- Version 1.8.2
System
64-bit Linux