PASTA-Teaching
Category
Bioinformatics
Program On
Teaching
Version
1.8.2
Author / Distributor
Description
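PASTA (Practical Alignment using SATe and Transitivity) performs iterative realignment and tree inference, similar to SATe, but uses a different merge algorithm.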
Running Program
Also refer to Running Jobs on the teaching cluster
For more information on Environment Modules please see the Lmod page.
- Version 1.8.2, installed in /usr/local/apps/gb/PASTA/1.8.2
To use this version of PASTA, first load the module with:
module load PASTA/1.8.2-foss-2016b-Python-2.7.14
This module sets up the path to use PASTA with Python 2.7.14.
Sample job submission script (sub.sh) to run run_pasta.py:
#!/bin/bash
#SBATCH --job-name=j_Pasta
#SBATCH --partition=batch
#SBATCH --mail-type=ALL
#SBATCH --mail-user=username@uga.edu
#SBATCH --ntasks=1
#SBATCH --mem=10gb
#SBATCH --time=4:00:00
#SBATCH --output=Pasta.%j.out
#SBATCH --error=Pasta.%j.err
cd "$SLURM_SUBMIT_DIR"
module load PASTA/1.8.2-foss-2016b-Python-2.7.14
run_pasta.py [options]
In a real submission script, review at least the underlined values above and replace them with values appropriate for your job.
Please refer to Running Jobs on the teaching cluster, Run X window Jobs, and Run interactive Jobs for more details on running jobs on the teaching cluster.
Here is an example of a job submission command:
sbatch ./sub.sh
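As a rough sketch, the run_pasta.py [options] placeholder in sub.sh could be filled in along the following lines. The option flags are taken from the run_pasta.py -h output, but the input file name, output directory, and iteration limit are hypothetical placeholders. Tying --num-cpus to SLURM_CPUS_PER_TASK is an assumption about how you request cores. By default PASTA uses every core it detects on the machine, which may be more than the job was allocated.

```shell
#!/bin/bash
# Sketch of the command portion of sub.sh. The file names and option values
# below are hypothetical placeholders, not site defaults.
input="my_sequences.fasta"   # hypothetical unaligned input file (-i)
outdir="pasta_out"           # hypothetical output directory (-o)

# Match PASTA's core count to the Slurm allocation; fall back to 1 core
# when SLURM_CPUS_PER_TASK is unset (no --cpus-per-task was requested).
ncpus="${SLURM_CPUS_PER_TASK:-1}"

cmd=(run_pasta.py
     -i "$input"
     -d dna
     -j j_Pasta
     -o "$outdir"
     --num-cpus="$ncpus"
     --iter-limit=3)

# Print the command so it is recorded in the job's output file.
printf '%s ' "${cmd[@]}"; echo

# Run only if the module has put run_pasta.py on the PATH and the input exists.
if command -v run_pasta.py >/dev/null 2>&1 && [ -f "$input" ]; then
    mkdir -p "$outdir"
    "${cmd[@]}"
fi
```

If you request more than one core, add a matching #SBATCH --cpus-per-task line to the script header so that the two values agree.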
Documentation
[pakala@teach ~]$ module load PASTA/1.8.2-foss-2016b-Python-2.7.14
[pakala@teach ~]$ run_pasta.py -h
Usage: run_pasta.py [options] <settings_file1> <settings_file2> ...

PASTA performs iterative realignment and tree inference, similar to SATe, but uses a very different merge algorithm which improves running time, memory usage, and accuracy. The current code is heavily based on SATe, with lots of modifications, many related to algorithmic differences between PASTA and SATe, but also many scalability improvements (parallelization, tree parsing, defaults, etc.)

Minimally you must provide a sequence file (with the '--input' option); a starting tree is optional. By default, important algorithmic parameters are set based on automatic rules.

The command line allows you to alter the behavior of the algorithm (termination criteria, when the algorithm switches to "Blind" acceptance of new alignments, how the tree is decomposed to find subproblems to be used, and the external tools to use).

Options can also be passed in as configuration files. With the format:
####################################################
[commandline]
option-name = value
[sate]
option-name = value
####################################################

With every run, PASTA saves the configuration file for that run as a temporary file called [jobname]_temp_pasta_config.txt in your output directory.

Configuration files are read in the order they occur as arguments (with values in later files replacing previously read values). Options specified in the command line are read last. Thus these values "overwrite" any settings from the configuration files. Note that the use of --auto option can overwrite some of the other options provided by commandline or through configuration files.

Options:
  --version
      show program's version number and exit
  -h, --help
      show this help message and exit

  commandline options:
    -a, --aligned
        If used, then the input file be will treated as aligned for the purposes of the first round of tree inference (the algorithm will start with tree searching on the input before re-aligning). This option only applies if a starting tree is NOT given.
    --auto
        This option is mostly for backward compatibility. If used, then automatically identified default values for the max_subproblem_size, number of cpus, tools, breaking strategy, masking criteria, and stopping criteria will be used. This is just like using the default options. However, [WARNING] when auto option is used PASTA overrides the value of these options even if you have supplied them; we recommend that you run this option with --exportconfig to see the exact set of options that will be used in your analysis.
    -d DATATYPE, --datatype=DATATYPE
        Specify DNA, RNA, or Protein to indicate what type of data is specified. Note that this option is NOT automatically determined [default: dna]
    --exportconfig=EXPORTCONFIG
        Export the configuration to the specified file and exit. This is useful if you want to combine several configurations and command line settings into a single configuration file to be used in other analyses.
    -i INPUT, --input=INPUT
        input sequence file
    -j JOB, --job=JOB
        job name [pastajob]
    --keepalignmenttemps
        Keep even the realignment temporary running files (this only has an effect if keeptemp is also selected).
    -k, --keeptemp
        Keep temporary running files? [default: disabled]
    --missing=MISSING
        How to deal with missing data symbols. Specify either "Ambiguous" or "Absent" if the input data contains ?-symbols
    -m, --multilocus
        Analyze multi-locus data? NOT SUPPORTED IN CURRENT PASTA version.
    --raxml-search-after
        If used, the completion of the PASTA algorithm will be followed by a tree search using RAxML on the masked alignment. This can be useful if a very fast and approximate tree estimator is used during the PASTA algorithm. [default: disabled]
    --temporaries=TEMPORARIES
        directory that will be the parent for this job's temporary file [default in PASTA home]
    --timesfile=TIMESFILE
        optional file that will store the times of events during the PASTA run. If the file exists, new lines will be
    -t TREEFILE, --treefile=TREEFILE
        starting tree file
    --two-phase
        If used, then the program will not perform the PASTA algorithm. Instead it will simply call the sequence aligner to align the entire dataset then will call the tree estimator to obtain the tree.
    --untrusted
        If used, then the data in the input file will be parsed using a more careful procedure. This will generate more helpful error messages, but will use more memory and be much slower for large inputs. If this option is omitted, the error messages resulting from invalid input data will be more cryptic.

  SATe acceptance options:
    --blind-after-iter-without-imp=#
        Maximum number of iterations without an improvement in likelihood score that PASTA will run before switching to blind mode. [default: disabled]
    --blind-after-time-without-imp=#.#
        Maximum time (in seconds) that PASTA will run without an improvement in likelihood score before switching to blind mode. [default: disabled]
    --blind-after-total-iter=#
        Maximum number of iterations that PASTA will run before switching to blind mode. [default: 0]
    --blind-after-total-time=#.#
        Maximum time (in seconds) that PASTA will run before switching to blind mode. [default: disabled]
    --no-blind-mode-is-final
        When the blind mode is final, then PASTA will never leave blind mode once it is has entered blind mode.
    --move-to-blind-on-worse-score
        If True then PASTA will move to the blind mode as soon it encounters a tree/alignment pair with a worse score. This is essentially the same as running in blind mode from the beginning, but it does allow one to terminate a run at an interval from the first time the algorithm fails to improve the score.

  SATe decomposition options:
    --break-strategy=BREAK_STRATEGY
        The method for choosing an edge when bisecting the tree during decomposition [default: mincluster]
    --max-subproblem-frac=#.#
        The maximum size (number of leaves) of subproblems specified in terms as a proportion of the total number of leaves. When a subproblem contains this number of leaves (or fewer), then it will not be decomposed further. [default: automatically picked based on alignment size]
    --max-subproblem-size=#
        The maximum size (number of leaves) of subproblems. When a subproblem contains this number of leaves (or fewer), then it will not be decomposed further. [default: automatically picked based on alignment size]
    --max-subtree-diameter=#.#
        The maximum diameter of each subtree. [default: 2.5]
    --min-subproblem-size=#
        The minimum size (number of leaves) of subproblems. [default: 0]

  SATe output options:
    -o OUTPUT_DIRECTORY, --output-directory=OUTPUT_DIRECTORY
        directory for output files (defaults to input file directory)
    --no-return-final-tree-and-alignment
        Return the best likelihood tree and alignment pair instead of those from the last iteration; this is discouraged with masking option enabled.

  SATe platform options:
    --max-mem-mb=#
        The maximum memory available to OPAL (for the Java heap size when running Java tools).
    --num-cpus=#
        The number of processing cores that you would like to assign to PASTA. This number should not exceed the number of cores on your machine. [default: number of cores available on the machine]

  SATe searching options:
    --mask-gappy-sites=#
        The minimum number of non-gap characters required in each column passed to the tree estimation step. Columns with fewer non-gap characters than the given threshold will be masked out before passing the alignment into the tree estimation module. These columns will be present in the final alignment. [default: 0.1% of alignment size]
    --start-tree-search-from-current
        If selected that the tree from the previous iteration will be given to the tree searching tool as a starting tree.

  SATe spaning-tree options:
    --build-MST
        Construct the spanning tree using minimum spanning tree algorithm [default: False]

  SATe termination options:
    --after-blind-iter-term-limit=#
        The maximum number of iteration that the PASTA algorithm will run after PASTA has entered blind mode. If the number is less than 1, then no iteration limit will be used. [default: disabled]
    --after-blind-iter-without-imp-limit=#
        The maximum number of iterations without an improvement in score that the PASTA algorithm will run after entering BLIND mode. If the number is less than 1, then no iteration limit will be used. [default: disabled]
    --after-blind-time-term-limit=#.#
        Maximum time (in seconds) that PASTA will continue starting new iterations of realigning and tree searching after PASTA has entered blind mode. If the number is less than 0, then no time limit will be used. [default: disabled]
    --after-blind-time-without-imp-limit=#.#
        Maximum time (in seconds) since the last improvement in score that PASTA will continue starting new iterations of realigning and tree searching after entering BLIND mode. If the number is less than 0, then no time limit will be used. [default: disabled]
    --iter-limit=#
        The maximum number of iteration that the PASTA algorithm will run. If the number is less than 1, then no iteration limit will be used. [default: 3]
    --iter-without-imp-limit=#
        The maximum number of iterations without an improvement in score that the PASTA algorithm will run. If the number is less than 1, then no iteration limit will be used. [default: disabled]
    --time-limit=#.#
        Maximum time (in seconds) that PASTA will continue starting new iterations of realigning and tree searching. If the number is less than 0, then no time limit will be used. [default: disabled]
    --time-without-imp-limit=#.#
        Maximum time (in seconds) since the last improvement in score that PASTA will continue starting new iterations of realigning and tree searching. If the number is less than 0, then no time limit will be used. [default: disabled]

  SATe tools options:
    --aligner=ALIGNER
        The name of the alignment program to use for subproblems. [default: mafft]
    --merger=MERGER
        The name of the alignment program to use to merge subproblems. [default: OPAL]
    --tree-estimator=TREE_ESTIMATOR
        The name of the tree inference program to use to find trees on fixed alignments. [default: fasttree]
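The configuration-file format described in the help text can be tried out with a small settings file. The sketch below writes one and passes it to run_pasta.py after the options, following the Usage line. The file names are hypothetical. Placing datatype and job under [commandline] and aligner and merger under [sate] is an assumption based on the option groups in the help output. Running with --exportconfig (which, per the help, writes the file and exits) is one way to check what PASTA actually resolved.

```shell
#!/bin/bash
# Sketch: write a PASTA settings file and pass it to run_pasta.py.
# File names are hypothetical; the section placement of each option is an
# assumption based on the option groups shown by run_pasta.py -h.
cfg="my_pasta.cfg"
input="my_sequences.fasta"

cat > "$cfg" <<'EOF'
[commandline]
datatype = dna
job = j_Pasta

[sate]
aligner = mafft
merger = OPAL
EOF

# Options given on the command line are read last, so -i here would
# override any input file named in the settings file.
cmd=(run_pasta.py -i "$input" --exportconfig=merged_config.txt "$cfg")
printf '%s ' "${cmd[@]}"; echo

# Export the merged configuration for inspection, but only where PASTA is
# actually available and the input file exists.
if command -v run_pasta.py >/dev/null 2>&1 && [ -f "$input" ]; then
    "${cmd[@]}"
fi
```

Dropping --exportconfig from the command would start the analysis itself with the same settings.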
Installation
- Version 1.8.2, [2]
System
64-bit Linux