PartitionFinder-Teaching
Revision as of 13:31, 1 April 2019
Category
Bioinformatics
Program On
Teaching
Version
2.1.1
Author / Distributor
Details are at Partitionfinder
Description
"PartitionFinder is free open source software to select best-fit partitioning schemes and models of molecular evolution for phylogenetic analyses." More details are at Partitionfinder.
"PartitionFinder is a Python program for simultaneously choosing partitioning schemes and models of molecular evolution for phylogenetic analyses of DNA, protein, and morphological data." More details are at Partitionfinder Github.
Running Program
Also refer to Running Jobs on Sapelo2
For more information on Environment Modules on Sapelo2 please see the Lmod page.
- Version 2.1.1, installed in /usr/local/apps/eb/PartitionFinder/2.1.1-foss-2016b-Python-2.7.14.
To use the scripts from this version of PartitionFinder, first load the module with
ml PartitionFinder/2.1.1-foss-2016b-Python-2.7.14
Here is an example of a shell script, sub.sh, to run on the batch queue:
#!/bin/bash
#SBATCH --job-name=j_PartitionFinder
#SBATCH --partition=batch
#SBATCH --mail-type=ALL
#SBATCH --mail-user=username@uga.edu
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=10gb
#SBATCH --time=08:00:00
#SBATCH --output=PartitionFinder.%j.out
#SBATCH --error=PartitionFinder.%j.err
cd $SLURM_SUBMIT_DIR
ml PartitionFinder/2.1.1-foss-2016b-Python-2.7.14
time python $EBROOTPARTITIONFINDER/PartitionFinder.py -p 4 [options]
where EBROOTPARTITIONFINDER is the environment variable that stores the PartitionFinder installation path on the cluster (i.e. /usr/local/apps/eb/PartitionFinder/2.1.1-foss-2016b-Python-2.7.14), and [options] needs to be replaced by the options (command and arguments) you want to use. In a real submission script, at least all the underlined values above need to be reviewed or replaced with proper values.
Please refer to Running_Jobs_on_the_teaching_cluster, Run X window Jobs and Run interactive Jobs for more details on running jobs on the Teaching cluster.
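The script above passes a fixed value to -p. As a minimal sketch, assuming the job requests cores with --cpus-per-task (Slurm sets SLURM_CPUS_PER_TASK only in that case), the process count can be taken from the allocation instead. The helper names pf_processes and pf_command are hypothetical, and the command is only printed here, not executed.

```shell
#!/bin/bash
# Sketch: derive the -p value from the Slurm allocation instead of hard-coding it.

# Number of processes to hand to PartitionFinder.
# Falls back to 1 when SLURM_CPUS_PER_TASK is not set (e.g. outside a job).
pf_processes() {
    echo "${SLURM_CPUS_PER_TASK:-1}"
}

# Build the command line as a string so it can be inspected before running.
# The fallback path is the installation directory quoted on this page;
# inside a job the loaded module sets EBROOTPARTITIONFINDER itself.
pf_command() {
    folder="$1"
    root="${EBROOTPARTITIONFINDER:-/usr/local/apps/eb/PartitionFinder/2.1.1-foss-2016b-Python-2.7.14}"
    echo "python ${root}/PartitionFinder.py -p $(pf_processes) ${folder}"
}

# 'example' is the folder name used in the PartitionFinder help text.
pf_command example
```

In a real sub.sh the printed line would replace the hard-coded python command, so that -p follows whatever --cpus-per-task requests.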
Here is an example of job submission command:
sbatch ./sub.sh
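On success, sbatch prints a confirmation line that ends with the job ID, and sub.sh uses that ID (%j) in its output and error file names. The following is a small sketch of recovering the ID and the matching file names; the helper job_id_from and the job ID 12345 are made up for illustration.

```shell
#!/bin/bash
# Sketch: find the log files that belong to a submitted job.

# Extract the job ID, which is the last whitespace-separated field
# of the confirmation line printed by sbatch.
job_id_from() {
    printf '%s\n' "$1" | awk '{print $NF}'
}

# Sample confirmation line; in practice this would be: msg=$(sbatch ./sub.sh)
msg="Submitted batch job 12345"
id=$(job_id_from "$msg")

# File names follow the --output and --error patterns in sub.sh.
echo "stdout: PartitionFinder.${id}.out"
echo "stderr: PartitionFinder.${id}.err"
```

Alternatively, sbatch --parsable prints only the job ID (followed by the cluster name, if any, after a semicolon), which avoids parsing the sentence.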
Documentation
Details are at PartitionFinder
ml PartitionFinder/2.1.1-foss-2016b-Python-2.7.14
python $EBROOTPARTITIONFINDER/PartitionFinder.py --help
Usage: python PartitionFinder.py [options] <foldername>

    PartitionFinder and PartitionFinderProtein are designed to discover optimal
    partitioning schemes for nucleotide and amino acid sequence alignments.
    They are also useful for finding the best model of sequence evolution for datasets.

    The Input: <foldername>: the full path to a folder containing:
        - A configuration file (partition_finder.cfg)
        - A nucleotide/aa alignment in Phylip format
    Take a look at the included 'example' folder for more details.

    The Output: A file in the same directory as the .cfg file, named
    'analysis' This file contains information on the best
    partitioning scheme, and the best model for each partiiton

    Usage Examples:
        >python PartitionFinder.py example
        Analyse what is in the 'example' sub-folder in the current folder.

        >python PartitionFinder.py -v example
        Analyse what is in the 'example' sub-folder in the current folder, but
        show all the debug output

        >python PartitionFinder.py -c ~/data/frogs
        Check the configuration files in the folder data/frogs in the current
        user's home folder.

        >python PartitionFinder.py --force-restart ~/data/frogs
        Deletes any data produced by the previous runs (which is in
        ~/data/frogs/output) and starts afresh

Options:
  -h, --help            show this help message and exit
  -v, --verbose         show debug logging information (equivalent to
                        --debug-out=all)
  -c, --check-only      just check the configuration files, don't do any
                        processing
  -f, --force-restart   delete all previous output and start afresh (!)
  -p N, --processes=N   Number of concurrent processes to use. Use -1 to match
                        the number of cpus on the machine. The default is to
                        use -1.
  --show-python-exceptions
                        If errors occur, print the python exceptions
  --save-phylofiles     save all of the phyml or raxml output. This can take a
                        lot of space(!)
  --dump-results        Dump all results to a binary file. This is only of use
                        for testing purposes.
  --compare-results     Compare the results to previously dumped binary
                        results. This is only of use for testing purposes.
  -q, --quick           Avoid anything slow (like writing schemes at each
                        step),useful for very large datasets.
  -r, --raxml           Use RAxML (rather than PhyML) to do the analysis. See
                        the manual
  -n, --no-ml-tree      Estimate a starting tree with NJ (PhyML) or MP (RaxML)
                        instead of the default which is to estimate a starting
                        tree with ML using in RAxML. Not recommended.
  --cmdline-extras=N    Add additional commands to the phyml or raxml
                        commandlines that PF uses.This can be useful e.g. if
                        you want to change the accuracy of lnL calculations
                        ('-e' option in raxml), or use multi-threaded versions
                        of raxml that require you to specify the number of
                        threads you will let raxml use ('-T' option in raxml.
                        E.g. you might specify this: --cmndline_extras ' -e
                        2.0 -T 10 ' N.B. MAKE SURE YOU PUT YOUR EXTRAS IN
                        QUOTES, and only use this command if you really know
                        what you're doing and are very familiar with raxml and
                        PartitionFinder
  --weights=N           Mainly for algorithm development. Only use it if you
                        know what you're doing.A list of weights to use in the
                        clustering algorithms. This list allows you to assign
                        different weights to: the overall rate for a subset,
                        the base/amino acid frequencies, model parameters, and
                        alpha value. This will affect how subsets are
                        clustered together. For instance: --cluster_weights
                        '1, 2, 5, 1', would weight the base freqeuncies 2x
                        more than the overall rate, the model parameters 5x
                        more, and the alpha parameter the same as the model
                        rate
  --kmeans=type         This defines which sitewise values to use: entropy or
                        tiger
                        --kmeans entropy: use entropies for sitewise values
                        --kmeans tiger: use TIGER rates for sitewise values
                        (only valid for Morphology)
  --rcluster-percent=N  This defines the proportion of possible schemes that
                        the relaxed clustering algorithm will consider before
                        it stops looking. The default is 10%. e.g.
                        --rcluster-percent 10.0
  --rcluster-max=N      This defines the number of possible schemes that the
                        relaxed clustering algorithm will consider before it
                        stops looking. The default is to look at the larger
                        value out of 1000, and 10 times the number of data
                        blocks you have. e.g. --rcluster-max 1000
  --min-subset-size=N   This defines the minimum subset size that the kmeans
                        and rcluster algorithm will accept. Subsets smaller
                        than this will be merged at with other subsets at the
                        end of the algorithm (for kmeans) or at the start of
                        the algorithm (for rcluster). See manual for details.
                        The default value for kmeans is 100. The default value
                        for rcluster is to ignore this option. e.g.
                        --min-subset-size 100
  --debug-output=REGION,REGION,...
                        (advanced option) Provide a list of debug regions to
                        output extra information about what the program is
                        doing. Possible regions are 'all' or any of
                        {subset,subset_ops,raxml,parser,model_util,results,
                        entropy,alignment,threadpool,progress,main,config,
                        reporter,kmeans,neighbour,morph_tige,analysis_m,util,
                        scheme,submodels,database,analysis,phyml,raxml_mode,
                        model_load,phyml_mode,sklearn}.
  --all-states          In the kmeans and rcluster algorithms, this stipulates
                        that PartitionFinder should not produce subsets that
                        do not have all possible states present. E.g. for DNA
                        sequence data, all subsets in the final scheme must
                        have A, C, T, and G nucleotides present. This can
                        occasionally be useful for downstream analyses,
                        particularly concerning amino acid datasets.
  --profile             Output profiling information after running (this will
                        slow everything down!)

PartitionFinderMorphology.py and PartitionFinderProtein.py take the same options. Their --help output matches the text above, with the script name replaced in the usage line and in the usage examples:

python $EBROOTPARTITIONFINDER/PartitionFinderMorphology.py --help
python $EBROOTPARTITIONFINDER/PartitionFinderProtein.py --help
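The help text says the input folder must contain a configuration file named partition_finder.cfg and an alignment in Phylip format. As a rough pre-flight check before submitting a job, the sketch below tests for both. It assumes the configuration file names the alignment on a line of the form "alignment = <file>;", as in the example folder shipped with PartitionFinder; check_pf_folder is a hypothetical helper, not part of PartitionFinder.

```shell
#!/bin/bash
# Sketch: verify that a folder looks like a PartitionFinder input folder.
check_pf_folder() {
    folder="$1"
    cfg="${folder}/partition_finder.cfg"

    # The configuration file must exist under this exact name.
    if [ ! -f "$cfg" ]; then
        echo "missing ${cfg}" >&2
        return 1
    fi

    # Assumption: the cfg names the alignment as "alignment = <file>;".
    # Take the first such line and strip the key, spaces and semicolon.
    aln=$(sed -n 's/^[[:space:]]*alignment[[:space:]]*=[[:space:]]*\([^;[:space:]]*\).*/\1/p' "$cfg" | head -n 1)

    # The named alignment file must sit in the same folder.
    if [ -z "$aln" ] || [ ! -f "${folder}/${aln}" ]; then
        echo "alignment file named in ${cfg} not found" >&2
        return 1
    fi

    echo "ok: ${cfg} and ${aln}"
}
```

For example, check_pf_folder ~/data/frogs returns a non-zero status if either file is absent. PartitionFinder's own -c (--check-only) option remains the authoritative check of the configuration contents.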
Installation
Source code from PartitionFinder Github Releases
System
64-bit Linux