PartitionFinder-Teaching

From Research Computing Center Wiki

Category

Bioinformatics

Program On

Teaching

Version

2.1.1

Author / Distributor

Details are at Partitionfinder

Description

"PartitionFinder is free open source software to select best-fit partitioning schemes and models of molecular evolution for phylogenetic analyses." More details are at Partitionfinder.

"PartitionFinder is a Python program for simultaneously choosing partitioning schemes and models of molecular evolution for phylogenetic analyses of DNA, protein, and morphological data." More details are at PartitionFinder GitHub.

Running Program

Also refer to Running Jobs on the teaching cluster

For more information on Environment Modules please see the Lmod page.

  • Version 2.1.1, installed in /usr/local/apps/eb/PartitionFinder/2.1.1-foss-2016b-Python-2.7.14.

To use the scripts from this version of PartitionFinder, first load the module with:

ml PartitionFinder/2.1.1-foss-2016b-Python-2.7.14

Here is an example of a shell script, sub.sh, to run on the batch queue:

#!/bin/bash
#SBATCH --job-name=j_PartitionFinder
#SBATCH --partition=batch
#SBATCH --mail-type=ALL
#SBATCH --mail-user=username@uga.edu
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=10gb
#SBATCH --time=08:00:00
#SBATCH --output=PartitionFinder.%j.out
#SBATCH --error=PartitionFinder.%j.err

cd $SLURM_SUBMIT_DIR

ml PartitionFinder/2.1.1-foss-2016b-Python-2.7.14

time python $EBROOTPARTITIONFINDER/PartitionFinder.py -p 4 [options]

where EBROOTPARTITIONFINDER is the environment variable that stores the PartitionFinder installation path on the cluster (i.e. /usr/local/apps/eb/PartitionFinder/2.1.1-foss-2016b-Python-2.7.14), and [options] must be replaced by the options and arguments you want to use, including the input folder. In a real submission script, review all of the #SBATCH values above and the -p value, and replace them with values suited to your job. The -p value should not exceed the number of cores the job requests.
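One way to keep the -p value in step with the job's core request is to read SLURM_CPUS_PER_TASK, which Slurm sets inside a job when --cpus-per-task is given. The lines below are a minimal sketch and not part of the installed software; the helper name pf_procs is hypothetical, and the fallback of 1 is an assumption for runs where the variable is unset.

```shell
# Hypothetical helper: print the number of processes to pass to -p.
# Uses SLURM_CPUS_PER_TASK when Slurm has set it, otherwise falls back to 1.
pf_procs() {
    echo "${SLURM_CPUS_PER_TASK:-1}"
}
```

In sub.sh, the fixed value could then be replaced with -p "$(pf_procs)", so that changing --cpus-per-task is the only edit needed when resizing the job.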

For more details on running jobs on the teaching cluster, please refer to Running Jobs on the teaching cluster, Run X window Jobs, and Run interactive Jobs.
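As the help text under Documentation notes, PartitionFinder takes a folder that holds a configuration file (partition_finder.cfg) and an alignment in Phylip format. The sketch below sets up such a folder with toy data. The function name make_pf_input, the folder and file names, the sequences, and the data block names are all hypothetical, and the configuration keys are written from memory of the PartitionFinder 2 example folder, so please check them against the 'example' folder shipped with the program before relying on them.

```shell
# Hypothetical helper: create an input folder with the two files
# PartitionFinder expects. All names and data here are illustrative.
make_pf_input() {
    dir="$1"
    mkdir -p "$dir"

    # Toy Phylip alignment: the header is "<number of taxa> <number of sites>".
    cat > "$dir/alignment.phy" <<'EOF'
4 12
taxonA ATGGCCAAGTTT
taxonB ATGGCTAAGTTC
taxonC ATGGCCAAATTT
taxonD ATGGCAAAGTTC
EOF

    # Configuration file. Key names are assumptions; verify them against
    # the 'example' folder included with PartitionFinder.
    cat > "$dir/partition_finder.cfg" <<'EOF'
alignment = alignment.phy;
branchlengths = linked;
models = all;
model_selection = aicc;

[data_blocks]
gene1_pos1 = 1-12\3;
gene1_pos2 = 2-12\3;
gene1_pos3 = 3-12\3;

[schemes]
search = greedy;
EOF
}
```

Calling make_pf_input my_analysis would create the folder in the current directory; the folder path is then the final argument to PartitionFinder.py in the submission script. A real analysis needs a real alignment, since a toy alignment this small is only useful for checking the folder layout.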


Here is an example of a job submission command:

sbatch ./sub.sh 

Documentation

Details are at PartitionFinder

 
ml PartitionFinder/2.1.1-foss-2016b-Python-2.7.14

python $EBROOTPARTITIONFINDER/PartitionFinder.py --help

Usage: python PartitionFinder.py [options] <foldername>

    PartitionFinder and PartitionFinderProtein are designed to discover optimal
    partitioning schemes for nucleotide and amino acid sequence alignments.
    They are also useful for finding the best model of sequence evolution for datasets.

    The Input: <foldername>: the full path to a folder containing:
        - A configuration file (partition_finder.cfg)
        - A nucleotide/aa alignment in Phylip format
    Take a look at the included 'example' folder for more details.

    The Output: A file in the same directory as the .cfg file, named
    'analysis' This file contains information on the best
    partitioning scheme, and the best model for each partiiton

    Usage Examples:
        >python PartitionFinder.py example
        Analyse what is in the 'example' sub-folder in the current folder.

        >python PartitionFinder.py -v example
        Analyse what is in the 'example' sub-folder in the current folder, but
        show all the debug output

        >python PartitionFinder.py -c ~/data/frogs
        Check the configuration files in the folder data/frogs in the current
        user's home folder.

        >python PartitionFinder.py --force-restart ~/data/frogs
        Deletes any data produced by the previous runs (which is in
        ~/data/frogs/output) and starts afresh
    

Options:
  -h, --help            show this help message and exit
  -v, --verbose         show debug logging information (equivalent to --debug-
                        out=all)
  -c, --check-only      just check the configuration files, don't do any
                        processing
  -f, --force-restart   delete all previous output and start afresh (!)
  -p N, --processes=N   Number of concurrent processes to use. Use -1 to match
                        the number of cpus on the machine. The default is to
                        use -1.
  --show-python-exceptions
                        If errors occur, print the python exceptions
  --save-phylofiles     save all of the phyml or raxml output. This can take a
                        lot of space(!)
  --dump-results        Dump all results to a binary file. This is only of use
                        for testing purposes.
  --compare-results     Compare the results to previously dumped binary
                        results. This is only of use for testing purposes.
  -q, --quick           Avoid anything slow (like writing schemes at each
                        step),useful for very large datasets.
  -r, --raxml           Use RAxML (rather than PhyML) to do the analysis. See
                        the manual
  -n, --no-ml-tree      Estimate a starting tree with NJ (PhyML) or MP (RaxML)
                        instead of the default which is to estimate a starting
                        tree with ML  using in RAxML. Not recommended.
  --cmdline-extras=N    Add additional commands to the phyml or raxml
                        commandlines that PF uses.This can be useful e.g. if
                        you want to change the accuracy of lnL calculations
                        ('-e' option in raxml), or use multi-threaded versions
                        of raxml that require you to specify the number of
                        threads you will let raxml use ('-T' option in raxml.
                        E.g. you might specify this: --cmndline_extras ' -e
                        2.0 -T 10 ' N.B. MAKE SURE YOU PUT YOUR EXTRAS IN
                        QUOTES, and only use this command if you really know
                        what you're doing and are very familiar with raxml and
                        PartitionFinder
  --weights=N           Mainly for algorithm development. Only use it if you
                        know what you're doing.A list of weights to use in the
                        clustering algorithms. This list allows you to assign
                        different weights to: the overall rate for a subset,
                        the base/amino acid frequencies, model parameters, and
                        alpha value. This will affect how subsets are
                        clustered together. For instance: --cluster_weights
                        '1, 2, 5, 1', would weight the base freqeuncies 2x
                        more than the overall rate, the model parameters 5x
                        more, and the alpha parameter the same as the model
                        rate
  --kmeans=type         This defines which sitewise values to use: entropy or
                        tiger  --kmeans entropy: use entropies for sitewise
                        values --kmeans tiger: use TIGER rates for sitewise
                        values (only valid for Morphology)
  --rcluster-percent=N  This defines the proportion of possible schemes that
                        the relaxed clustering algorithm will consider before
                        it stops looking. The default is 10%. e.g. --rcluster-
                        percent 10.0
  --rcluster-max=N      This defines the number of possible schemes that the
                        relaxed clustering algorithm will consider before it
                        stops looking. The default is to look at the larger
                        value out of 1000, and 10 times the number of data
                        blocks you have. e.g. --rcluster-max 1000
  --min-subset-size=N   This defines the minimum subset size that the kmeans
                        and rcluster algorithm will accept. Subsets smaller
                        than this  will be merged at with other subsets at the
                        end of the algorithm (for kmeans) or at the start of
                        the algorithm (for rcluster). See manual for details.
                        The default value for kmeans is 100. The default value
                        for rcluster is to ignore this option. e.g. --min-
                        subset-size 100
  --debug-output=REGION,REGION,...
                        (advanced option) Provide a list of debug regions to
                        output extra information about what the program is
                        doing. Possible regions are 'all' or any of {subset,su
                        bset_ops,raxml,parser,model_util,results,entropy,align
                        ment,threadpool,progress,main,config,reporter,kmeans,n
                        eighbour,morph_tige,analysis_m,util,scheme,submodels,d
                        atabase,analysis,phyml,raxml_mode,model_load,phyml_mod
                        e,sklearn}.
  --all-states          In the kmeans and rcluster algorithms, this stipulates
                        that PartitionFinder should not produce subsets that
                        do not have all possible states present. E.g. for DNA
                        sequence data, all subsets in the final scheme must
                        have A, C, T,  and G nucleotides present. This can
                        occasionally be useful for downstream  analyses,
                        particularly concerning amino acid datasets.
  --profile             Output profiling information after running (this will
                        slow everything down!)


python $EBROOTPARTITIONFINDER/PartitionFinderMorphology.py --help

Usage: python PartitionFinderMorphology.py [options] <foldername>

    PartitionFinder and PartitionFinderProtein are designed to discover optimal
    partitioning schemes for nucleotide and amino acid sequence alignments.
    They are also useful for finding the best model of sequence evolution for datasets.

    The Input: <foldername>: the full path to a folder containing:
        - A configuration file (partition_finder.cfg)
        - A nucleotide/aa alignment in Phylip format
    Take a look at the included 'example' folder for more details.

    The Output: A file in the same directory as the .cfg file, named
    'analysis' This file contains information on the best
    partitioning scheme, and the best model for each partiiton

    Usage Examples:
        >python PartitionFinderMorphology.py example
        Analyse what is in the 'example' sub-folder in the current folder.

        >python PartitionFinderMorphology.py -v example
        Analyse what is in the 'example' sub-folder in the current folder, but
        show all the debug output

        >python PartitionFinderMorphology.py -c ~/data/frogs
        Check the configuration files in the folder data/frogs in the current
        user's home folder.

        >python PartitionFinderMorphology.py --force-restart ~/data/frogs
        Deletes any data produced by the previous runs (which is in
        ~/data/frogs/output) and starts afresh
    

Options:
  -h, --help            show this help message and exit
  -v, --verbose         show debug logging information (equivalent to --debug-
                        out=all)
  -c, --check-only      just check the configuration files, don't do any
                        processing
  -f, --force-restart   delete all previous output and start afresh (!)
  -p N, --processes=N   Number of concurrent processes to use. Use -1 to match
                        the number of cpus on the machine. The default is to
                        use -1.
  --show-python-exceptions
                        If errors occur, print the python exceptions
  --save-phylofiles     save all of the phyml or raxml output. This can take a
                        lot of space(!)
  --dump-results        Dump all results to a binary file. This is only of use
                        for testing purposes.
  --compare-results     Compare the results to previously dumped binary
                        results. This is only of use for testing purposes.
  -q, --quick           Avoid anything slow (like writing schemes at each
                        step),useful for very large datasets.
  -r, --raxml           Use RAxML (rather than PhyML) to do the analysis. See
                        the manual
  -n, --no-ml-tree      Estimate a starting tree with NJ (PhyML) or MP (RaxML)
                        instead of the default which is to estimate a starting
                        tree with ML  using in RAxML. Not recommended.
  --cmdline-extras=N    Add additional commands to the phyml or raxml
                        commandlines that PF uses.This can be useful e.g. if
                        you want to change the accuracy of lnL calculations
                        ('-e' option in raxml), or use multi-threaded versions
                        of raxml that require you to specify the number of
                        threads you will let raxml use ('-T' option in raxml.
                        E.g. you might specify this: --cmndline_extras ' -e
                        2.0 -T 10 ' N.B. MAKE SURE YOU PUT YOUR EXTRAS IN
                        QUOTES, and only use this command if you really know
                        what you're doing and are very familiar with raxml and
                        PartitionFinder
  --weights=N           Mainly for algorithm development. Only use it if you
                        know what you're doing.A list of weights to use in the
                        clustering algorithms. This list allows you to assign
                        different weights to: the overall rate for a subset,
                        the base/amino acid frequencies, model parameters, and
                        alpha value. This will affect how subsets are
                        clustered together. For instance: --cluster_weights
                        '1, 2, 5, 1', would weight the base freqeuncies 2x
                        more than the overall rate, the model parameters 5x
                        more, and the alpha parameter the same as the model
                        rate
  --kmeans=type         This defines which sitewise values to use: entropy or
                        tiger  --kmeans entropy: use entropies for sitewise
                        values --kmeans tiger: use TIGER rates for sitewise
                        values (only valid for Morphology)
  --rcluster-percent=N  This defines the proportion of possible schemes that
                        the relaxed clustering algorithm will consider before
                        it stops looking. The default is 10%. e.g. --rcluster-
                        percent 10.0
  --rcluster-max=N      This defines the number of possible schemes that the
                        relaxed clustering algorithm will consider before it
                        stops looking. The default is to look at the larger
                        value out of 1000, and 10 times the number of data
                        blocks you have. e.g. --rcluster-max 1000
  --min-subset-size=N   This defines the minimum subset size that the kmeans
                        and rcluster algorithm will accept. Subsets smaller
                        than this  will be merged at with other subsets at the
                        end of the algorithm (for kmeans) or at the start of
                        the algorithm (for rcluster). See manual for details.
                        The default value for kmeans is 100. The default value
                        for rcluster is to ignore this option. e.g. --min-
                        subset-size 100
  --debug-output=REGION,REGION,...
                        (advanced option) Provide a list of debug regions to
                        output extra information about what the program is
                        doing. Possible regions are 'all' or any of {subset,su
                        bset_ops,raxml,parser,model_util,results,entropy,align
                        ment,threadpool,progress,main,config,reporter,kmeans,n
                        eighbour,morph_tige,analysis_m,util,scheme,submodels,d
                        atabase,analysis,phyml,raxml_mode,model_load,phyml_mod
                        e,sklearn}.
  --all-states          In the kmeans and rcluster algorithms, this stipulates
                        that PartitionFinder should not produce subsets that
                        do not have all possible states present. E.g. for DNA
                        sequence data, all subsets in the final scheme must
                        have A, C, T,  and G nucleotides present. This can
                        occasionally be useful for downstream  analyses,
                        particularly concerning amino acid datasets.
  --profile             Output profiling information after running (this will
                        slow everything down!)


python $EBROOTPARTITIONFINDER/PartitionFinderProtein.py --help

Usage: python PartitionFinderProtein.py [options] <foldername>

    PartitionFinder and PartitionFinderProtein are designed to discover optimal
    partitioning schemes for nucleotide and amino acid sequence alignments.
    They are also useful for finding the best model of sequence evolution for datasets.

    The Input: <foldername>: the full path to a folder containing:
        - A configuration file (partition_finder.cfg)
        - A nucleotide/aa alignment in Phylip format
    Take a look at the included 'example' folder for more details.

    The Output: A file in the same directory as the .cfg file, named
    'analysis' This file contains information on the best
    partitioning scheme, and the best model for each partiiton

    Usage Examples:
        >python PartitionFinderProtein.py example
        Analyse what is in the 'example' sub-folder in the current folder.

        >python PartitionFinderProtein.py -v example
        Analyse what is in the 'example' sub-folder in the current folder, but
        show all the debug output

        >python PartitionFinderProtein.py -c ~/data/frogs
        Check the configuration files in the folder data/frogs in the current
        user's home folder.

        >python PartitionFinderProtein.py --force-restart ~/data/frogs
        Deletes any data produced by the previous runs (which is in
        ~/data/frogs/output) and starts afresh
    

Options:
  -h, --help            show this help message and exit
  -v, --verbose         show debug logging information (equivalent to --debug-
                        out=all)
  -c, --check-only      just check the configuration files, don't do any
                        processing
  -f, --force-restart   delete all previous output and start afresh (!)
  -p N, --processes=N   Number of concurrent processes to use. Use -1 to match
                        the number of cpus on the machine. The default is to
                        use -1.
  --show-python-exceptions
                        If errors occur, print the python exceptions
  --save-phylofiles     save all of the phyml or raxml output. This can take a
                        lot of space(!)
  --dump-results        Dump all results to a binary file. This is only of use
                        for testing purposes.
  --compare-results     Compare the results to previously dumped binary
                        results. This is only of use for testing purposes.
  -q, --quick           Avoid anything slow (like writing schemes at each
                        step),useful for very large datasets.
  -r, --raxml           Use RAxML (rather than PhyML) to do the analysis. See
                        the manual
  -n, --no-ml-tree      Estimate a starting tree with NJ (PhyML) or MP (RaxML)
                        instead of the default which is to estimate a starting
                        tree with ML  using in RAxML. Not recommended.
  --cmdline-extras=N    Add additional commands to the phyml or raxml
                        commandlines that PF uses.This can be useful e.g. if
                        you want to change the accuracy of lnL calculations
                        ('-e' option in raxml), or use multi-threaded versions
                        of raxml that require you to specify the number of
                        threads you will let raxml use ('-T' option in raxml.
                        E.g. you might specify this: --cmndline_extras ' -e
                        2.0 -T 10 ' N.B. MAKE SURE YOU PUT YOUR EXTRAS IN
                        QUOTES, and only use this command if you really know
                        what you're doing and are very familiar with raxml and
                        PartitionFinder
  --weights=N           Mainly for algorithm development. Only use it if you
                        know what you're doing.A list of weights to use in the
                        clustering algorithms. This list allows you to assign
                        different weights to: the overall rate for a subset,
                        the base/amino acid frequencies, model parameters, and
                        alpha value. This will affect how subsets are
                        clustered together. For instance: --cluster_weights
                        '1, 2, 5, 1', would weight the base freqeuncies 2x
                        more than the overall rate, the model parameters 5x
                        more, and the alpha parameter the same as the model
                        rate
  --kmeans=type         This defines which sitewise values to use: entropy or
                        tiger  --kmeans entropy: use entropies for sitewise
                        values --kmeans tiger: use TIGER rates for sitewise
                        values (only valid for Morphology)
  --rcluster-percent=N  This defines the proportion of possible schemes that
                        the relaxed clustering algorithm will consider before
                        it stops looking. The default is 10%. e.g. --rcluster-
                        percent 10.0
  --rcluster-max=N      This defines the number of possible schemes that the
                        relaxed clustering algorithm will consider before it
                        stops looking. The default is to look at the larger
                        value out of 1000, and 10 times the number of data
                        blocks you have. e.g. --rcluster-max 1000
  --min-subset-size=N   This defines the minimum subset size that the kmeans
                        and rcluster algorithm will accept. Subsets smaller
                        than this  will be merged at with other subsets at the
                        end of the algorithm (for kmeans) or at the start of
                        the algorithm (for rcluster). See manual for details.
                        The default value for kmeans is 100. The default value
                        for rcluster is to ignore this option. e.g. --min-
                        subset-size 100
  --debug-output=REGION,REGION,...
                        (advanced option) Provide a list of debug regions to
                        output extra information about what the program is
                        doing. Possible regions are 'all' or any of {subset,su
                        bset_ops,raxml,parser,model_util,results,entropy,align
                        ment,threadpool,progress,main,config,reporter,kmeans,n
                        eighbour,morph_tige,analysis_m,util,scheme,submodels,d
                        atabase,analysis,phyml,raxml_mode,model_load,phyml_mod
                        e,sklearn}.
  --all-states          In the kmeans and rcluster algorithms, this stipulates
                        that PartitionFinder should not produce subsets that
                        do not have all possible states present. E.g. for DNA
                        sequence data, all subsets in the final scheme must
                        have A, C, T,  and G nucleotides present. This can
                        occasionally be useful for downstream  analyses,
                        particularly concerning amino acid datasets.
  --profile             Output profiling information after running (this will
                        slow everything down!)


Installation

Installed from source code obtained from PartitionFinder GitHub Releases.

System

64-bit Linux