AlphaFold-Sapelo2

From Research Computing Center Wiki
Revision as of 11:27, 29 November 2021 by Shtsai (talk | contribs)


Category: Bioinformatics

Program On: Sapelo2

Version: 2.0.0, 2.0.1, 2.1.0, 2.1.1

Author / Distributor: Please see https://github.com/deepmind/alphafold

Description

From https://github.com/deepmind/alphafold: "This package provides an implementation of the inference pipeline of AlphaFold v2.0. This is a completely new model that was entered in CASP14 and published in Nature."

Running Program

Also refer to Running Jobs on Sapelo2

For more information on Environment Modules on Sapelo2 please see the Lmod page.

  • Version 2.0.0

Installed as a conda environment in /apps/gb/AlphaFold/2.0.0/

To use this version of AlphaFold, please first load the module with

ml AlphaFold/2.0.0_conda

Once you load the module, an environment variable called EBROOTALPHAFOLD is exported. It stores the AlphaFold installation path on the cluster, i.e., /apps/gb/AlphaFold/2.0.0. The bash script run_alphafold.sh is installed in $EBROOTALPHAFOLD/alphafold, and the 2.2TB of database files are in /apps/db/AlphaFold/2.0 (this is the directory that you need to use for the -d option of run_alphafold.sh).
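
As a minimal sketch of how a job script might guard against a forgotten module load, the snippet below builds the path to run_alphafold.sh from EBROOTALPHAFOLD and fails early if the variable is unset. The function name alphafold_script_path is hypothetical and is not part of the AlphaFold installation.

```bash
# Hypothetical helper (not provided by the module): print the path of
# run_alphafold.sh, or fail if the AlphaFold module has not been loaded.
alphafold_script_path() {
    if [ -z "${EBROOTALPHAFOLD:-}" ]; then
        echo "EBROOTALPHAFOLD is not set; load the AlphaFold module first" >&2
        return 1
    fi
    echo "$EBROOTALPHAFOLD/alphafold/run_alphafold.sh"
}

# Example use in a job script, after "ml AlphaFold/2.0.0_conda":
#   script="$(alphafold_script_path)" || exit 1
#   bash "$script" -d /apps/db/AlphaFold/2.0 [options]
```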

Note: This program does not work on the nodes with K20Xm GPU devices, because the CPUs on those nodes do not support AVX. If you run this program on the gpu_p partition, please request a K40 or a P100 GPU device.
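
The AVX requirement in the note above can be checked from inside a job before AlphaFold starts. The following is a sketch only, assuming a Linux node whose /proc/cpuinfo has a "flags" line; has_avx is a hypothetical helper name.

```bash
# Hypothetical helper: succeed if the first "flags" line of a cpuinfo-style
# file lists the avx flag. Reads /proc/cpuinfo unless a path is given.
has_avx() {
    cpuinfo_file="${1:-/proc/cpuinfo}"
    # -w matches "avx" as a whole word, so "avx2" on its own does not count.
    grep -m1 '^flags' "$cpuinfo_file" | grep -qw 'avx'
}

# Example use at the top of a job script:
#   has_avx || { echo "No AVX on $(hostname); request a K40 or P100 device" >&2; exit 1; }
```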


  • Version 2.0.1

Installed with EasyBuild in /apps/eb/AlphaFold/2.0.1-fosscuda-2020b/

To use this version of AlphaFold, please first load the module with

ml AlphaFold/2.0.1-fosscuda-2020b

Once you load the module, an environment variable called EBROOTALPHAFOLD is exported. It stores the AlphaFold installation path on the cluster, i.e., /apps/eb/AlphaFold/2.0.1-fosscuda-2020b. The python script run_alphafold.py is installed in $EBROOTALPHAFOLD/bin; a symbolic link called alphafold points to it and can be used to run the program. The 2.2TB of database files are in /apps/db/AlphaFold/2.0. You can export the environment variable ALPHAFOLD_DATA_DIR to set the location of the database files. For bash, use

export ALPHAFOLD_DATA_DIR=/apps/db/AlphaFold/2.0

Note: This program does not work on the nodes with K20Xm GPU devices, because the CPUs on those nodes do not support AVX. If you run this program on the gpu_p partition, please request a K40 or a P100 GPU device.


  • Version 2.1.0

Installed with EasyBuild in /apps/eb/AlphaFold/2.1.0-fosscuda-2020b/

To use this version of AlphaFold, please first load the module with

ml AlphaFold/2.1.0-fosscuda-2020b

Once you load the module, an environment variable called EBROOTALPHAFOLD is exported. It stores the AlphaFold installation path on the cluster, i.e., /apps/eb/AlphaFold/2.1.0-fosscuda-2020b. The python script run_alphafold.py is installed in $EBROOTALPHAFOLD/bin; a symbolic link called alphafold points to it and can be used to run the program. The 2.2TB of database files are in /apps/db/AlphaFold/2.1. You can export the environment variable ALPHAFOLD_DATA_DIR to set the location of the database files. For bash, use

export ALPHAFOLD_DATA_DIR=/apps/db/AlphaFold/2.1

Note: This program does not work on the nodes with K20Xm GPU devices, because the CPUs on those nodes do not support AVX. If you run this program on the gpu_p partition, please request a K40 or a P100 GPU device.


  • Version 2.1.1

Installed with EasyBuild in /apps/eb/AlphaFold/2.1.1-fosscuda-2020b/

To use this version of AlphaFold, please first load the module with

ml AlphaFold/2.1.1-fosscuda-2020b

Once you load the module, an environment variable called EBROOTALPHAFOLD is exported. It stores the AlphaFold installation path on the cluster, i.e., /apps/eb/AlphaFold/2.1.1-fosscuda-2020b. The python script run_alphafold.py is installed in $EBROOTALPHAFOLD/bin; a symbolic link called alphafold points to it and can be used to run the program. The 2.2TB of database files are in /apps/db/AlphaFold/2.1. You can export the environment variable ALPHAFOLD_DATA_DIR to set the location of the database files. For bash, use

export ALPHAFOLD_DATA_DIR=/apps/db/AlphaFold/2.1

Note: This program does not work on the nodes with K20Xm GPU devices, because the CPUs on those nodes do not support AVX. If you run this program on the gpu_p partition, please request a K40 or a P100 GPU device. This version requires a GPU device.
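
Because the 2.0.x and 2.1.x installations use different database directories, a script that switches between versions could pick the directory from the version string. The sketch below only encodes the paths listed above; alphafold_db_dir is a hypothetical helper, not something the modules provide.

```bash
# Hypothetical helper: map an AlphaFold version to the database directory
# listed on this page for that version.
alphafold_db_dir() {
    case "$1" in
        2.0.*) echo "/apps/db/AlphaFold/2.0" ;;
        2.1.*) echo "/apps/db/AlphaFold/2.1" ;;
        *) echo "unknown AlphaFold version: $1" >&2; return 1 ;;
    esac
}

# Example: set the database location for version 2.1.1.
ALPHAFOLD_DATA_DIR="$(alphafold_db_dir 2.1.1)" && export ALPHAFOLD_DATA_DIR
echo "$ALPHAFOLD_DATA_DIR"   # prints /apps/db/AlphaFold/2.1
```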


Sample job submission script (sub.sh) to run AlphaFold 2.0.0 using run_alphafold.sh in a batch job (without GPU):

#!/bin/bash
#SBATCH --job-name=alphafoldjobname       
#SBATCH --partition=batch            
#SBATCH --ntasks=1                  	
#SBATCH --cpus-per-task=4        
#SBATCH --mem=20gb                    
#SBATCH --time=120:00:00           
#SBATCH --output=%x.%j.out     
#SBATCH --error=%x.%j.err          
#SBATCH --mail-user=username@uga.edu  
#SBATCH --mail-type=ALL   

cd $SLURM_SUBMIT_DIR

ml AlphaFold/2.0.0_conda

bash $EBROOTALPHAFOLD/alphafold/run_alphafold.sh -d /apps/db/AlphaFold/2.0 [options]

An example that uses all of the required options:

bash $EBROOTALPHAFOLD/alphafold/run_alphafold.sh -d /apps/db/AlphaFold/2.0 -o ./test/ -m model_1 -f ./query.fasta -t 2020-05-14
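
The -f option of run_alphafold.sh expects a FASTA file containing one sequence. A job script could verify this before the long-running prediction starts. This is a sketch under the assumption that every record begins with a ">" header line; both function names are hypothetical.

```bash
# Hypothetical helper: count FASTA records (header lines starting with ">").
count_fasta_records() {
    # grep -c prints 0 but returns non-zero when nothing matches;
    # "|| true" keeps the function from failing in that case.
    grep -c '^>' "$1" || true
}

# Hypothetical helper: succeed only if the file holds exactly one record.
check_single_sequence() {
    n="$(count_fasta_records "$1")"
    [ "$n" = "1" ]
}

# Example use before calling run_alphafold.sh:
#   check_single_sequence ./query.fasta || { echo "query.fasta must contain one sequence" >&2; exit 1; }
```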


Sample job submission script (sub.sh) to run AlphaFold 2.0.0 using run_alphafold.sh in a batch job (with GPU):

#!/bin/bash
#SBATCH --job-name=alphafoldjobname    
#SBATCH --partition=gpu_p         
#SBATCH --ntasks=1                  	
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:K40:1
#SBATCH --mem=40gb                    
#SBATCH --time=120:00:00           
#SBATCH --output=%x.%j.out     
#SBATCH --error=%x.%j.err          
#SBATCH --mail-user=username@uga.edu  
#SBATCH --mail-type=ALL   

cd $SLURM_SUBMIT_DIR

ml AlphaFold/2.0.0_conda

bash $EBROOTALPHAFOLD/alphafold/run_alphafold.sh -d /apps/db/AlphaFold/2.0 [options]

where $EBROOTALPHAFOLD is the environment variable that stores the AlphaFold installation path on the cluster, and [options] needs to be replaced by the options (command and arguments) you want to use. Other parameters of the job, such as the maximum wall clock time, maximum memory, the number of cores per node, and the job name, need to be modified appropriately as well. If you submit the job to the gpu_p partition, you can also request a P100 device using #SBATCH --gres=gpu:P100:1.
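
To illustrate how the GPU type and the options fit into the script above, here is a sketch that prints a GPU job script for AlphaFold 2.0.0 from a template. The function write_gpu_job is hypothetical; the resource values are copied from the sample script and would still need adjusting for a real job.

```bash
# Hypothetical helper: print a GPU job script for AlphaFold 2.0.0.
# Usage: write_gpu_job <K40|P100> <run_alphafold.sh options...>
write_gpu_job() {
    gpu_type="$1"; shift
    case "$gpu_type" in
        K40|P100) ;;   # device types named on this page as usable
        *) echo "unsupported GPU type: $gpu_type (use K40 or P100)" >&2; return 1 ;;
    esac
    # \$ keeps the variables literal so they expand when the job runs.
    cat <<EOF
#!/bin/bash
#SBATCH --job-name=alphafoldjobname
#SBATCH --partition=gpu_p
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:${gpu_type}:1
#SBATCH --mem=40gb
#SBATCH --time=120:00:00
#SBATCH --output=%x.%j.out
#SBATCH --error=%x.%j.err

cd \$SLURM_SUBMIT_DIR

ml AlphaFold/2.0.0_conda

bash \$EBROOTALPHAFOLD/alphafold/run_alphafold.sh -d /apps/db/AlphaFold/2.0 $*
EOF
}

# Example: show the device request generated for a P100 job.
write_gpu_job P100 -o ./test/ -m model_1 -f ./query.fasta -t 2020-05-14 | grep -- '--gres'
# prints: #SBATCH --gres=gpu:P100:1
```

Redirecting the output to a file (for example, write_gpu_job P100 [options] > sub.sh) would give a script that can be submitted with sbatch.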

Sample job submission script (sub.sh) to run AlphaFold 2.0.1 in a batch job (with GPU):

#!/bin/bash
#SBATCH --job-name=alphafoldjobname    
#SBATCH --partition=gpu_p         
#SBATCH --ntasks=1                  	
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:P100:1
#SBATCH --mem=40gb                    
#SBATCH --time=120:00:00           
#SBATCH --output=%x.%j.out     
#SBATCH --error=%x.%j.err          
#SBATCH --mail-user=username@uga.edu  
#SBATCH --mail-type=ALL   

cd $SLURM_SUBMIT_DIR

ml AlphaFold/2.0.1-fosscuda-2020b

export ALPHAFOLD_DATA_DIR=/apps/db/AlphaFold/2.0

alphafold [options]

where [options] need to be replaced by the options (command and arguments) you want to use. Other parameters of the job, such as the maximum wall clock time, maximum memory, the number of cores per node, and the job name need to be modified appropriately as well.

An example of the options to use for the alphafold script:

alphafold --data_dir /apps/db/AlphaFold/2.0 --output_dir ./output --model_names model_1 --fasta_paths ./query.fasta --max_template_date 2021-11-17
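
According to the help text for version 2.0.1, --fasta_paths takes a comma separated list and each file name is used to name an output directory. The sketch below builds that list and rejects colliding names. It assumes names are compared without their final extension; join_fasta_paths is a hypothetical helper.

```bash
# Hypothetical helper: join FASTA paths with commas for --fasta_paths and
# fail if two files would share a name once the extension is removed.
join_fasta_paths() {
    dup="$(for f in "$@"; do
               b="$(basename "$f")"
               echo "${b%.*}"          # strip the final extension
           done | sort | uniq -d)"
    if [ -n "$dup" ]; then
        echo "duplicate FASTA name(s): $dup" >&2
        return 1
    fi
    # Joining "$*" with IFS set to a comma yields a comma separated list.
    (IFS=,; echo "$*")
}

fasta_list="$(join_fasta_paths ./a/query1.fasta ./b/query2.fasta)"
echo "$fasta_list"   # prints ./a/query1.fasta,./b/query2.fasta

# On the cluster the list could then be passed as:
#   alphafold --data_dir /apps/db/AlphaFold/2.0 --output_dir ./output --model_names model_1 --fasta_paths "$fasta_list" --max_template_date 2021-11-17
```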

Example of job submission

sbatch sub.sh 
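
On success, sbatch prints a message of the form "Submitted batch job <id>", and the sample scripts name their output files with the %x.%j pattern (job name, then job ID). The sketch below shows how the log file name follows from those two pieces. Both helper names are hypothetical, and the message used here is an illustrative string, not real output.

```bash
# Hypothetical helper: extract the job ID from sbatch's submission message.
job_id_from_sbatch() {
    # The ID is the last whitespace-separated field of the message.
    echo "$1" | awk '{ print $NF }'
}

# Hypothetical helper: expand the %x.%j.out pattern for a job name and ID.
log_file_for() {
    echo "$1.$2.out"
}

# On the cluster this would be: msg="$(sbatch sub.sh)"
msg="Submitted batch job 123456"   # illustrative message only
jobid="$(job_id_from_sbatch "$msg")"
log_file_for alphafoldjobname "$jobid"   # prints alphafoldjobname.123456.out
```

Alternatively, sbatch --parsable prints the job ID without the surrounding text.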

Documentation

Details and references are at https://github.com/deepmind/alphafold.

Version 2.0.0:

ml AlphaFold/2.0.0_conda

bash $EBROOTALPHAFOLD/alphafold/run_alphafold.sh -h

Usage: /apps/gb/AlphaFold/2.0.0_conda/alphafold/run_alphafold.sh <OPTIONS>
Required Parameters:
-d <data_dir>     Path to directory of supporting data
-o <output_dir>   Path to a directory that will store the results.
-m <model_names>  Names of models to use (a comma separated list)
-f <fasta_path>   Path to a FASTA file containing one sequence
-t <max_template_date> Maximum template release date to consider (ISO-8601 format - i.e. YYYY-MM-DD). Important if folding historical test sets
Optional Parameters:
-b <benchmark>    Run multiple JAX model evaluations to obtain a timing that excludes the compilation time, which should be more indicative of the time required for inferencing many
    proteins (default: 'False')
-g <use_gpu>      Enable NVIDIA runtime to run with GPUs (default: 'True')
-a <gpu_devices>  Comma separated list of devices to pass to 'CUDA_VISIBLE_DEVICES' (default: 'all')
-p <preset>       Choose preset model configuration - no ensembling (full_dbs) or 8 model ensemblings (casp14) (default: 'full_dbs')
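
The -t option expects a date in YYYY-MM-DD form. A job script could check the format before starting the run. This sketch validates the shape of the string only, so an impossible date such as 2021-02-31 would still pass; is_iso_date is a hypothetical helper.

```bash
# Hypothetical helper: succeed if the argument looks like YYYY-MM-DD.
# Checks the format only, not whether the day exists in that month.
is_iso_date() {
    printf '%s\n' "$1" |
        grep -Eq '^[0-9]{4}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])$'
}

# Example use before calling run_alphafold.sh:
#   is_iso_date "$max_template_date" || { echo "use YYYY-MM-DD for -t" >&2; exit 1; }
```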


Version 2.0.1: Short help options

ml AlphaFold/2.0.1-fosscuda-2020b

export ALPHAFOLD_DATA_DIR=/apps/db/AlphaFold/2.0

alphafold --helpshort
/apps/eb/jax/0.2.19-fosscuda-2020b/lib/python3.8/site-packages/absl/flags/_validators.py:203: UserWarning: Flag --preset has a non-None default value; therefore, mark_flag_as_required will pass even if flag is not specified in the command line!
  warnings.warn(
Full AlphaFold protein structure prediction script.
flags:

/apps/eb/AlphaFold/2.0.1-fosscuda-2020b/bin/alphafold:
  --[no]benchmark: Run multiple JAX model evaluations to obtain a timing that
    excludes the compilation time, which should be more indicative of the time
    required for inferencing many proteins.
    (default: 'false')
  --bfd_database_path: Path to the BFD database for use by HHblits.
    (default: '/apps/db/AlphaFold/bfd/bfd_metaclust_clu_complete_id30_c90_final_
    seq.sorted_opt')
  --data_dir: Path to directory of supporting data.
    (default: '/apps/db/AlphaFold/2.0')
  --fasta_paths: Paths to FASTA files, each containing one sequence. Paths
    should be separated by commas. All FASTA paths must have a unique basename
    as the basename is used to name the output directories for each prediction.
    (a comma separated list)
  --hhblits_binary_path: Path to the HHblits executable.
    (default: '/apps/eb/HH-suite/3.3.0-gompic-2020b/bin/hhblits')
  --hhsearch_binary_path: Path to the HHsearch executable.
    (default: '/apps/eb/HH-suite/3.3.0-gompic-2020b/bin/hhsearch')
  --jackhmmer_binary_path: Path to the JackHMMER executable.
    (default: '/apps/eb/HMMER/3.3.2-gompic-2020b/bin/jackhmmer')
  --kalign_binary_path: Path to the Kalign executable.
    (default: '/apps/eb/Kalign/3.3.1-GCCcore-10.2.0/bin/kalign')
  --max_template_date: Maximum template release date to consider. Important if
    folding historical test sets.
  --mgnify_database_path: Path to the MGnify database for use by JackHMMER.
    (default: '/apps/db/AlphaFold/mgnify/mgy_clusters.fa')
  --model_names: Names of models to use.
    (a comma separated list)
  --obsolete_pdbs_path: Path to file containing a mapping from obsolete PDB IDs
    to the PDB IDs of their replacements.
    (default: '/apps/db/AlphaFold/pdb_mmcif/obsolete.dat')
  --output_dir: Path to a directory that will store the results.
  --pdb70_database_path: Path to the PDB70 database for use by HHsearch.
    (default: '/apps/db/AlphaFold/pdb70/pdb70')
  --preset: <reduced_dbs|full_dbs|casp14>: Choose preset model configuration -
    no ensembling and smaller genetic database config (reduced_dbs), no
    ensembling and full genetic database config  (full_dbs) or full genetic
    database config and 8 model ensemblings (casp14).
    (default: 'full_dbs')
  --random_seed: The random seed for the data pipeline. By default, this is
    randomly generated. Note that even if this is set, Alphafold may still not
    be deterministic, because processes like GPU inference are nondeterministic.
    (an integer)
  --small_bfd_database_path: Path to the small version of BFD used with the
    "reduced_dbs" preset.
  --template_mmcif_dir: Path to a directory with template mmCIF structures, each
    named <pdb_id>.cif
    (default: '/apps/db/AlphaFold/pdb_mmcif/mmcif_files')
  --uniclust30_database_path: Path to the Uniclust30 database for use by
    HHblits.
    (default:
    '/apps/db/AlphaFold/uniclust30/uniclust30_2018_08/uniclust30_2018_08')
  --uniref90_database_path: Path to the Uniref90 database for use by JackHMMER.
    (default: '/apps/db/AlphaFold/uniref90/uniref90.fasta')

Try --helpfull to get a list of all flags.

Version 2.0.1: Full help options

ml AlphaFold/2.0.1-fosscuda-2020b

export ALPHAFOLD_DATA_DIR=/apps/db/AlphaFold/2.0

alphafold --helpfull
/apps/eb/jax/0.2.19-fosscuda-2020b/lib/python3.8/site-packages/absl/flags/_validators.py:203: UserWarning: Flag --preset has a non-None default value; therefore, mark_flag_as_required will pass even if flag is not specified in the command line!
  warnings.warn(
Full AlphaFold protein structure prediction script.
flags:

/apps/eb/AlphaFold/2.0.1-fosscuda-2020b/bin/alphafold:
  --[no]benchmark: Run multiple JAX model evaluations to obtain a timing that
    excludes the compilation time, which should be more indicative of the time
    required for inferencing many proteins.
    (default: 'false')
  --bfd_database_path: Path to the BFD database for use by HHblits.
    (default: '/apps/db/AlphaFold/bfd/bfd_metaclust_clu_complete_id30_c90_final_
    seq.sorted_opt')
  --data_dir: Path to directory of supporting data.
    (default: '/apps/db/AlphaFold/2.0')
  --fasta_paths: Paths to FASTA files, each containing one sequence. Paths
    should be separated by commas. All FASTA paths must have a unique basename
    as the basename is used to name the output directories for each prediction.
    (a comma separated list)
  --hhblits_binary_path: Path to the HHblits executable.
    (default: '/apps/eb/HH-suite/3.3.0-gompic-2020b/bin/hhblits')
  --hhsearch_binary_path: Path to the HHsearch executable.
    (default: '/apps/eb/HH-suite/3.3.0-gompic-2020b/bin/hhsearch')
  --jackhmmer_binary_path: Path to the JackHMMER executable.
    (default: '/apps/eb/HMMER/3.3.2-gompic-2020b/bin/jackhmmer')
  --kalign_binary_path: Path to the Kalign executable.
    (default: '/apps/eb/Kalign/3.3.1-GCCcore-10.2.0/bin/kalign')
  --max_template_date: Maximum template release date to consider. Important if
    folding historical test sets.
  --mgnify_database_path: Path to the MGnify database for use by JackHMMER.
    (default: '/apps/db/AlphaFold/mgnify/mgy_clusters.fa')
  --model_names: Names of models to use.
    (a comma separated list)
  --obsolete_pdbs_path: Path to file containing a mapping from obsolete PDB IDs
    to the PDB IDs of their replacements.
    (default: '/apps/db/AlphaFold/pdb_mmcif/obsolete.dat')
  --output_dir: Path to a directory that will store the results.
  --pdb70_database_path: Path to the PDB70 database for use by HHsearch.
    (default: '/apps/db/AlphaFold/pdb70/pdb70')
  --preset: <reduced_dbs|full_dbs|casp14>: Choose preset model configuration -
    no ensembling and smaller genetic database config (reduced_dbs), no
    ensembling and full genetic database config  (full_dbs) or full genetic
    database config and 8 model ensemblings (casp14).
    (default: 'full_dbs')
  --random_seed: The random seed for the data pipeline. By default, this is
    randomly generated. Note that even if this is set, Alphafold may still not
    be deterministic, because processes like GPU inference are nondeterministic.
    (an integer)
  --small_bfd_database_path: Path to the small version of BFD used with the
    "reduced_dbs" preset.
  --template_mmcif_dir: Path to a directory with template mmCIF structures, each
    named <pdb_id>.cif
    (default: '/apps/db/AlphaFold/pdb_mmcif/mmcif_files')
  --uniclust30_database_path: Path to the Uniclust30 database for use by
    HHblits.
    (default:
    '/apps/db/AlphaFold/uniclust30/uniclust30_2018_08/uniclust30_2018_08')
  --uniref90_database_path: Path to the Uniref90 database for use by JackHMMER.
    (default: '/apps/db/AlphaFold/uniref90/uniref90.fasta')

absl.app:
  -?,--[no]help: show this help
    (default: 'false')
  --[no]helpfull: show full help
    (default: 'false')
  --[no]helpshort: show this help
    (default: 'false')
  --[no]helpxml: like --helpfull, but generates XML output
    (default: 'false')
  --[no]only_check_args: Set to true to validate args and exit.
    (default: 'false')
  --[no]pdb: Alias for --pdb_post_mortem.
    (default: 'false')
  --[no]pdb_post_mortem: Set to true to handle uncaught exceptions with PDB post
    mortem.
    (default: 'false')
  --profile_file: Dump profile information to a file (for python -m pstats).
    Implies --run_with_profiling.
  --[no]run_with_pdb: Set to true for PDB debug mode
    (default: 'false')
  --[no]run_with_profiling: Set to true for profiling the script. Execution will
    be slower, and the output format might change over time.
    (default: 'false')
  --[no]use_cprofile_for_profiling: Use cProfile instead of the profile module
    for profiling. This has no effect unless --run_with_profiling is set.
    (default: 'true')

absl.logging:
  --[no]alsologtostderr: also log to stderr?
    (default: 'false')
  --log_dir: directory to write logfiles into
    (default: '')
  --logger_levels: Specify log level of loggers. The format is a CSV list of
    `name:level`. Where `name` is the logger name used with
    `logging.getLogger()`, and `level` is a level name  (INFO, DEBUG, etc). e.g.
    `myapp.foo:INFO,other.logger:DEBUG`
    (default: '')
  --[no]logtostderr: Should only log to stderr?
    (default: 'false')
  --[no]showprefixforinfo: If False, do not prepend prefix to info messages when
    it's logged to stderr, --verbosity is set to INFO level, and python logging
    is used.
    (default: 'true')
  --stderrthreshold: log messages at this level, or more severe, to stderr in
    addition to the logfile.  Possible values are 'debug', 'info', 'warning',
    'error', and 'fatal'.  Obsoletes --alsologtostderr. Using --alsologtostderr
    cancels the effect of this flag. Please also note that this flag is subject
    to --verbosity and requires logfile not be stderr.
    (default: 'fatal')
  -v,--verbosity: Logging verbosity level. Messages logged at this level or
    lower will be included. Set to 1 for debug logging. If the flag was not set
    or supplied, the value will be changed from the default of -1 (warning) to 0
    (info) after flags are parsed.
    (default: '-1')
    (an integer)

absl.testing.absltest:
  --test_random_seed: Random seed for testing. Some test frameworks may change
    the default value of this flag between runs, so it is not appropriate for
    seeding probabilistic tests.
    (default: '301')
    (an integer)
  --test_randomize_ordering_seed: If positive, use this as a seed to randomize
    the execution order for test cases. If "random", pick a random seed to use.
    If 0 or not set, do not randomize test case execution order. This flag also
    overrides the TEST_RANDOMIZE_ORDERING_SEED environment variable.
    (default: '')
  --test_srcdir: Root of directory tree where source files live
    (default: '')
  --test_tmpdir: Directory for temporary testing files
    (default: '/tmp/absl_testing')
  --xml_output_file: File to store XML test results
    (default: '')

tensorflow.python.ops.parallel_for.pfor:
  --[no]op_conversion_fallback_to_while_loop: DEPRECATED: Flag is ignored.
    (default: 'true')

tensorflow.python.tpu.client.client:
  --[no]hbm_oom_exit: Exit the script when the TPU HBM is OOM.
    (default: 'true')
  --[no]runtime_oom_exit: Exit the script when the TPU runtime is OOM.
    (default: 'true')

absl.flags:
  --flagfile: Insert flag definitions from the given file into the command line.
    (default: '')
  --undefok: comma-separated list of flag names that it is okay to specify on
    the command line even if the program does not define a flag with that name.
    IMPORTANT: flags in this list that have arguments MUST use the --flag=value
    format.
    (default: '')


Installation

  • Version 2.0.0: Installed as a conda environment.
  • Version 2.0.1: Installed using EasyBuild.
  • Version 2.1.0: Installed using EasyBuild.
  • Version 2.1.1: Installed using EasyBuild.
  • The database files are installed in /apps/db/AlphaFold/

System

64-bit Linux