AlphaFold3-Sapelo2

From Research Computing Center Wiki


Category

Bioinformatics

Program On

Sapelo2

Version

3.0.0

Author / Distributor

Please see https://github.com/google-deepmind/alphafold3

Description

From https://github.com/google-deepmind/alphafold3: "This package provides an implementation of the inference pipeline of AlphaFold 3."

Running Program

Also refer to Running Jobs on Sapelo2

  • Version 3.0.0

This version is installed as a singularity container:

/apps/singularity-images/alphafold-3.0.0.sif

You can view the documentation for this version of AlphaFold with the following command, on an interactive (i.e., NOT a login/submit) node:

singularity exec /apps/singularity-images/alphafold-3.0.0.sif python /app/alphafold/run_alphafold.py --helpfull

Note that AlphaFold 3 requires a GPU and is also CPU and memory intensive: it will only run on the A100 and H100 nodes, and it needs many CPUs to finish in a reasonable time (see the sample job submission script below).

AlphaFold 3 also depends on a set of database files and on a file containing the model parameters. The current versions of the database files are in /db/AlphaFold3/20241114. The model parameters are currently available only directly from Google, and can be obtained by completing the form found here: https://docs.google.com/forms/d/e/1FAIpQLSfWZAgo1aYk0O4MuAXZj8xRQ8DafeFJnldNOnh_13qAx2ceZw/viewform. Once you have obtained the model parameters, place them somewhere in your storage space on Sapelo2 (/home or /work is recommended). The model parameters, the database files, the location of the input file, and the output directory must each be passed to the container with the -B (or --bind) option at runtime (see the example below).

Also note that although AlphaFold 3 was created for CUDA v12.6 and Sapelo2 currently has only CUDA v12.2, AlphaFold 3 still works: it prints a message noting that it is running on an older version of CUDA and then proceeds to run.
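
Because a missing bind source only shows up once the job starts running, it can help to check the host paths before submitting. The following is a minimal sketch of such a check. The function name check_bind_sources and the job script name sub.sh are hypothetical, and the /Path/to/... values are placeholders to replace with your own.

```shell
#!/bin/bash
# Hypothetical helper: verify that every host path to be passed to --bind exists.
# Prints each missing path to stderr and returns non-zero if any is missing.
check_bind_sources() {
    local missing=0
    local path
    for path in "$@"; do
        if [ ! -e "$path" ]; then
            echo "Missing bind source: $path" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example usage (placeholder paths; sub.sh is a hypothetical job script name):
# check_bind_sources /Path/to/input_directory /Path/to/output_directory \
#     /Path/to/model_parameters /db/AlphaFold3/20241114 && sbatch sub.sh
```

Chaining the check with && means the job is submitted only when all four paths are present.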


Sample job submission script to run the singularity container for v3.0.0 on a GPU:

#!/bin/bash
#SBATCH --job-name=alphafold3			#Name your job something original
#SBATCH --partition=gpu_p			#Use the GPU partition
#SBATCH --ntasks=1		
#SBATCH --cpus-per-task=32			#If you use the default options, AlphaFold3 will run four simultaneous Jackhmmer processes with 8 CPUs each
#SBATCH --gres=gpu:1				#If you don’t care whether your job uses an A100 node or an H100 node (and there isn’t much difference in run time)…
#SBATCH --constraint=Milan|SapphireRapids	#…this is the easiest way to specify either one without accidentally using a P100 or L4, which lack sufficient device memory
#SBATCH --mem=60gb
#SBATCH --time=120:00:00
#SBATCH --output=%x.%j.out     
#SBATCH --error=%x.%j.err          

cd "$SLURM_SUBMIT_DIR"

singularity exec \
     --nv \
     --bind /Path/to/input_directory:/root/af_input \
     --bind /Path/to/output_directory:/root/af_output \
     --bind /Path/to/model_parameters:/root/models \
     --bind /db/AlphaFold3/20241114:/root/public_databases \
     /apps/singularity-images/alphafold-3.0.0.sif \
     python /app/alphafold/run_alphafold.py \
     --json_path=/root/af_input/NAME_OF_INPUT_FILE.json \
     --model_dir=/root/models \
     --db_dir=/root/public_databases \
     --output_dir=/root/af_output
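
The CPU request in the script follows from the comment on --cpus-per-task: four simultaneous Jackhmmer processes multiplied by the per-process CPU count, which the help output shows is controlled by --jackhmmer_n_cpu (default 8). As a rough sketch, assuming the count of four processes stated in that comment holds for your input, the request for a different per-process setting can be worked out as follows. The variable and function names are illustrative only.

```shell
#!/bin/bash
# Number of simultaneous Jackhmmer processes, taken from the comment in the
# sample script (an assumption carried over from this page, not queried
# from AlphaFold 3 itself).
JACKHMMER_PROCESSES=4

# Hypothetical helper: CPUs to request for a given --jackhmmer_n_cpu value.
cpus_to_request() {
    local per_process="$1"
    echo $(( JACKHMMER_PROCESSES * per_process ))
}

cpus_to_request 8   # prints 32, matching --cpus-per-task=32
cpus_to_request 4   # prints 16; pair with --jackhmmer_n_cpu=4 on the command line
```

Note that the help output states that going beyond 8 CPUs per Jackhmmer process provides very little additional speedup, so raising the value above the default is unlikely to pay off.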

Example input JSON file:

{
  "name": "2PV7",
  "sequences": [
    {
      "protein": {
        "id": ["A", "B"],
        "sequence": "GMRESYANENQFGFKTINSDIHKIVIVGGYGKLGGLFARYLRASGYPISILDREDWAVAESILANADVVIVSVPINLTLETIERLKPYLTENMLLADLTSVKREPLAKMLEVHTGAVLGLHPMFGADIASMAKQVVVRCDGRFPERYEWLLEQIQIWGAKIYQTNATEHDHNMTYIQALRHFSTFANGLHLSKQPINLANLLALSSPIYRLELAMIGRLFAQDAELYADIIMDKSENLAVIETLKQTYDEKQTYDEALTFFENNDRQGFIDAFHKVRDWFGDYSEQFLKESRQLLQQANDLKQG"
      }
    }
  ],
  "modelSeeds": [1],
  "dialect": "alphafold3",
  "version": 1
}
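
For more than a handful of inputs, writing the JSON by hand becomes error-prone. The sketch below prints a file in the same layout as the example above for a single protein sequence with one or more chain IDs. The function name make_af3_input and the file name in the usage comment are hypothetical, and only the fields shown in the example are emitted.

```shell
#!/bin/bash
# Hypothetical helper: print an AlphaFold 3 input JSON for one protein sequence.
# Usage: make_af3_input NAME SEQUENCE CHAIN_ID [CHAIN_ID ...]
make_af3_input() {
    local name="$1" sequence="$2"
    shift 2
    # Build the quoted, comma-separated chain ID list, e.g. "A", "B"
    local ids="" id
    for id in "$@"; do
        ids="${ids:+$ids, }\"$id\""
    done
    cat <<EOF
{
  "name": "$name",
  "sequences": [
    {
      "protein": {
        "id": [$ids],
        "sequence": "$sequence"
      }
    }
  ],
  "modelSeeds": [1],
  "dialect": "alphafold3",
  "version": 1
}
EOF
}

# Example usage (short placeholder sequence, hypothetical file name):
# make_af3_input my_dimer GMRESYANENQ A B > my_dimer.json
```

The helper does no escaping, so it is only suitable for names and sequences made of plain letters, digits, and underscores.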

Note: the homodimer above runs in under an hour with 32 CPUs on both the H100 and A100 nodes. If your job sits in the queue for long periods of time, consider requesting significantly less than 120 hours, especially if your protein is small or is otherwise well characterized and thus likely to align to the databases relatively quickly.
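
One way to shorten the time spent on a GPU node is to split the run into its two stages, using the --norun_inference and --norun_data_pipeline forms of the flags listed in the full help output. The sketch below is an illustration, not a tested Sapelo2 workflow. It assumes, following the upstream AlphaFold 3 documentation, that the data pipeline (MSA and template search) does not need a GPU and that it writes a JSON file ending in _data.json under the output directory for the inference stage to read. The function names af3_stage_flag and run_af3_stage, JOB_NAME, and all /Path/to/... values are placeholders.

```shell
#!/bin/bash
# Hypothetical helper: map a stage name to the run_alphafold.py flag that
# switches off the other stage. Both flags are listed in the --helpfull output.
af3_stage_flag() {
    case "$1" in
        data)      echo "--norun_inference" ;;      # data pipeline only
        inference) echo "--norun_data_pipeline" ;;  # inference only
        *)         echo "Unknown stage: $1" >&2; return 1 ;;
    esac
}

# Sketch of running one stage. All /Path/to/... values are placeholders.
# $1 = stage (data|inference); $2 = container-side path of the input JSON
run_af3_stage() {
    local stage="$1" json_path="$2" flag
    flag=$(af3_stage_flag "$stage") || return 1
    local gpu_opt=()
    if [ "$stage" = "inference" ]; then
        gpu_opt=(--nv)   # GPU passthrough is only needed for inference
    fi
    singularity exec "${gpu_opt[@]}" \
        --bind /Path/to/input_directory:/root/af_input \
        --bind /Path/to/output_directory:/root/af_output \
        --bind /Path/to/model_parameters:/root/models \
        --bind /db/AlphaFold3/20241114:/root/public_databases \
        /apps/singularity-images/alphafold-3.0.0.sif \
        python /app/alphafold/run_alphafold.py \
        --json_path="$json_path" \
        --model_dir=/root/models \
        --db_dir=/root/public_databases \
        --output_dir=/root/af_output \
        "$flag"
}

# Stage 1, in a CPU-only job:
# run_af3_stage data /root/af_input/NAME_OF_INPUT_FILE.json
# Stage 2, in a GPU job, reading the JSON written by stage 1 (assumed to end
# in _data.json under the output directory; JOB_NAME is a placeholder):
# run_af3_stage inference /root/af_output/JOB_NAME/JOB_NAME_data.json
```

With this split, only the second job needs the --gres=gpu:1 and --constraint lines from the sample submission script.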


Documentation

Details and references are at: https://github.com/google-deepmind/alphafold3

Version 3.0.0: Full help options

$ singularity exec /apps/singularity-images/alphafold-3.0.0.sif python /app/alphafold/run_alphafold.py --helpfull
AlphaFold 3 structure prediction script.

AlphaFold 3 source code is licensed under CC BY-NC-SA 4.0. To view a copy of
this license, visit https://creativecommons.org/licenses/by-nc-sa/4.0/

To request access to the AlphaFold 3 model parameters, follow the process set
out at https://github.com/google-deepmind/alphafold3. You may only use these
if received directly from Google. Use is subject to terms of use available at
https://github.com/google-deepmind/alphafold3/blob/main/WEIGHTS_TERMS_OF_USE.md

flags:

/app/alphafold/run_alphafold.py:
  --buckets: Strictly increasing order of token sizes for which to cache compilations. For any input with more tokens than the largest bucket size, a new bucket is created for exactly that number of tokens.
    (default: '256,512,768,1024,1280,1536,2048,2560,3072,3584,4096,4608,5120')
    (a comma separated list)
  --db_dir: Path to the directory containing the databases.
    (default: '/home/MyID/public_databases')
  --flash_attention_implementation: <triton|cudnn|xla>: Flash attention implementation to use. 'triton' and 'cudnn' uses a Triton and cuDNN flash attention implementation, respectively. The Triton kernel is fastest and has been tested more thoroughly. The Triton and cuDNN
    kernels require Ampere GPUs or later. 'xla' uses an XLA attention implementation (no flash attention) and is portable across GPU devices.
    (default: 'triton')
  --hmmalign_binary_path: Path to the Hmmalign binary.
    (default: '/hmmer/bin/hmmalign')
  --hmmbuild_binary_path: Path to the Hmmbuild binary.
    (default: '/hmmer/bin/hmmbuild')
  --hmmsearch_binary_path: Path to the Hmmsearch binary.
    (default: '/hmmer/bin/hmmsearch')
  --input_dir: Path to the directory containing input JSON files.
  --jackhmmer_binary_path: Path to the Jackhmmer binary.
    (default: '/hmmer/bin/jackhmmer')
  --jackhmmer_n_cpu: Number of CPUs to use for Jackhmmer. Default to min(cpu_count, 8). Going beyond 8 CPUs provides very little additional speedup.
    (default: '8')
    (an integer)
  --jax_compilation_cache_dir: Path to a directory for the JAX compilation cache.
  --json_path: Path to the input JSON file.
  --mgnify_database_path: Mgnify database path, used for protein MSA search.
    (default: '${DB_DIR}/mgy_clusters_2022_05.fa')
  --model_dir: Path to the model to use for inference.
    (default: '/home/MyID/models')
  --nhmmer_binary_path: Path to the Nhmmer binary.
    (default: '/hmmer/bin/nhmmer')
  --nhmmer_n_cpu: Number of CPUs to use for Nhmmer. Default to min(cpu_count, 8). Going beyond 8 CPUs provides very little additional speedup.
    (default: '8')
    (an integer)
  --ntrna_database_path: NT-RNA database path, used for RNA MSA search.
    (default: '${DB_DIR}/nt_rna_2023_02_23_clust_seq_id_90_cov_80_rep_seq.fasta')
  --output_dir: Path to a directory where the results will be saved.
  --pdb_database_path: PDB database directory with mmCIF files path, used for template search.
    (default: '${DB_DIR}/pdb_2022_09_28_mmcif_files.tar')
  --rfam_database_path: Rfam database path, used for RNA MSA search.
    (default: '${DB_DIR}/rfam_14_9_clust_seq_id_90_cov_80_rep_seq.fasta')
  --rna_central_database_path: RNAcentral database path, used for RNA MSA search.
    (default: '${DB_DIR}/rnacentral_active_seq_id_90_cov_80_linclust.fasta')
  --[no]run_data_pipeline: Whether to run the data pipeline on the fold inputs.
    (default: 'true')
  --[no]run_inference: Whether to run inference on the fold inputs.
    (default: 'true')
  --seqres_database_path: PDB sequence database path, used for template search.
    (default: '${DB_DIR}/pdb_seqres_2022_09_28.fasta')
  --small_bfd_database_path: Small BFD database path, used for protein MSA search.
    (default: '${DB_DIR}/bfd-first_non_consensus_sequences.fasta')
  --uniprot_cluster_annot_database_path: UniProt database path, used for protein paired MSA search.
    (default: '${DB_DIR}/uniprot_all_2021_04.fa')
  --uniref90_database_path: UniRef90 database path, used for MSA search. The MSA obtained by searching it is used to construct the profile for template search.
    (default: '${DB_DIR}/uniref90_2022_05.fa')

absl.app:
  -?,--[no]help: show this help
    (default: 'false')
  --[no]helpfull: show full help
    (default: 'false')
  --[no]helpshort: show this help
    (default: 'false')
  --[no]helpxml: like --helpfull, but generates XML output
    (default: 'false')
  --[no]only_check_args: Set to true to validate args and exit.
    (default: 'false')
  --[no]pdb: Alias for --pdb_post_mortem.
    (default: 'false')
  --[no]pdb_post_mortem: Set to true to handle uncaught exceptions with PDB post mortem.
    (default: 'false')
  --profile_file: Dump profile information to a file (for python -m pstats). Implies --run_with_profiling.
  --[no]run_with_pdb: Set to true for PDB debug mode
    (default: 'false')
  --[no]run_with_profiling: Set to true for profiling the script. Execution will be slower, and the output format might change over time.
    (default: 'false')
  --[no]use_cprofile_for_profiling: Use cProfile instead of the profile module for profiling. This has no effect unless --run_with_profiling is set.
    (default: 'true')

absl.logging:
  --[no]alsologtostderr: also log to stderr?
    (default: 'false')
  --log_dir: directory to write logfiles into
    (default: '')
  --logger_levels: Specify log level of loggers. The format is a CSV list of `name:level`. Where `name` is the logger name used with `logging.getLogger()`, and `level` is a level name  (INFO, DEBUG, etc). e.g. `myapp.foo:INFO,other.logger:DEBUG`
    (default: '')
  --[no]logtostderr: Should only log to stderr?
    (default: 'false')
  --[no]showprefixforinfo: If False, do not prepend prefix to info messages when it's logged to stderr, --verbosity is set to INFO level, and python logging is used.
    (default: 'true')
  --stderrthreshold: log messages at this level, or more severe, to stderr in addition to the logfile.  Possible values are 'debug', 'info', 'warning', 'error', and 'fatal'.  Obsoletes --alsologtostderr. Using --alsologtostderr cancels the effect of this flag. Please also
    note that this flag is subject to --verbosity and requires logfile not be stderr.
    (default: 'fatal')
  -v,--verbosity: Logging verbosity level. Messages logged at this level or lower will be included. Set to 1 for debug logging. If the flag was not set or supplied, the value will be changed from the default of -1 (warning) to 0 (info) after flags are parsed.
    (default: '-1')
    (an integer)

absl.testing.absltest:
  --test_random_seed: Random seed for testing. Some test frameworks may change the default value of this flag between runs, so it is not appropriate for seeding probabilistic tests.
    (default: '301')
    (an integer)
  --test_randomize_ordering_seed: If positive, use this as a seed to randomize the execution order for test cases. If "random", pick a random seed to use. If 0 or not set, do not randomize test case execution order. This flag also overrides the
    TEST_RANDOMIZE_ORDERING_SEED environment variable.
    (default: '')
  --test_srcdir: Root of directory tree where source files live
    (default: '')
  --test_tmpdir: Directory for temporary testing files
    (default: '/tmp/absl_testing')
  --xml_output_file: File to store XML test results
    (default: '')

chex._src.fake:
  --[no]chex_assert_multiple_cpu_devices: Whether to fail if a number of CPU devices is less than 2.
    (default: 'false')
  --chex_n_cpu_devices: Number of CPU threads to use as devices in tests.
    (default: '1')
    (an integer)

chex._src.variants:
  --[no]chex_skip_pmap_variant_if_single_device: Whether to skip pmap variant if only one device is available.
    (default: 'true')

absl.flags:
  --flagfile: Insert flag definitions from the given file into the command line.
    (default: '')
  --undefok: comma-separated list of flag names that it is okay to specify on the command line even if the program does not define a flag with that name.  IMPORTANT: flags in this list that have arguments MUST use the --flag=value format.
    (default: '')
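
The help output lists both --json_path (a single input file) and --input_dir (a directory of input JSON files). As a small sketch of how a job script might pick between them, the hypothetical helper below checks whether the host-side input is a file or a directory and prints the matching flag with the container-side path. The function name af3_input_flag is an assumption for illustration, and /root/af_input is the bind target used in the sample script.

```shell
#!/bin/bash
# Hypothetical helper: choose between --json_path and --input_dir based on
# whether the host-side input is a single file or a directory.
# $1 = host path to the input
# $2 = container path where the input directory is bound (e.g. /root/af_input)
af3_input_flag() {
    local host_path="$1" container_dir="$2"
    if [ -d "$host_path" ]; then
        echo "--input_dir=$container_dir"
    elif [ -f "$host_path" ]; then
        echo "--json_path=$container_dir/$(basename "$host_path")"
    else
        echo "Input not found: $host_path" >&2
        return 1
    fi
}

# Example usage (placeholder path):
# af3_input_flag /Path/to/input_directory /root/af_input
```

For a single file, the directory containing it is what gets bound to the container, which is why only the file's base name is appended to the container path.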
