AlphaFold-Sapelo2: Difference between revisions

From Research Computing Center Wiki
No edit summary
 
(39 intermediate revisions by 2 users not shown)
Line 8: Line 8:


=== Version ===
=== Version ===
2.0.0, 2.0.1, 2.1.0, 2.1.1, 2.2.0, 2.2.4, 2.3.1
2.2.4, 2.3.1, 2.3.4


=== Author / Distributor ===
=== Author / Distributor ===
Line 21: Line 21:
For more information on Environment Modules on Sapelo2 please see the [[Lmod]] page.
For more information on Environment Modules on Sapelo2 please see the [[Lmod]] page.


*'''Version 2.0.0'''


Installed as a conda environment in /apps/gb/AlphaFold/2.0.0/
*'''Version 2.2.4'''


To use this version of AlphaFold, please first load the module with
This version is installed as a singularity container pulled from https://hub.docker.com/r/catgumag/alphafold:
<pre class="gscript">
<pre class="gscript">
ml AlphaFold/2.0.0_conda
/apps/singularity-images/alphafold_2.2.4.sif
</pre>
</pre>
You can view the documentation for this version of AlphaFold with the following command, on an interactive node:
<pre class="gcommand">
singularity exec /apps/singularity-images/alphafold_2.2.4.sif python /app/alphafold/run_alphafold.py --helpfull
</pre>
This version works on nodes with Intel CPUs, such as the P100 GPU nodes (note that this container does not work on the A100 node, which has an AMD processor).


Once you load the module, an environmental variable called EBROOTALPHAFOLD is exported. It stores the AlphaFold installation path on the cluster, i.e., /apps/gb/AlphaFold/2.0.0. The bash script run_alphafold.sh is installed in EBROOTALPHAFOLD/alphafold, and the 2.2TB of database files are in /apps/db/AlphaFold/2.0 (this is the directory that you need to use for the -d option of run_alphafold.sh).
The database files are installed in /apps/db/AlphaFold/2.2.4. To use these in the singularity container, please add the option <code>-B /apps/db/AlphaFold </code> to the singularity exec command, as shown in the sample job submission scripts below. The <code> --nv </code> option needs to be added to enable AlphaFold to run on the GPU. This singularity container also requires the option <code>--use_gpu_relax</code> to be added.


'''Note:''' This program does not work on the nodes with K20Xm GPU devices, because the CPUs on those nodes do not support AVX. If you run this program on the gpu_p partition, please request a K40 or a P100 GPU device.


*'''Version 2.3.1 (on Intel nodes only)'''


*'''Version 2.0.1'''
This version is installed as a singularity container pulled from https://hub.docker.com/r/catgumag/alphafold:
 
Installed with EasyBuild in /apps/eb/AlphaFold/2.0.1-fosscuda-2020b/
 
To use this version of AlphaFold, please first load the module with
<pre class="gscript">
<pre class="gscript">
ml AlphaFold/2.0.1-fosscuda-2020b
/apps/singularity-images/alphafold_2.3.1_cuda112.sif
</pre>
</pre>
 
You can view the documentation for this version of AlphaFold with the following command, on an interactive node:
Once you load the module, an environmental variable called EBROOTALPHAFOLD is exported. It stores the AlphaFold installation path on the cluster, i.e., /apps/eb/AlphaFold/2.0.1-fosscuda-2020b. The python script run_alphafold.py is installed in EBROOTALPHAFOLD/bin and a symbolic link called alphafold points to it and can be used to run the program. The 2.2TB of database files are in /apps/db/AlphaFold/2.0. You can export the environment variable ALPHAFOLD_DATA_DIR to set the location of the database files. For bash, use
<pre class="gcommand">
<pre class="gscript">
singularity exec /apps/singularity-images/alphafold_2.3.1_cuda112.sif python /app/alphafold/run_alphafold.py --helpfull
export ALPHAFOLD_DATA_DIR=/apps/db/AlphaFold/2.0
</pre>
</pre>
When you load the <code>AlphaFold/2.0.1-fosscuda-2020b</code> module, this environment variable will be automatically set.
This version works on nodes with Intel CPUs, such as the P100 GPU nodes (note that this container does not work on the A100 node, which has an AMD processor).


'''Note:''' This program does not work on the nodes with K20Xm GPU devices, because the CPUs on those nodes do not support AVX. If you run this program on the gpu_p partition, please request a K40 or a P100 GPU device.
The database files are installed in /apps/db/AlphaFold/2.3.1. To use these in the singularity container, please add the option <code>-B /apps/db/AlphaFold </code> to the singularity exec command, as shown in the sample job submission scripts below. The <code> --nv </code> option needs to be added to enable AlphaFold to run on the GPU. Please also add the option <code>-B /apps/eb/CUDAcore/11.2.1 </code>, so the singularity container can link to the CUDA libraries. This singularity container also requires the option <code>--use_gpu_relax</code> to be added.
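As a minimal sketch of how these options fit together, the following assembles the singularity command line for this container and prints it. It only prints the command. The remaining AlphaFold options (input FASTA file, output directory, and so on) are not shown and would need to be appended for a real run. The variable names <code>image</code>, <code>bind_opts</code> and <code>cmd</code> are illustrative only.

```shell
#!/bin/bash
# Sketch: assemble the singularity command for the 2.3.1 container and print it.
set -eu

image=/apps/singularity-images/alphafold_2.3.1_cuda112.sif

# Bind mounts: the database files and the CUDA libraries
bind_opts="-B /apps/db/AlphaFold -B /apps/eb/CUDAcore/11.2.1"

# --nv enables the GPU; --use_gpu_relax is required by this container
cmd="singularity exec --nv $bind_opts $image python /app/alphafold/run_alphafold.py --use_gpu_relax"

# Print only; append your own AlphaFold options before running it for real
echo "$cmd [other AlphaFold options]"
```

Printing the command first makes it easy to check that both bind options and <code>--nv</code> are present before the command is placed in a job submission script.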




*'''Version 2.1.0'''
*'''Version 2.3.1'''


Installed with EasyBuild in /apps/eb/AlphaFold/2.1.0-fosscuda-2020b/
Installed with EasyBuild in /apps/eb/AlphaFold/2.3.1-foss-2022a-CUDA-11.7.0/


To use this version of AlphaFold, please first load the module with
To use this version of AlphaFold, please first load the module with
<pre class="gscript">
<pre class="gscript">
ml AlphaFold/2.1.0-fosscuda-2020b
ml AlphaFold/2.3.1-foss-2022a-CUDA-11.7.0
</pre>
</pre>


Once you load the module, an environmental variable called EBROOTALPHAFOLD is exported. It stores the AlphaFold installation path on the cluster, i.e., /apps/eb/AlphaFold/2.1.0-fosscuda-2020b. The python script run_alphafold.py is installed in EBROOTALPHAFOLD/bin and a symbolic link called alphafold points to it and can be used to run the program. The 2.2TB of database files are in /apps/db/AlphaFold/2.1. You can export the environment variable ALPHAFOLD_DATA_DIR to set the location of the database files. For bash, use
Once you load the module, an environmental variable called EBROOTALPHAFOLD is exported. It stores the AlphaFold installation path on the cluster, i.e., /apps/eb/AlphaFold/2.3.1-foss-2022a-CUDA-11.7.0. The python script run_alphafold.py is installed in EBROOTALPHAFOLD/bin and a symbolic link called alphafold points to it and can be used to run the program. The 2.2TB of database files are in /apps/db/AlphaFold/2.3.1. You can export the environment variable ALPHAFOLD_DATA_DIR to set the location of the database files. For bash, use
<pre class="gscript">
<pre class="gscript">
export ALPHAFOLD_DATA_DIR=/apps/db/AlphaFold/2.1
export ALPHAFOLD_DATA_DIR=/apps/db/AlphaFold/2.3.1
</pre>
</pre>
When you load the <code>AlphaFold/2.1.0-fosscuda-2020b</code> module, this environment variable will be automatically set.
When you load the <code>AlphaFold/2.3.1-foss-2022a-CUDA-11.7.0</code> module, this environment variable will be automatically set.


'''Note:''' This program does not work on the nodes with K20Xm GPU devices, because the CPUs on those nodes do not support AVX. If you run this program on the gpu_p partition, please request a K40 or a P100 GPU device.
'''Note:''' If you run this program on the gpu_p partition, please request a P100 or an A100 GPU device. This version requires a GPU device and it should work on the P100, V100, and A100 devices.  




*'''Version 2.1.1'''


Installed with EasyBuild in /apps/eb/AlphaFold/2.1.1-fosscuda-2020b/
*'''Version 2.3.4'''


To use this version of AlphaFold, please first load the module with
Installed with EasyBuild in /apps/eb/AlphaFold/2.3.4-foss-2022a-CUDA-11.7.0-ColabFold/
<pre class="gscript">
ml AlphaFold/2.1.1-fosscuda-2020b
</pre>
 
Once you load the module, an environmental variable called EBROOTALPHAFOLD is exported. It stores the AlphaFold installation path on the cluster, i.e., /apps/eb/AlphaFold/2.1.1-fosscuda-2020b. The python script run_alphafold.py is installed in EBROOTALPHAFOLD/bin and a symbolic link called alphafold points to it and can be used to run the program. The 2.2TB of database files are in /apps/db/AlphaFold/2.1. You can export the environment variable ALPHAFOLD_DATA_DIR to set the location of the database files. For bash, use
<pre class="gscript">
export ALPHAFOLD_DATA_DIR=/apps/db/AlphaFold/2.1
</pre>
When you load the <code>AlphaFold/2.1.1-fosscuda-2020b</code> module, this environment variable will be automatically set.
 
'''Note:''' This program does not work on the nodes with K20Xm GPU devices, because the CPUs on those nodes do not support AVX. If you run this program on the gpu_p partition, please request a K40 or a P100 GPU device. This version requires a GPU device.
 
 
*'''Version 2.2.0'''
 
Installed with EasyBuild in /apps/eb/AlphaFold/2.2.0-fosscuda-2020b/


To use this version of AlphaFold, please first load the module with
To use this version of AlphaFold, please first load the module with
<pre class="gscript">
<pre class="gscript">
ml AlphaFold/2.2.0-fosscuda-2020b
ml AlphaFold/2.3.4-foss-2022a-CUDA-11.7.0-ColabFold
</pre>
</pre>


Once you load the module, an environmental variable called EBROOTALPHAFOLD is exported. It stores the AlphaFold installation path on the cluster, i.e., /apps/eb/AlphaFold/2.2.0-fosscuda-2020b. The python script run_alphafold.py is installed in EBROOTALPHAFOLD/bin and a symbolic link called alphafold points to it and can be used to run the program. The 2.2TB of database files are in /apps/db/AlphaFold/2.2. You can export the environment variable ALPHAFOLD_DATA_DIR to set the location of the database files. For bash, use
Once you load the module, an environmental variable called EBROOTALPHAFOLD is exported. It stores the AlphaFold installation path on the cluster, i.e., /apps/eb/AlphaFold/2.3.4-foss-2022a-CUDA-11.7.0-ColabFold. The python script run_alphafold.py is installed in EBROOTALPHAFOLD/bin and a symbolic link called alphafold points to it and can be used to run the program. The 2.2TB of database files are in /apps/db/AlphaFold/2.3.4. You can export the environment variable ALPHAFOLD_DATA_DIR to set the location of the database files. For bash, use
<pre class="gscript">
<pre class="gscript">
export ALPHAFOLD_DATA_DIR=/apps/db/AlphaFold/2.2
export ALPHAFOLD_DATA_DIR=/apps/db/AlphaFold/2.3.4
</pre>
</pre>
When you load the <code>AlphaFold/2.2.0-fosscuda-2020b</code> module, this environment variable will be automatically set.
When you load the <code>AlphaFold/2.3.4-foss-2022a-CUDA-11.7.0-ColabFold</code> module, this environment variable will be automatically set.
 
'''Note:''' This program does not work on the nodes with K20Xm GPU devices, because the CPUs on those nodes do not support AVX. If you run this program on the gpu_p partition, please request a K40 or a P100 GPU device. This version requires a GPU device.
 
 
*'''Version 2.2.4'''
 
This version is installed as a singularity container pulled from https://hub.docker.com/r/catgumag/alphafold:
<pre class="gscript">
/apps/singularity-images/alphafold_2.2.4.sif
</pre>
You can view the documentation for this version of AlphaFold with the following command, on an interactive node:
<pre class="gcommand">
singularity exec /apps/singularity-images/alphafold_2.2.4.sif python /app/alphafold/run_alphafold.py --helpfull
</pre>
This version works on nodes with Intel CPUs, such as the P100 GPU nodes (note that this container does not work on the A100 node, which has an AMD processor).
 
The database files are installed in /apps/db/AlphaFold/2.2.4. To use these in the singularity container, please add the option <code>-B /apps/db/AlphaFold </code> to the singularity exec command, as shown in the sample job submission scripts below. The <code> --nv </code> option needs to be added to enable AlphaFold to run on the GPU. This singularity container also requires the option <code>--use_gpu_relax</code> to be added. 
 


*'''Version 2.3.1'''
'''Note:''' If you run this program on the gpu_p partition, please request a P100 or an A100 GPU device. This version requires a GPU device.


This version is installed as a singularity container pulled from https://hub.docker.com/r/catgumag/alphafold:
'''This version does not appear to be working now.'''
<pre class="gscript">
/apps/singularity-images/alphafold_2.3.1_cuda112.sif
</pre>
You can view the documentation for this version of AlphaFold with the following command, on an interactive node:
<pre class="gcommand">
singularity exec /apps/singularity-images/alphafold_2.3.1_cuda112.sif python /app/alphafold/run_alphafold.py --helpfull
</pre>
This version works on nodes with Intel CPUs, such as the P100 GPU nodes (note that this container does not work on the A100 node, which has an AMD processor).
 
The database files are installed in /apps/db/AlphaFold/2.3.1. To use these in the singularity container, please add the option <code>-B /apps/db/AlphaFold </code> to the singularity exec command, as shown in the sample job submission scripts below. The <code> --nv </code> option needs to be added to enable AlphaFold to run on the GPU. Please also add the option <code>-B /apps/eb/CUDAcore/11.2.1 </code>, so the singularity container can link to the CUDA libraries. This singularity container also requires the option <code>--use_gpu_relax</code> to be added.




Line 341: Line 295:




Sample job submission script (sub.sh) to run AlphaFold 2.0.0 using run_alphafold.sh in a batch job (without GPU):
Sample job submission script (sub.sh) to run AlphaFold 2.3.1 in a batch job (with GPU):


<pre class="gscript">
<pre class="gscript">
#!/bin/bash
#!/bin/bash
#SBATCH --job-name=alphafoldjobname      
#SBATCH --job-name=alphafoldjobname  
#SBATCH --partition=batch           
#SBATCH --partition=gpu_p       
#SBATCH --ntasks=1                 
#SBATCH --ntasks=1                 
#SBATCH --cpus-per-task=4       
#SBATCH --cpus-per-task=10
#SBATCH --mem=20gb                    
#SBATCH --gres=gpu:A100:1
#SBATCH --mem=40gb                    
#SBATCH --time=120:00:00           
#SBATCH --time=120:00:00           
#SBATCH --output=%x.%j.out     
#SBATCH --output=%x.%j.out     
#SBATCH --error=%x.%j.err           
#SBATCH --error=%x.%j.err           
#SBATCH --mail-user=username@uga.edu 
#SBATCH --mail-type=ALL 


cd $SLURM_SUBMIT_DIR
cd $SLURM_SUBMIT_DIR


ml AlphaFold/2.0.0_conda
ml AlphaFold/2.3.1-foss-2022a-CUDA-11.7.0


bash $EBROOTALPHAFOLD/alphafold/run_alphafold.sh -d /apps/db/AlphaFold/2.0 [options]
alphafold [options]
</pre>  
 
</pre>
 
where [options] need to be replaced by the options (command and arguments) you want to use. Other parameters of the job, such as the maximum wall clock time, maximum memory, the number of cores per node, and the job name need to be modified appropriately as well.


An example of the required options to use is
An example of the options to use for the alphafold script:
<pre class="gscript">
<pre class="gscript">
bash $EBROOTALPHAFOLD/alphafold/run_alphafold.sh -d /apps/db/AlphaFold/2.0 -o ./test/ -m model_1 -f ./query.fasta -t 2020-05-14
alphafold --data_dir /apps/db/AlphaFold/2.3.1 --output_dir ./output --model_names model_1 --fasta_paths ./query.fasta --max_template_date 2021-11-17
</pre>
</pre>


Example of job submission
<pre  class="gcommand">
sbatch sub.sh
</pre>
=== A method to accelerate your calculations ===
If you have many AlphaFold calculations, i.e., need to process many FASTA files, you can run them using an array job (see [[Array_Jobs|here]]).


Sample job submission script (sub.sh) to run AlphaFold 2.0.0 using run_alphafold.sh in a batch job (with GPU):
At the same time you can greatly speed up your calculations by running two separate jobs using the CPU and GPU nodes sequentially:
 
1. Run the first step of AlphaFold (MSA generation) on the CPU nodes on the batch partition.
 
2. Then run the second part of AlphaFold (structural modeling) using the '''--use_precomputed_msas=true''' option on the GPU nodes on the gpu_p partition. If the computation in this part takes less than 4 hours, you can run the job using the V100 scavenge GPU nodes via batch partition (see [[GPU|here]]).
 
To achieve this, you need a way to stop the running element jobs (or elements) in step 1 once they have completed the MSA generation.
 
 
With help from a GACRC user, we have prepared a shell script called '''check_and_stop_elements.sh''':
 
1. This script will check for the presence of the "stop string", i.e., '''Running model model_1_multimer_v3_pred_0''', in the .err file of each running element you started in step 1.
 
2. Run this script in an interactive session (see [[Running_Jobs_on_Sapelo2#How_to_open_an_interactive_session|here]]) or in an OOD X Desktop session (see [[OnDemand#X_Desktop_Session_.28A.K.A._The_Interactive_X_login_app.29|here]]) to monitor your elements while they are running.
 
3. It checks your running elements every 5 minutes. If the "stop string" is found in the .err file of a running element, the script automatically cancels that element for you, so that you can start step 2 as early as possible.
 
4. You can stop the script (i.e., stop monitoring the elements in step 1) by pressing Ctrl + C on your keyboard.
 
 
As a quick demo, below is an example job submission script (sub_step1.sh) for running step 1 on batch:


<pre class="gscript">
<pre class="gscript">
#!/bin/bash
#!/bin/bash
#SBATCH --job-name=alphafoldjobname   
#SBATCH --job-name=alphaMSAs
#SBATCH --partition=gpu_p       
#SBATCH --partition=batch
#SBATCH --ntasks=1                
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:K40:1
#SBATCH --mem=32gb
#SBATCH --mem=40gb                   
#SBATCH --time=12:00:00
#SBATCH --time=120:00:00          
#SBATCH --output=%x.%j.out     
#SBATCH --output=%x.%j.out     
#SBATCH --error=%x.%j.err           
#SBATCH --error=%x.%j.err           
#SBATCH --mail-user=username@uga.edu 
#SBATCH --array=1-200
#SBATCH --mail-type=ALL 


cd $SLURM_SUBMIT_DIR
cd $SLURM_SUBMIT_DIR


ml AlphaFold/2.0.0_conda
ml purge
ml AlphaFold/2.3.1-foss-2022a-CUDA-11.7.0
export ALPHAFOLD_DATA_DIR=/apps/db/AlphaFold/2.3.1
 
file=$(awk "NR==${SLURM_ARRAY_TASK_ID}" input.lst)
 
alphafold \
--run_relax=False \
--data_dir=$ALPHAFOLD_DATA_DIR \
--uniref90_database_path=$ALPHAFOLD_DATA_DIR/uniref90/uniref90.fasta \
--mgnify_database_path=$ALPHAFOLD_DATA_DIR/mgnify/mgy_clusters.fa \
--bfd_database_path=$ALPHAFOLD_DATA_DIR/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
--uniref30_database_path=$ALPHAFOLD_DATA_DIR/uniref30/UniRef30_2021_03 \
--pdb_seqres_database_path=$ALPHAFOLD_DATA_DIR/pdb_seqres/pdb_seqres.txt \
--template_mmcif_dir=$ALPHAFOLD_DATA_DIR/pdb_mmcif/mmcif_files \
--obsolete_pdbs_path=$ALPHAFOLD_DATA_DIR/pdb_mmcif/obsolete.dat \
--uniprot_database_path=$ALPHAFOLD_DATA_DIR/uniprot/uniprot.fasta \
--model_preset=multimer \
--num_multimer_predictions_per_model=1 \
--max_template_date=2023-10-01 \
--db_preset=full_dbs \
--output_dir=./outputs/$(basename $file .fa) \
--fasta_paths=./inputs/$file
</pre>


bash $EBROOTALPHAFOLD/alphafold/run_alphafold.sh -d /apps/db/AlphaFold/2.0 [options]
The above input.lst is a single-column text file storing the names of your FASTA files (please see [[Array_Jobs#Non-Numbered_Input_Files|here]]), for example:
<pre class="gscript">
head -n 5 input.lst


PFago.v8prot.zeyw.fa
PFago.v8prot.zeyx.fa
PFago.v8prot.zeyy.fa
PFago.v8prot.zeyz.fa
PFago.v8prot.zeza.fa
</pre>
</pre>


where $EBROOTALPHAFOLD is the environmental variable that stores the AlphaFold installation path on the cluster; [options] need to be replaced by the options (command and arguments) you want to use. Other parameters of the job, such as the maximum wall clock time, maximum memory, the number of cores per node, and the job name need to be modified appropriately as well. You can also request a P100 device, using <code>#SBATCH --gres=gpu:P100:1</code> if you submit the job to the gpu_p partition.
and those FASTA files are stored in a folder called inputs in your current job working folder (--fasta_paths=./inputs/$file).
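One possible way to create input.lst is sketched below. It assumes that all FASTA files end in .fa and sit directly in the inputs folder. The demo creates a temporary folder with three placeholder files (seqA.fa, seqB.fa, seqC.fa, hypothetical names used only for illustration) so that it can be run anywhere. In your own job working folder, only the <code>find</code> line and the line count are needed.

```shell
#!/bin/bash
# Sketch: build input.lst from the FASTA files in ./inputs (assumed to end in .fa).
set -eu

# Demo folder with placeholder FASTA files; on the cluster, use your job working folder
workdir=$(mktemp -d)
cd "$workdir"
mkdir -p inputs
for name in seqA seqB seqC; do
  printf '>%s\nMKT\n' "$name" > "inputs/$name.fa"
done

# One file name per line (no directory part), sorted for a reproducible order
find inputs -maxdepth 1 -type f -name '*.fa' -exec basename {} \; | sort > input.lst

# The number of lines is the upper bound to use in --array=1-N
num_files=$(wc -l < input.lst | tr -d ' ')
echo "Use: #SBATCH --array=1-${num_files}"

# Emulate the selection made by array element 2 in the job script
SLURM_ARRAY_TASK_ID=2
file=$(awk "NR==${SLURM_ARRAY_TASK_ID}" input.lst)
echo "Element ${SLURM_ARRAY_TASK_ID}: ./inputs/${file} -> ./outputs/$(basename "$file" .fa)"
```

The line count reported here should match the upper bound of the <code>--array</code> range in the job submission script, so that every FASTA file is assigned to exactly one element.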




Sample job submission script (sub.sh) to run AlphaFold 2.0.1 in a batch job (with GPU):
Below is the shell script check_and_stop_elements.sh. You are welcome to copy it for your use:


<pre class="gscript">
<pre class="gscript">
#!/bin/bash
#!/bin/bash
#SBATCH --job-name=alphafoldjobname   
#SBATCH --partition=gpu_p       
#SBATCH --ntasks=1                 
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:P100:1
#SBATCH --mem=40gb                   
#SBATCH --time=120:00:00         
#SBATCH --output=%x.%j.out   
#SBATCH --error=%x.%j.err         
#SBATCH --mail-user=username@uga.edu 
#SBATCH --mail-type=ALL 


cd $SLURM_SUBMIT_DIR
ArrayID=$1
 
while true; do
  numOfRunning=$(squeue -l -j $ArrayID | grep RUNNING | wc -l)
  echo
  echo "Pending elements in array job $ArrayID:"
  squeue -l -j $ArrayID | grep PENDING
  echo
  echo "Number of running elements in array job $ArrayID:"
  echo "$numOfRunning"
  echo
  echo "Checking the stop string \"Running model model_1_multimer_v3_pred_0\" in alphaMSAs.*.err for each running element:"
  echo
 
  for j in $(squeue -l -j $ArrayID | grep RUNNING | awk '{print $1}')
  do
    jobid=$(scontrol show job $j | grep JobId= | awk '{print $1}' | sed -nr 's|JobId=()|\1|p')
    echo -n "Checking alphaMSAs.${jobid}.err ... "
    grep -m 1 "Running model model_1_multimer_v3_pred_0" alphaMSAs.${jobid}.err > /dev/null 2>&1
    if [ $? -eq 0 ]; then
      echo -en "\e[31mthe stop string is found!\e[0m ... go to cancel job ${jobid} ($j) ... scancel $j ... \e[32mjob canceled.\e[0m\n"
      scancel $j
    else
      echo -e "the stop string is NOT found!"
    fi
  done
  sleep 300
  clear
done
</pre>
 
Before you can run it, please make it executable with:


ml AlphaFold/2.0.1-fosscuda-2020b
<pre class="gscript">
chmod 755 ./check_and_stop_elements.sh
</pre>


alphafold [options]
Then open an interactive session or an OOD X Desktop session and run the script to monitor your elements while they are running on compute nodes:


<pre class="gscript">
./check_and_stop_elements.sh <ARRAY_JOB_ID>
</pre>
</pre>


where [options] need to be replaced by the options (command and arguments) you want to use. Other parameters of the job, such as the maximum wall clock time, maximum memory, the number of cores per node, and the job name need to be modified appropriately as well.
Please note that ARRAY_JOB_ID above is the ID of your array job. For example, if sq --me or sacct-gacrc -X reports a JOBID of 28026197_1, then 28026197 is the ID of your array job, and the above command will be:


An example of the options to use for the alphafold script:
<pre class="gscript">
<pre class="gscript">
alphafold --data_dir /apps/db/AlphaFold/2.0 --output_dir ./output --model_names model_1 --fasta_paths ./query.fasta --max_template_date 2021-11-17
./check_and_stop_elements.sh 28026197
</pre>
</pre>




Sample job submission script (sub.sh) to run AlphaFold 2.1.1 in a batch job (with GPU):
Below is example output from running the above command line:


<pre class="gscript">
<pre class="gscript">
#!/bin/bash
./check_and_stop_elements.sh 28026197
#SBATCH --job-name=alphafoldjobname   
#SBATCH --partition=gpu_p       
#SBATCH --ntasks=1                 
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:P100:1
#SBATCH --mem=40gb                   
#SBATCH --time=120:00:00         
#SBATCH --output=%x.%j.out   
#SBATCH --error=%x.%j.err         
#SBATCH --mail-user=username@uga.edu 
#SBATCH --mail-type=ALL 


cd $SLURM_SUBMIT_DIR
Pending elements in array job 28026197:


ml AlphaFold/2.1.1-fosscuda-2020b
Number of running elements in array job 28026197:
6


alphafold [options]
Checking the stop string "Running model model_1_multimer_v3_pred_0" in alphaMSAs.*.err for each running element:


Checking alphaMSAs.28026197.err ... the stop string is found! ... go to cancel job 28026197 (28026197_10) ... scancel 28026197_10 ... job canceled.
Checking alphaMSAs.28026206.err ... the stop string is found! ... go to cancel job 28026206 (28026197_9) ... scancel 28026197_9 ... job canceled.
Checking alphaMSAs.28026205.err ... the stop string is NOT found!
Checking alphaMSAs.28026201.err ... the stop string is NOT found!
Checking alphaMSAs.28026200.err ... the stop string is NOT found!
Checking alphaMSAs.28026199.err ... the stop string is NOT found!
</pre>
</pre>
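If you only want a one-off check rather than continuous monitoring, the same stop string can be searched for directly in the .err files. The sketch below is an illustration only. It creates two placeholder .err files in a temporary folder (the job IDs 101 and 102 and the file contents are made up for the demo) and lists which files contain the stop string and which do not. It does not cancel any job.

```shell
#!/bin/bash
# Sketch: list which .err files already contain the stop string (no jobs are canceled).
set -eu

stop_string="Running model model_1_multimer_v3_pred_0"

# Demo folder with placeholder .err files; on the cluster, run the grep
# commands in the folder that holds your alphaMSAs.*.err files
demo=$(mktemp -d)
cd "$demo"
printf 'placeholder line\n%s\n' "$stop_string" > alphaMSAs.101.err
printf 'placeholder line\n' > alphaMSAs.102.err

# -l prints the names of files with a match, -L the names of files without one
finished=$(grep -l -- "$stop_string" alphaMSAs.*.err || true)
pending=$(grep -L -- "$stop_string" alphaMSAs.*.err || true)

echo "MSA generation finished in:"
echo "$finished"
echo "Stop string not found yet in:"
echo "$pending"
```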


where [options] need to be replaced by the options (command and arguments) you want to use. Other parameters of the job, such as the maximum wall clock time, maximum memory, the number of cores per node, and the job name need to be modified appropriately as well.


An example of the options to use for the alphafold script:
Once all elements in step 1 have completed the MSA generation, you can go to step 2. As mentioned above, if the computation in step 2 takes less than 4 hours, you can run the job using the V100 scavenge GPU nodes via batch partition. Below is an example job submission script (sub_step2.sh) for running step 2 on the V100 scavenge GPU nodes:
 
<pre class="gscript">
<pre class="gscript">
alphafold --data_dir /apps/db/AlphaFold/2.1 --output_dir ./output --model_names model_1 --fasta_paths ./query.fasta --max_template_date 2021-11-17
#!/bin/bash
</pre>
#SBATCH --job-name=alphafold
#SBATCH --partition=batch
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32gb
#SBATCH --gres=gpu:V100:1
#SBATCH --time=4:00:00
#SBATCH --output=%x.%j.out   
#SBATCH --error=%x.%j.err         
#SBATCH --array=1-200


Example of job submission
cd $SLURM_SUBMIT_DIR
<pre  class="gcommand">
sbatch sub.sh
</pre>


=== Documentation ===
ml purge


Details and references are at https://github.com/deepmind/alphafold.
ml AlphaFold/2.3.1-foss-2022a-CUDA-11.7.0
export ALPHAFOLD_DATA_DIR=/apps/db/AlphaFold/2.3.1
export TF_FORCE_UNIFIED_MEMORY=1
export XLA_PYTHON_CLIENT_MEM_FRACTION=8


'''Version 2.0.0:'''
file=$(awk "NR==${SLURM_ARRAY_TASK_ID}" input.lst)
<pre  class="gcommand">
ml AlphaFold/2.0.0_conda


bash $EBROOTALPHAFOLD/alphafold/run_alphafold.sh -h
alphafold \
--run_relax=False \
--data_dir=$ALPHAFOLD_DATA_DIR \
--uniref90_database_path=$ALPHAFOLD_DATA_DIR/uniref90/uniref90.fasta \
--mgnify_database_path=$ALPHAFOLD_DATA_DIR/mgnify/mgy_clusters.fa \
--bfd_database_path=$ALPHAFOLD_DATA_DIR/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
--uniref30_database_path=$ALPHAFOLD_DATA_DIR/uniref30/UniRef30_2021_03 \
--pdb_seqres_database_path=$ALPHAFOLD_DATA_DIR/pdb_seqres/pdb_seqres.txt \
--template_mmcif_dir=$ALPHAFOLD_DATA_DIR/pdb_mmcif/mmcif_files \
--obsolete_pdbs_path=$ALPHAFOLD_DATA_DIR/pdb_mmcif/obsolete.dat \
--uniprot_database_path=$ALPHAFOLD_DATA_DIR/uniprot/uniprot.fasta \
--model_preset=multimer \
--use_precomputed_msas=true \
--num_multimer_predictions_per_model=1 \
--max_template_date=2023-10-01 \
--db_preset=full_dbs \
--output_dir=./outputs/$(basename $file .fa) \
--fasta_paths=./inputs/$file
</pre>


Usage: /apps/gb/AlphaFold/2.0.0_conda/alphafold/run_alphafold.sh <OPTIONS>
Please note that the '''--use_precomputed_msas=true''' option is used in the above alphafold command line.
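If you prefer not to submit step 2 by hand, one option is a Slurm job dependency. The sketch below is a suggestion under stated assumptions, not a tested GACRC procedure. It assumes the two scripts are saved as sub_step1.sh and sub_step2.sh, and it uses the standard <code>sbatch</code> options <code>--parsable</code> and <code>--dependency=afterany</code>. The helper function <code>submit</code> and the placeholder job ID 12345 are hypothetical and exist only so that the script can be tried on a machine without Slurm.

```shell
#!/bin/bash
# Sketch: submit step 2 so that it is held until the step 1 array job has ended.
set -eu

submit() {
  # Use sbatch when it is available; otherwise do a dry run that prints the
  # command and returns a placeholder job ID (12345 is purely hypothetical).
  if command -v sbatch >/dev/null 2>&1; then
    sbatch --parsable "$@"
  else
    echo "DRY RUN: sbatch --parsable $*" >&2
    echo "12345"
  fi
}

# --parsable prints "jobid" or "jobid;cluster"; keep only the job ID
step1_id=$(submit sub_step1.sh)
step1_id=${step1_id%%;*}

# afterany releases step 2 once all step 1 elements have terminated,
# whether they completed, failed, or were canceled by the monitoring script
step2_id=$(submit --dependency=afterany:"$step1_id" sub_step2.sh)
step2_id=${step2_id%%;*}

echo "step 1 array job: $step1_id"
echo "step 2 array job: $step2_id (waits for $step1_id)"
```

With this approach check_and_stop_elements.sh is still run as described, using the step 1 array job ID. Step 2 then starts on its own once the last step 1 element has been canceled or has finished.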
Required Parameters:
-d <data_dir>    Path to directory of supporting data
-o <output_dir>  Path to a directory that will store the results.
-m <model_names>  Names of models to use (a comma separated list)
-f <fasta_path>  Path to a FASTA file containing one sequence
-t <max_template_date> Maximum template release date to consider (ISO-8601 format - i.e. YYYY-MM-DD). Important if folding historical test sets
Optional Parameters:
-b <benchmark>    Run multiple JAX model evaluations to obtain a timing that excludes the compilation time, which should be more indicative of the time required for inferencing many
    proteins (default: 'False')
-g <use_gpu>      Enable NVIDIA runtime to run with GPUs (default: 'True')
-a <gpu_devices>  Comma separated list of devices to pass to 'CUDA_VISIBLE_DEVICES' (default: 'all')
-p <preset>      Choose preset model configuration - no ensembling (full_dbs) or 8 model ensemblings (casp14) (default: 'full_dbs')


=== Documentation ===


</pre>
Details and references are at https://github.com/deepmind/alphafold.


'''Version 2.0.1:''' Short help options
'''Version 2.3.4:''' Short help options
<pre  class="gcommand">
<pre  class="gcommand">
ml AlphaFold/2.0.1-fosscuda-2020b
ml AlphaFold/2.3.4-foss-2022a-CUDA-11.7.0-ColabFold


export ALPHAFOLD_DATA_DIR=/apps/db/AlphaFold/2.0
export ALPHAFOLD_DATA_DIR=/apps/db/AlphaFold/2.3.4


alphafold --helpshort
alphafold --helpshort
/apps/eb/jax/0.2.19-fosscuda-2020b/lib/python3.8/site-packages/absl/flags/_validators.py:203: UserWarning: Flag --preset has a non-None default value; therefore, mark_flag_as_required will pass even if flag is not specified in the command line!
 
  warnings.warn(
Full AlphaFold protein structure prediction script.
Full AlphaFold protein structure prediction script.
flags:
flags:


/apps/eb/AlphaFold/2.0.1-fosscuda-2020b/bin/alphafold:
/apps/eb/AlphaFold/2.3.4-foss-2022a-CUDA-11.7.0-ColabFold/bin/alphafold:
   --[no]benchmark: Run multiple JAX model evaluations to obtain a timing that
   --[no]benchmark: Run multiple JAX model evaluations to obtain a timing that excludes the compilation time, which should be more indicative of the time
    excludes the compilation time, which should be more indicative of the time
     required for inferencing many proteins.
     required for inferencing many proteins.
     (default: 'false')
     (default: 'false')
   --bfd_database_path: Path to the BFD database for use by HHblits.
   --bfd_database_path: Path to the BFD database for use by HHblits.
     (default: '/apps/db/AlphaFold/bfd/bfd_metaclust_clu_complete_id30_c90_final_
     (default: '/apps/db/AlphaFold/2.3.4/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt')
    seq.sorted_opt')
   --data_dir: Path to directory of supporting data.
   --data_dir: Path to directory of supporting data.
     (default: '/apps/db/AlphaFold/2.0')
     (default: '/apps/db/AlphaFold/2.3.4')
   --fasta_paths: Paths to FASTA files, each containing one sequence. Paths
  --db_preset: <full_dbs|reduced_dbs>: Choose preset MSA database configuration - smaller genetic database config (reduced_dbs) or full genetic database config
     should be separated by commas. All FASTA paths must have a unique basename
    (full_dbs)
    as the basename is used to name the output directories for each prediction.
    (default: 'full_dbs')
   --fasta_paths: Paths to FASTA files, each containing a prediction target that will be folded one after another. If a FASTA file contains multiple sequences,
     then it will be folded as a multimer. Paths should be separated by commas. All FASTA paths must have a unique basename as the basename is used to name the
    output directories for each prediction.
     (a comma separated list)
     (a comma separated list)
   --hhblits_binary_path: Path to the HHblits executable.
   --hhblits_binary_path: Path to the HHblits executable.
     (default: '/apps/eb/HH-suite/3.3.0-gompic-2020b/bin/hhblits')
     (default: '/apps/eb/HH-suite/3.3.0-gompi-2022a/bin/hhblits')
   --hhsearch_binary_path: Path to the HHsearch executable.
   --hhsearch_binary_path: Path to the HHsearch executable.
     (default: '/apps/eb/HH-suite/3.3.0-gompic-2020b/bin/hhsearch')
     (default: '/apps/eb/HH-suite/3.3.0-gompi-2022a/bin/hhsearch')
  --hmmbuild_binary_path: Path to the hmmbuild executable.
    (default: '/apps/eb/HMMER/3.3.2-gompi-2022a/bin/hmmbuild')
  --hmmsearch_binary_path: Path to the hmmsearch executable.
    (default: '/apps/eb/HMMER/3.3.2-gompi-2022a/bin/hmmsearch')
   --jackhmmer_binary_path: Path to the JackHMMER executable.
   --jackhmmer_binary_path: Path to the JackHMMER executable.
     (default: '/apps/eb/HMMER/3.3.2-gompic-2020b/bin/jackhmmer')
     (default: '/apps/eb/HMMER/3.3.2-gompi-2022a/bin/jackhmmer')
   --kalign_binary_path: Path to the Kalign executable.
   --kalign_binary_path: Path to the Kalign executable.
     (default: '/apps/eb/Kalign/3.3.1-GCCcore-10.2.0/bin/kalign')
     (default: '/apps/eb/Kalign/3.3.5-GCCcore-11.3.0/bin/kalign')
   --max_template_date: Maximum template release date to consider. Important if
   --max_template_date: Maximum template release date to consider. Important if folding historical test sets.
    folding historical test sets.
   --mgnify_database_path: Path to the MGnify database for use by JackHMMER.
   --mgnify_database_path: Path to the MGnify database for use by JackHMMER.
     (default: '/apps/db/AlphaFold/mgnify/mgy_clusters.fa')
     (default: '/apps/db/AlphaFold/2.3.4/mgnify/mgy_clusters_2022_05.fa')
   --model_names: Names of models to use.
   --model_preset: <monomer|monomer_casp14|monomer_ptm|multimer>: Choose preset model configuration - the monomer model, the monomer model with extra ensembling,
     (a comma separated list)
    monomer model with pTM head, or multimer model
   --obsolete_pdbs_path: Path to file containing a mapping from obsolete PDB IDs
    (default: 'monomer')
    to the PDB IDs of their replacements.
  --models_to_relax: <all|best|none>: The models to run the final relaxation step on. If `all`, all models are relaxed, which may be time consuming. If `best`,
     (default: '/apps/db/AlphaFold/pdb_mmcif/obsolete.dat')
    only the most confident model is relaxed. If `none`, relaxation is not run. Turning off relaxation might result in predictions with distracting
    stereochemical violations but might help in case you are having issues with the relaxation stage.
     (default: 'best')
  --num_multimer_predictions_per_model: How many predictions (each with a different random seed) will be generated per model. E.g. if this is 2 and there are 5
    models then there will be 10 predictions per input. Note: this FLAG only applies if model_preset=multimer
    (default: '5')
    (an integer)
   --obsolete_pdbs_path: Path to file containing a mapping from obsolete PDB IDs to the PDB IDs of their replacements.
     (default: '/apps/db/AlphaFold/2.3.4/pdb_mmcif/obsolete.dat')
   --output_dir: Path to a directory that will store the results.
   --output_dir: Path to a directory that will store the results.
   --pdb70_database_path: Path to the PDB70 database for use by HHsearch.
   --pdb70_database_path: Path to the PDB70 database for use by HHsearch.
     (default: '/apps/db/AlphaFold/pdb70/pdb70')
     (default: '/apps/db/AlphaFold/2.3.4/pdb70/pdb70')
   --preset: <reduced_dbs|full_dbs|casp14>: Choose preset model configuration -
   --pdb_seqres_database_path: Path to the PDB seqres database for use by hmmsearch.
    no ensembling and smaller genetic database config (reduced_dbs), no
   --random_seed: The random seed for the data pipeline. By default, this is randomly generated. Note that even if this is set, Alphafold may still not be
    ensembling and full genetic database config  (full_dbs) or full genetic
     deterministic, because processes like GPU inference are nondeterministic.
    database config and 8 model ensemblings (casp14).
    (default: 'full_dbs')
   --random_seed: The random seed for the data pipeline. By default, this is
    randomly generated. Note that even if this is set, Alphafold may still not
     be deterministic, because processes like GPU inference are nondeterministic.
     (an integer)
     (an integer)
   --small_bfd_database_path: Path to the small version of BFD used with the
   --small_bfd_database_path: Path to the small version of BFD used with the "reduced_dbs" preset.
    "reduced_dbs" preset.
   --template_mmcif_dir: Path to a directory with template mmCIF structures, each named <pdb_id>.cif
   --template_mmcif_dir: Path to a directory with template mmCIF structures, each
     (default: '/apps/db/AlphaFold/2.3.4/pdb_mmcif/mmcif_files')
    named <pdb_id>.cif
   --uniprot_database_path: Path to the Uniprot database for use by JackHMMer.
     (default: '/apps/db/AlphaFold/pdb_mmcif/mmcif_files')
  --uniref30_database_path: Path to the UniRef30 database for use by HHblits.
   --uniclust30_database_path: Path to the Uniclust30 database for use by
     (default: '/apps/db/AlphaFold/2.3.4/uniref30/UniRef30_2021_03')
    HHblits.
     (default:
    '/apps/db/AlphaFold/uniclust30/uniclust30_2018_08/uniclust30_2018_08')
   --uniref90_database_path: Path to the Uniref90 database for use by JackHMMER.
   --uniref90_database_path: Path to the Uniref90 database for use by JackHMMER.
     (default: '/apps/db/AlphaFold/uniref90/uniref90.fasta')
     (default: '/apps/db/AlphaFold/2.3.4/uniref90/uniref90.fasta')
  --[no]use_gpu_relax: Whether to relax on GPU. Relax on GPU can be much faster than CPU, so it is recommended to enable if possible. GPUs must be available if
    this setting is enabled.
    (default: 'true')
  --[no]use_precomputed_msas: Whether to read MSAs that have been written to disk instead of running the MSA tools. The MSA files are looked up in the output
    directory, so it must stay the same between multiple runs that are to reuse the MSAs. WARNING: This will not check if the sequence, database or
    configuration have changed.
    (default: 'false')


Try --helpfull to get a list of all flags.
Try --helpfull to get a list of all flags.
Line 562: Line 617:
</pre>
</pre>


'''Version 2.0.1:''' Full help options
'''Version 2.3.4:''' Full help options
<pre  class="gcommand">
<pre  class="gcommand">
ml AlphaFold/2.0.1-fosscuda-2020b
ml AlphaFold/2.3.4-foss-2022a-CUDA-11.7.0-ColabFold


export ALPHAFOLD_DATA_DIR=/apps/db/AlphaFold/2.0
export ALPHAFOLD_DATA_DIR=/apps/db/AlphaFold/2.3.4


alphafold --helpfull
alphafold --helpfull
/apps/eb/jax/0.2.19-fosscuda-2020b/lib/python3.8/site-packages/absl/flags/_validators.py:203: UserWarning: Flag --preset has a non-None default value; therefore, mark_flag_as_required will pass even if flag is not specified in the command line!
 
  warnings.warn(
Full AlphaFold protein structure prediction script.
Full AlphaFold protein structure prediction script.
flags:
flags:


/apps/eb/AlphaFold/2.0.1-fosscuda-2020b/bin/alphafold:
/apps/eb/AlphaFold/2.3.4-foss-2022a-CUDA-11.7.0-ColabFold/bin/alphafold:
   --[no]benchmark: Run multiple JAX model evaluations to obtain a timing that
   --[no]benchmark: Run multiple JAX model evaluations to obtain a timing that excludes the compilation time, which should be more indicative of the time
    excludes the compilation time, which should be more indicative of the time
     required for inferencing many proteins.
     required for inferencing many proteins.
     (default: 'false')
     (default: 'false')
   --bfd_database_path: Path to the BFD database for use by HHblits.
   --bfd_database_path: Path to the BFD database for use by HHblits.
     (default: '/apps/db/AlphaFold/bfd/bfd_metaclust_clu_complete_id30_c90_final_
     (default: '/apps/db/AlphaFold/2.3.4/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt')
    seq.sorted_opt')
   --data_dir: Path to directory of supporting data.
   --data_dir: Path to directory of supporting data.
     (default: '/apps/db/AlphaFold/2.0')
     (default: '/apps/db/AlphaFold/2.3.4')
   --fasta_paths: Paths to FASTA files, each containing one sequence. Paths
  --db_preset: <full_dbs|reduced_dbs>: Choose preset MSA database configuration - smaller genetic database config (reduced_dbs) or full genetic database config
     should be separated by commas. All FASTA paths must have a unique basename
    (full_dbs)
    as the basename is used to name the output directories for each prediction.
    (default: 'full_dbs')
   --fasta_paths: Paths to FASTA files, each containing a prediction target that will be folded one after another. If a FASTA file contains multiple sequences,
     then it will be folded as a multimer. Paths should be separated by commas. All FASTA paths must have a unique basename as the basename is used to name the
    output directories for each prediction.
     (a comma separated list)
     (a comma separated list)
   --hhblits_binary_path: Path to the HHblits executable.
   --hhblits_binary_path: Path to the HHblits executable.
     (default: '/apps/eb/HH-suite/3.3.0-gompic-2020b/bin/hhblits')
     (default: '/apps/eb/HH-suite/3.3.0-gompi-2022a/bin/hhblits')
   --hhsearch_binary_path: Path to the HHsearch executable.
   --hhsearch_binary_path: Path to the HHsearch executable.
     (default: '/apps/eb/HH-suite/3.3.0-gompic-2020b/bin/hhsearch')
     (default: '/apps/eb/HH-suite/3.3.0-gompi-2022a/bin/hhsearch')
  --hmmbuild_binary_path: Path to the hmmbuild executable.
    (default: '/apps/eb/HMMER/3.3.2-gompi-2022a/bin/hmmbuild')
  --hmmsearch_binary_path: Path to the hmmsearch executable.
    (default: '/apps/eb/HMMER/3.3.2-gompi-2022a/bin/hmmsearch')
   --jackhmmer_binary_path: Path to the JackHMMER executable.
   --jackhmmer_binary_path: Path to the JackHMMER executable.
     (default: '/apps/eb/HMMER/3.3.2-gompic-2020b/bin/jackhmmer')
     (default: '/apps/eb/HMMER/3.3.2-gompi-2022a/bin/jackhmmer')
   --kalign_binary_path: Path to the Kalign executable.
   --kalign_binary_path: Path to the Kalign executable.
     (default: '/apps/eb/Kalign/3.3.1-GCCcore-10.2.0/bin/kalign')
     (default: '/apps/eb/Kalign/3.3.5-GCCcore-11.3.0/bin/kalign')
   --max_template_date: Maximum template release date to consider. Important if
   --max_template_date: Maximum template release date to consider. Important if folding historical test sets.
    folding historical test sets.
   --mgnify_database_path: Path to the MGnify database for use by JackHMMER.
   --mgnify_database_path: Path to the MGnify database for use by JackHMMER.
     (default: '/apps/db/AlphaFold/mgnify/mgy_clusters.fa')
     (default: '/apps/db/AlphaFold/2.3.4/mgnify/mgy_clusters_2022_05.fa')
   --model_names: Names of models to use.
   --model_preset: <monomer|monomer_casp14|monomer_ptm|multimer>: Choose preset model configuration - the monomer model, the monomer model with extra ensembling,
     (a comma separated list)
    monomer model with pTM head, or multimer model
   --obsolete_pdbs_path: Path to file containing a mapping from obsolete PDB IDs
    (default: 'monomer')
    to the PDB IDs of their replacements.
  --models_to_relax: <all|best|none>: The models to run the final relaxation step on. If `all`, all models are relaxed, which may be time consuming. If `best`,
     (default: '/apps/db/AlphaFold/pdb_mmcif/obsolete.dat')
    only the most confident model is relaxed. If `none`, relaxation is not run. Turning off relaxation might result in predictions with distracting
    stereochemical violations but might help in case you are having issues with the relaxation stage.
     (default: 'best')
  --num_multimer_predictions_per_model: How many predictions (each with a different random seed) will be generated per model. E.g. if this is 2 and there are 5
    models then there will be 10 predictions per input. Note: this FLAG only applies if model_preset=multimer
    (default: '5')
    (an integer)
   --obsolete_pdbs_path: Path to file containing a mapping from obsolete PDB IDs to the PDB IDs of their replacements.
     (default: '/apps/db/AlphaFold/2.3.4/pdb_mmcif/obsolete.dat')
   --output_dir: Path to a directory that will store the results.
   --output_dir: Path to a directory that will store the results.
   --pdb70_database_path: Path to the PDB70 database for use by HHsearch.
   --pdb70_database_path: Path to the PDB70 database for use by HHsearch.
     (default: '/apps/db/AlphaFold/pdb70/pdb70')
     (default: '/apps/db/AlphaFold/2.3.4/pdb70/pdb70')
   --preset: <reduced_dbs|full_dbs|casp14>: Choose preset model configuration -
   --pdb_seqres_database_path: Path to the PDB seqres database for use by hmmsearch.
    no ensembling and smaller genetic database config (reduced_dbs), no
   --random_seed: The random seed for the data pipeline. By default, this is randomly generated. Note that even if this is set, Alphafold may still not be
    ensembling and full genetic database config  (full_dbs) or full genetic
     deterministic, because processes like GPU inference are nondeterministic.
    database config and 8 model ensemblings (casp14).
    (default: 'full_dbs')
   --random_seed: The random seed for the data pipeline. By default, this is
    randomly generated. Note that even if this is set, Alphafold may still not
     be deterministic, because processes like GPU inference are nondeterministic.
     (an integer)
     (an integer)
   --small_bfd_database_path: Path to the small version of BFD used with the
   --small_bfd_database_path: Path to the small version of BFD used with the "reduced_dbs" preset.
    "reduced_dbs" preset.
   --template_mmcif_dir: Path to a directory with template mmCIF structures, each named <pdb_id>.cif
   --template_mmcif_dir: Path to a directory with template mmCIF structures, each
     (default: '/apps/db/AlphaFold/2.3.4/pdb_mmcif/mmcif_files')
    named <pdb_id>.cif
   --uniprot_database_path: Path to the Uniprot database for use by JackHMMer.
     (default: '/apps/db/AlphaFold/pdb_mmcif/mmcif_files')
  --uniref30_database_path: Path to the UniRef30 database for use by HHblits.
   --uniclust30_database_path: Path to the Uniclust30 database for use by
     (default: '/apps/db/AlphaFold/2.3.4/uniref30/UniRef30_2021_03')
    HHblits.
     (default:
    '/apps/db/AlphaFold/uniclust30/uniclust30_2018_08/uniclust30_2018_08')
   --uniref90_database_path: Path to the Uniref90 database for use by JackHMMER.
   --uniref90_database_path: Path to the Uniref90 database for use by JackHMMER.
     (default: '/apps/db/AlphaFold/uniref90/uniref90.fasta')
     (default: '/apps/db/AlphaFold/2.3.4/uniref90/uniref90.fasta')
  --[no]use_gpu_relax: Whether to relax on GPU. Relax on GPU can be much faster than CPU, so it is recommended to enable if possible. GPUs must be available if
    this setting is enabled.
    (default: 'true')
  --[no]use_precomputed_msas: Whether to read MSAs that have been written to disk instead of running the MSA tools. The MSA files are looked up in the output
    directory, so it must stay the same between multiple runs that are to reuse the MSAs. WARNING: This will not check if the sequence, database or
    configuration have changed.
    (default: 'false')


absl.app:
absl.app:
Line 642: Line 707:
   --[no]pdb: Alias for --pdb_post_mortem.
   --[no]pdb: Alias for --pdb_post_mortem.
     (default: 'false')
     (default: 'false')
   --[no]pdb_post_mortem: Set to true to handle uncaught exceptions with PDB post
   --[no]pdb_post_mortem: Set to true to handle uncaught exceptions with PDB post mortem.
    mortem.
     (default: 'false')
     (default: 'false')
   --profile_file: Dump profile information to a file (for python -m pstats).
   --profile_file: Dump profile information to a file (for python -m pstats). Implies --run_with_profiling.
    Implies --run_with_profiling.
   --[no]run_with_pdb: Set to true for PDB debug mode
   --[no]run_with_pdb: Set to true for PDB debug mode
     (default: 'false')
     (default: 'false')
   --[no]run_with_profiling: Set to true for profiling the script. Execution will
   --[no]run_with_profiling: Set to true for profiling the script. Execution will be slower, and the output format might change over time.
    be slower, and the output format might change over time.
     (default: 'false')
     (default: 'false')
   --[no]use_cprofile_for_profiling: Use cProfile instead of the profile module
   --[no]use_cprofile_for_profiling: Use cProfile instead of the profile module for profiling. This has no effect unless --run_with_profiling is set.
    for profiling. This has no effect unless --run_with_profiling is set.
     (default: 'true')
     (default: 'true')


Line 661: Line 722:
   --log_dir: directory to write logfiles into
   --log_dir: directory to write logfiles into
     (default: '')
     (default: '')
   --logger_levels: Specify log level of loggers. The format is a CSV list of
   --logger_levels: Specify log level of loggers. The format is a CSV list of `name:level`. Where `name` is the logger name used with `logging.getLogger()`, and
    `name:level`. Where `name` is the logger name used with
    `level` is a level name  (INFO, DEBUG, etc). e.g. `myapp.foo:INFO,other.logger:DEBUG`
    `logging.getLogger()`, and `level` is a level name  (INFO, DEBUG, etc). e.g.
    `myapp.foo:INFO,other.logger:DEBUG`
     (default: '')
     (default: '')
   --[no]logtostderr: Should only log to stderr?
   --[no]logtostderr: Should only log to stderr?
     (default: 'false')
     (default: 'false')
   --[no]showprefixforinfo: If False, do not prepend prefix to info messages when
   --[no]showprefixforinfo: If False, do not prepend prefix to info messages when it's logged to stderr, --verbosity is set to INFO level, and python logging is
    it's logged to stderr, --verbosity is set to INFO level, and python logging
     used.
     is used.
     (default: 'true')
     (default: 'true')
   --stderrthreshold: log messages at this level, or more severe, to stderr in
   --stderrthreshold: log messages at this level, or more severe, to stderr in addition to the logfile.  Possible values are 'debug', 'info', 'warning', 'error',
    addition to the logfile.  Possible values are 'debug', 'info', 'warning',
    and 'fatal'.  Obsoletes --alsologtostderr. Using --alsologtostderr cancels the effect of this flag. Please also note that this flag is subject to
    'error', and 'fatal'.  Obsoletes --alsologtostderr. Using --alsologtostderr
     --verbosity and requires logfile not be stderr.
    cancels the effect of this flag. Please also note that this flag is subject
     to --verbosity and requires logfile not be stderr.
     (default: 'fatal')
     (default: 'fatal')
   -v,--verbosity: Logging verbosity level. Messages logged at this level or
   -v,--verbosity: Logging verbosity level. Messages logged at this level or lower will be included. Set to 1 for debug logging. If the flag was not set or
    lower will be included. Set to 1 for debug logging. If the flag was not set
     supplied, the value will be changed from the default of -1 (warning) to 0 (info) after flags are parsed.
     or supplied, the value will be changed from the default of -1 (warning) to 0
    (info) after flags are parsed.
     (default: '-1')
     (default: '-1')
     (an integer)
     (an integer)


absl.testing.absltest:
absl.testing.absltest:
   --test_random_seed: Random seed for testing. Some test frameworks may change
   --test_random_seed: Random seed for testing. Some test frameworks may change the default value of this flag between runs, so it is not appropriate for seeding
    the default value of this flag between runs, so it is not appropriate for
     probabilistic tests.
     seeding probabilistic tests.
     (default: '301')
     (default: '301')
     (an integer)
     (an integer)
   --test_randomize_ordering_seed: If positive, use this as a seed to randomize
   --test_randomize_ordering_seed: If positive, use this as a seed to randomize the execution order for test cases. If "random", pick a random seed to use. If 0
    the execution order for test cases. If "random", pick a random seed to use.
     or not set, do not randomize test case execution order. This flag also overrides the TEST_RANDOMIZE_ORDERING_SEED environment variable.
     If 0 or not set, do not randomize test case execution order. This flag also
    overrides the TEST_RANDOMIZE_ORDERING_SEED environment variable.
     (default: '')
     (default: '')
   --test_srcdir: Root of directory tree where source files live
   --test_srcdir: Root of directory tree where source files live
Line 712: Line 763:
   --[no]runtime_oom_exit: Exit the script when the TPU runtime is OOM.
   --[no]runtime_oom_exit: Exit the script when the TPU runtime is OOM.
     (default: 'true')
     (default: 'true')
tensorflow.python.tpu.tensor_tracer_flags:
  --delta_threshold: Log if history based diff crosses this threshold.
    (default: '0.5')
    (a number)
  --[no]tt_check_filter: Terminate early to check op name filtering.
    (default: 'false')
  --[no]tt_single_core_summaries: Report single core metric and avoid aggregation.
    (default: 'false')


absl.flags:
absl.flags:
   --flagfile: Insert flag definitions from the given file into the command line.
   --flagfile: Insert flag definitions from the given file into the command line.
     (default: '')
     (default: '')
   --undefok: comma-separated list of flag names that it is okay to specify on
   --undefok: comma-separated list of flag names that it is okay to specify on the command line even if the program does not define a flag with that name.
    the command line even if the program does not define a flag with that name.
     IMPORTANT: flags in this list that have arguments MUST use the --flag=value format.
     IMPORTANT: flags in this list that have arguments MUST use the --flag=value
    format.
     (default: '')
     (default: '')


Line 728: Line 786:
=== Installation ===
=== Installation ===


*Version 2.0.0: Installed using a conda environment following the steps in the dockerfile available at https://github.com/deepmind/alphafold. The run_alphafold.sh bash script was obtained from https://github.com/kalininalab/alphafold_non_docker and some documentation related to this script is available at that URL.
*Version 2.2.4: Installed as a singularity container pulled from https://hub.docker.com/r/catgumag/alphafold
 
*Version 2.0.1: Installed using EasyBuild.
 
*Version 2.1.0: Installed using EasyBuild.
 
*Version 2.1.1: Installed using EasyBuild.


*Version 2.2.0: Installed using EasyBuild.
*Version 2.3.1: Installed as a singularity container pulled from https://hub.docker.com/r/catgumag/alphafold  
 
*Version 2.2.4: Installed as a singularity container pulled from https://hub.docker.com/r/catgumag/alphafold


*Version 2.3.1: Installed using EasyBuild.


*The database files are installed in /apps/db/AlphaFold/
*The database files are installed in /apps/db/AlphaFold/

Latest revision as of 11:59, 8 May 2024


Category

Bioinformatics

Program On

Sapelo2

Version

2.2.4, 2.3.1, 2.3.4

Author / Distributor

Please see https://github.com/deepmind/alphafold

Description

From https://github.com/deepmind/alphafold: "This package provides an implementation of the inference pipeline of AlphaFold v2.0. This is a completely new model that was entered in CASP14 and published in Nature."

Running Program

Also refer to Running Jobs on Sapelo2

For more information on Environment Modules on Sapelo2 please see the Lmod page.


  • Version 2.2.4

This version is installed as a singularity container pulled from https://hub.docker.com/r/catgumag/alphafold:

/apps/singularity-images/alphafold_2.2.4.sif

You can view the documentation for this version of AlphaFold with the following command, on an interactive node:

singularity exec /apps/singularity-images/alphafold_2.2.4.sif python /app/alphafold/run_alphafold.py --helpfull

This version works on nodes with Intel CPUs, such as the P100 GPU nodes. This container does not work on the A100 node, which has an AMD processor.

The database files are installed in /apps/db/AlphaFold/2.2.4. To make them available inside the singularity container, add the option -B /apps/db/AlphaFold to the singularity exec command, as shown in the sample job submission scripts below. Add the --nv option so that AlphaFold can run on the GPU. With this container, the --use_gpu_relax option must also be passed to run_alphafold.py.


  • Version 2.3.1 (on Intel nodes only)

This version is installed as a singularity container pulled from https://hub.docker.com/r/catgumag/alphafold:

/apps/singularity-images/alphafold_2.3.1_cuda112.sif

You can view the documentation for this version of AlphaFold with the following command, on an interactive node:

singularity exec /apps/singularity-images/alphafold_2.3.1_cuda112.sif python /app/alphafold/run_alphafold.py --helpfull

This version works on nodes with Intel CPUs, such as the P100 GPU nodes. This container does not work on the A100 node, which has an AMD processor.

The database files are installed in /apps/db/AlphaFold/2.3.1. To make them available inside the singularity container, add the option -B /apps/db/AlphaFold to the singularity exec command, as shown in the sample job submission scripts below. Add the --nv option so that AlphaFold can run on the GPU, and the option -B /apps/eb/CUDAcore/11.2.1 so that the singularity container can link to the CUDA libraries. With this container, the --use_gpu_relax option must also be passed to run_alphafold.py.
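
Before submitting a job, it can help to confirm that the container image and the two bind-mounted directories are readable from the node you are on. The following is a minimal sketch rather than an official procedure: the helper name check_readable is hypothetical, and the paths are simply the ones listed in this section.

```shell
#!/bin/bash
# Hypothetical helper: report every path that is missing or unreadable.
# Returns 0 if all paths are readable, 1 otherwise.
check_readable() {
    local status=0 p
    for p in "$@"; do
        if [ ! -r "$p" ]; then
            echo "not readable: $p" >&2
            status=1
        fi
    done
    return "$status"
}

# Paths taken from the text above: the image and the two -B bind mounts.
if check_readable \
    /apps/singularity-images/alphafold_2.3.1_cuda112.sif \
    /apps/db/AlphaFold \
    /apps/eb/CUDAcore/11.2.1
then
    echo "all paths readable"
else
    echo "one or more paths are missing; check the paths and the node type"
fi
```

The check only tests readability; it does not verify that the node has an Intel CPU or a GPU.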


  • Version 2.3.1

Installed with EasyBuild in /apps/eb/AlphaFold/2.3.1-foss-2022a-CUDA-11.7.0/

To use this version of AlphaFold, please first load the module with

ml AlphaFold/2.3.1-foss-2022a-CUDA-11.7.0

Once you load the module, an environment variable called EBROOTALPHAFOLD is exported. It stores the AlphaFold installation path on the cluster, i.e., /apps/eb/AlphaFold/2.3.1-foss-2022a-CUDA-11.7.0. The Python script run_alphafold.py is installed in $EBROOTALPHAFOLD/bin, and a symbolic link called alphafold points to it and can be used to run the program. The 2.2 TB of database files are in /apps/db/AlphaFold/2.3.1. You can export the environment variable ALPHAFOLD_DATA_DIR to set the location of the database files. For bash, use

export ALPHAFOLD_DATA_DIR=/apps/db/AlphaFold/2.3.1

When you load the AlphaFold/2.3.1-foss-2022a-CUDA-11.7.0 module, this environment variable will be automatically set.

Note: If you run this program on the gpu_p partition, please request a P100 or an A100 GPU device. This version requires a GPU device and it should work on the P100, V100, and A100 devices.
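
The module steps above can be combined into a job script along the following lines. This is a sketch under stated assumptions, not a tested recipe: the #SBATCH values are copied from the container samples on this page and may need adjusting, the helper name run_alphafold_module and the input and output paths are hypothetical, and the database flags are omitted on the assumption that their defaults follow ALPHAFOLD_DATA_DIR, as the defaults in the alphafold --helpfull output suggest. By default the script only prints the command; set DRY_RUN=0 to run it.

```shell
#!/bin/bash
#SBATCH --job-name=alphafoldjobname
#SBATCH --partition=gpu_p
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:P100:1
#SBATCH --mem=50gb
#SBATCH --time=120:00:00

# Hypothetical helper: build the alphafold command line and either
# print it (DRY_RUN=1, the default) or execute it (DRY_RUN=0).
# $1 = FASTA file, $2 = output directory
run_alphafold_module() {
    # Flag names are the ones listed by `alphafold --helpfull`.
    set -- alphafold \
        --fasta_paths="$1" \
        --output_dir="$2" \
        --model_preset=monomer \
        --db_preset=full_dbs \
        --max_template_date=2022-10-01
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Load the module only for a real run; a dry run needs nothing installed.
if [ "${DRY_RUN:-1}" != "1" ]; then
    ml AlphaFold/2.3.1-foss-2022a-CUDA-11.7.0
fi

# ./input.fasta and ./output are placeholder paths.
run_alphafold_module ./input.fasta ./output
```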


  • Version 2.3.4

Installed with EasyBuild in /apps/eb/AlphaFold/2.3.4-foss-2022a-CUDA-11.7.0-ColabFold/

To use this version of AlphaFold, please first load the module with

ml AlphaFold/2.3.4-foss-2022a-CUDA-11.7.0-ColabFold

Once you load the module, an environment variable called EBROOTALPHAFOLD is exported. It stores the AlphaFold installation path on the cluster, i.e., /apps/eb/AlphaFold/2.3.4-foss-2022a-CUDA-11.7.0-ColabFold. The Python script run_alphafold.py is installed in $EBROOTALPHAFOLD/bin, and a symbolic link called alphafold points to it and can be used to run the program. The 2.2 TB of database files are in /apps/db/AlphaFold/2.3.4. You can export the environment variable ALPHAFOLD_DATA_DIR to set the location of the database files. For bash, use

export ALPHAFOLD_DATA_DIR=/apps/db/AlphaFold/2.3.4

When you load the AlphaFold/2.3.4-foss-2022a-CUDA-11.7.0-ColabFold module, this environment variable will be automatically set.

Note: If you run this program on the gpu_p partition, please request a P100 or an A100 GPU device. This version requires a GPU device.

As of this revision, this version does not appear to be working.


Sample Job Submission scripts

Sample job submission script to run the singularity container for v. 2.3.1 for Monomer on a GPU:

#!/bin/bash
#SBATCH --job-name=alphafoldjobname       
#SBATCH --partition=gpu_p         
#SBATCH --ntasks=1                  	
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:P100:1
#SBATCH --mem=50gb        
#SBATCH --constraint=Intel            
#SBATCH --time=120:00:00           
#SBATCH --output=%x.%j.out     
#SBATCH --error=%x.%j.err          

cd $SLURM_SUBMIT_DIR

ml purge
export SINGULARITYENV_TF_FORCE_UNIFIED_MEMORY=1
export SINGULARITYENV_XLA_PYTHON_CLIENT_MEM_FRACTION=4.0

ALPHAFOLD_DATA_DIR=/apps/db/AlphaFold/2.3.1

singularity exec -B /apps/db/AlphaFold -B /apps/eb/CUDAcore/11.2.1 \
 --nv /apps/singularity-images/alphafold_2.3.1_cuda112.sif python /app/alphafold/run_alphafold.py  \
 --use_gpu_relax \
 --data_dir=$ALPHAFOLD_DATA_DIR  \
 --uniref90_database_path=$ALPHAFOLD_DATA_DIR/uniref90/uniref90.fasta  \
 --mgnify_database_path=$ALPHAFOLD_DATA_DIR/mgnify/mgy_clusters.fa  \
 --bfd_database_path=$ALPHAFOLD_DATA_DIR/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt  \
 --uniref30_database_path=$ALPHAFOLD_DATA_DIR/uniref30/UniRef30_2021_03 \
 --pdb70_database_path=$ALPHAFOLD_DATA_DIR/pdb70/pdb70  \
 --template_mmcif_dir=$ALPHAFOLD_DATA_DIR/pdb_mmcif/mmcif_files  \
 --obsolete_pdbs_path=$ALPHAFOLD_DATA_DIR/pdb_mmcif/obsolete.dat \
 --model_preset=monomer \
 --max_template_date=2022-10-01 \
 --db_preset=full_dbs \
 --output_dir=./output \
 --fasta_paths=./IL2Y.fasta


Sample job submission script to run the singularity container for v. 2.3.1 for Multimer on a GPU:

#!/bin/bash
#SBATCH --job-name=alphafold
#SBATCH --partition=gpu_p
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=6
#SBATCH --gres=gpu:P100:1
#SBATCH --mem=60gb
#SBATCH --constraint=Intel  
#SBATCH --time=120:00:00
#SBATCH --output=%x.%j.out     
#SBATCH --error=%x.%j.err          

cd $SLURM_SUBMIT_DIR

ml purge
export SINGULARITYENV_TF_FORCE_UNIFIED_MEMORY=1
export SINGULARITYENV_XLA_PYTHON_CLIENT_MEM_FRACTION=4.0

export ALPHAFOLD_DATA_DIR=/apps/db/AlphaFold/2.3.1

singularity exec -B /apps/db/AlphaFold -B /apps/eb/CUDAcore/11.2.1 \
--nv /apps/singularity-images/alphafold_2.3.1_cuda112.sif python /app/alphafold/run_alphafold.py \
--use_gpu_relax \
--data_dir=$ALPHAFOLD_DATA_DIR \
--uniref90_database_path=$ALPHAFOLD_DATA_DIR/uniref90/uniref90.fasta \
--mgnify_database_path=$ALPHAFOLD_DATA_DIR/mgnify/mgy_clusters.fa \
--bfd_database_path=$ALPHAFOLD_DATA_DIR/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
--uniref30_database_path=$ALPHAFOLD_DATA_DIR/uniref30/UniRef30_2021_03 \
--pdb_seqres_database_path=$ALPHAFOLD_DATA_DIR/pdb_seqres/pdb_seqres.txt \
--template_mmcif_dir=$ALPHAFOLD_DATA_DIR/pdb_mmcif/mmcif_files \
--obsolete_pdbs_path=$ALPHAFOLD_DATA_DIR/pdb_mmcif/obsolete.dat \
--uniprot_database_path=$ALPHAFOLD_DATA_DIR/uniprot/uniprot.fasta \
--model_preset=multimer \
--max_template_date=2022-10-01 \
--db_preset=full_dbs \
--output_dir=./output \
--fasta_paths=./input.fa

Notes about the singularity container for version 2.3.1:

  • Use the -B /apps/db/AlphaFold option to allow singularity to access the location where the database files are installed.
  • Use the -B /apps/eb/CUDAcore/11.2.1 option to allow singularity to access the CUDA libraries.
  • Use the --nv option to allow singularity to run on a GPU. Note that the job will also need to request a GPU device using the #SBATCH --gres parameter.
  • The only parameter for the run_alphafold.py script that you need to change in these sample job submission scripts is the path to your fasta file: --fasta_paths=
  • You can also change these: --max_template_date and --output_dir
  • The lines that have $ALPHAFOLD_DATA_DIR can be used exactly as they are.
  • The job runs on CPU only at first and switches to a single GPU at a later stage, so it suffices to request one GPU device for the job.
  • This version works on nodes with Intel CPUs, such as the P100 GPU nodes. This container does not work on the A100 node, which has an AMD processor.
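
Since only --fasta_paths, --max_template_date and --output_dir normally change between runs, one option is to keep them in shell variables at the top of the job script and check them before the singularity command starts. The sketch below assumes hypothetical variable names and a hypothetical helper is_iso_date. The zero-padded YYYY-MM-DD check is a convention chosen for this sketch, not a documented AlphaFold requirement.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if $1 has the form YYYY-MM-DD.
# This checks the shape of the string, not whether the date exists.
is_iso_date() {
    case "$1" in
        [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]) return 0 ;;
        *) return 1 ;;
    esac
}

FASTA_PATH=./input.fasta        # placeholder: path to your FASTA file
MAX_TEMPLATE_DATE=2022-10-01    # value used in the sample scripts
OUTPUT_DIR=./output             # value used in the sample scripts

if ! is_iso_date "$MAX_TEMPLATE_DATE"; then
    echo "max_template_date should look like YYYY-MM-DD" >&2
elif [ ! -s "$FASTA_PATH" ]; then
    echo "FASTA file is missing or empty: $FASTA_PATH" >&2
else
    # These three flags can then replace the literal values in the samples.
    echo "--fasta_paths=$FASTA_PATH --max_template_date=$MAX_TEMPLATE_DATE --output_dir=$OUTPUT_DIR"
fi
```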


Sample job submission script to run the singularity container for v. 2.2.4 for Monomer on a GPU:

#!/bin/bash
#SBATCH --job-name=alphafoldjobname       
#SBATCH --partition=gpu_p         
#SBATCH --ntasks=1                  	
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:P100:1
#SBATCH --mem=50gb        
#SBATCH --constraint=Intel            
#SBATCH --time=120:00:00           
#SBATCH --output=%x.%j.out     
#SBATCH --error=%x.%j.err          

cd $SLURM_SUBMIT_DIR

ml purge
export SINGULARITYENV_TF_FORCE_UNIFIED_MEMORY=1
export SINGULARITYENV_XLA_PYTHON_CLIENT_MEM_FRACTION=4.0

ALPHAFOLD_DATA_DIR=/apps/db/AlphaFold/2.2.4

singularity exec -B /apps/db/AlphaFold --nv /apps/singularity-images/alphafold_2.2.4.sif python /app/alphafold/run_alphafold.py  \
 --use_gpu_relax \
 --data_dir=$ALPHAFOLD_DATA_DIR  \
 --uniref90_database_path=$ALPHAFOLD_DATA_DIR/uniref90/uniref90.fasta  \
 --mgnify_database_path=$ALPHAFOLD_DATA_DIR/mgnify/mgy_clusters.fa  \
 --bfd_database_path=$ALPHAFOLD_DATA_DIR/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt  \
 --uniclust30_database_path=$ALPHAFOLD_DATA_DIR/uniclust30/uniclust30/UniRef30_2021_03 \
 --pdb70_database_path=$ALPHAFOLD_DATA_DIR/pdb70/pdb70  \
 --template_mmcif_dir=$ALPHAFOLD_DATA_DIR/pdb_mmcif/mmcif_files  \
 --obsolete_pdbs_path=$ALPHAFOLD_DATA_DIR/pdb_mmcif/obsolete.dat \
 --model_preset=monomer \
 --max_template_date=2022-01-01 \
 --db_preset=full_dbs \
 --output_dir=./output \
 --fasta_paths=./input.fasta

Sample job submission script to run the singularity container for v. 2.2.4 for Multimer on a GPU:

#!/bin/bash
#SBATCH --job-name=alphafoldjobname    
#SBATCH --partition=gpu_p         
#SBATCH --ntasks=1                  	
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:P100:1
#SBATCH --mem=50gb         
#SBATCH --constraint=Intel           
#SBATCH --time=120:00:00           
#SBATCH --output=%x.%j.out     
#SBATCH --error=%x.%j.err          

cd $SLURM_SUBMIT_DIR
ml purge

export SINGULARITYENV_TF_FORCE_UNIFIED_MEMORY=1
export SINGULARITYENV_XLA_PYTHON_CLIENT_MEM_FRACTION=4.0

ALPHAFOLD_DATA_DIR=/apps/db/AlphaFold/2.2.4

singularity exec -B /apps/db/AlphaFold --nv /apps/singularity-images/alphafold_2.2.4.sif python /app/alphafold/run_alphafold.py \
--use_gpu_relax \
--data_dir=$ALPHAFOLD_DATA_DIR \
--uniref90_database_path=$ALPHAFOLD_DATA_DIR/uniref90/uniref90.fasta \
--mgnify_database_path=$ALPHAFOLD_DATA_DIR/mgnify/mgy_clusters.fa \
--bfd_database_path=$ALPHAFOLD_DATA_DIR/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
--uniclust30_database_path=$ALPHAFOLD_DATA_DIR/uniclust30/uniclust30/UniRef30_2021_03 \
--pdb_seqres_database_path=$ALPHAFOLD_DATA_DIR/pdb_seqres/pdb_seqres.txt \
--template_mmcif_dir=$ALPHAFOLD_DATA_DIR/pdb_mmcif/mmcif_files \
--obsolete_pdbs_path=$ALPHAFOLD_DATA_DIR/pdb_mmcif/obsolete.dat \
--uniprot_database_path=$ALPHAFOLD_DATA_DIR/uniprot/uniprot.fasta \
--model_preset=multimer \
--max_template_date=2022-10-01 \
--db_preset=full_dbs \
--output_dir=./output \
--fasta_paths=./input.fasta

Notes about the singularity container for version 2.2.4:

  • Use the -B /apps/db/AlphaFold option to allow singularity to access the location where the database files are installed.
  • Use the --nv option to allow singularity to run on a GPU. Note that the job will also need to request a GPU device using the #SBATCH --gres parameter.
  • The only parameter for the run_alphafold.py script that you need to change in these sample job submission scripts is the path to your fasta file: --fasta_paths=
  • You can also change these: --max_template_date and --output_dir
  • The lines that have $ALPHAFOLD_DATA_DIR can be used exactly as they are.
  • The job initially runs on CPU only; at a later stage it runs on a single GPU, so it suffices to request one GPU device for the job.
  • This version works on nodes with an Intel CPU, such as the P100 GPU nodes. Note that this container does not work on the A100 node, which has an AMD processor.
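
The Monomer and Multimer sample scripts differ in the --model_preset value and in a few database paths, so it can help to check how many sequences a FASTA file contains before choosing one. The following is a minimal sketch under that assumption: the function names are hypothetical, and sequences are counted simply as lines that start with ">".

```shell
#!/bin/bash
# Hypothetical helper: count the sequences in a FASTA file by counting header lines.
count_fasta_sequences() {
  # grep -c prints 0 and returns non-zero when nothing matches, hence "|| true"
  grep -c '^>' "$1" || true
}

# Suggest a --model_preset value: one sequence -> monomer, several -> multimer.
suggest_model_preset() {
  local n
  [ -r "$1" ] || { echo "Cannot read $1" >&2; return 1; }
  n=$(count_fasta_sequences "$1")
  if [ "$n" -gt 1 ]; then
    echo multimer
  elif [ "$n" -eq 1 ]; then
    echo monomer
  else
    echo "No sequences found in $1" >&2
    return 1
  fi
}

# Example use:
#   suggest_model_preset ./input.fasta
```

This only suggests which sample script to start from; the other monomer presets (monomer_casp14, monomer_ptm) still have to be chosen by hand.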


Sample job submission script (sub.sh) to run AlphaFold 2.3.1 in a batch job (with GPU):

#!/bin/bash
#SBATCH --job-name=alphafoldjobname    
#SBATCH --partition=gpu_p         
#SBATCH --ntasks=1                  	
#SBATCH --cpus-per-task=10
#SBATCH --gres=gpu:A100:1
#SBATCH --mem=40gb                    
#SBATCH --time=120:00:00           
#SBATCH --output=%x.%j.out     
#SBATCH --error=%x.%j.err          

cd $SLURM_SUBMIT_DIR

ml AlphaFold/2.3.1-foss-2022a-CUDA-11.7.0

alphafold [options]

where [options] needs to be replaced by the options (flags and arguments) you want to use. Other parameters of the job, such as the maximum wall clock time, the maximum memory, the number of CPU cores, and the job name, need to be modified appropriately as well.

An example of the options to use for the alphafold script:

alphafold --data_dir /apps/db/AlphaFold/2.3.1 --output_dir ./output --model_preset monomer --fasta_paths ./query.fasta --max_template_date 2021-11-17
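
According to the alphafold help text, --fasta_paths accepts several paths separated by commas, and all FASTA paths must have a unique basename because the basename is used to name the output directories. The following is a minimal sketch of building such a list; join_fasta_paths is a hypothetical helper name, not part of AlphaFold.

```shell
#!/bin/bash
# Hypothetical helper: join FASTA paths with commas for --fasta_paths,
# refusing lists in which two files share the same basename.
join_fasta_paths() {
  local dup
  dup=$(for f in "$@"; do basename "$f"; done | sort | uniq -d)
  if [ -n "$dup" ]; then
    echo "Duplicate basename(s): $dup" >&2
    return 1
  fi
  # With IFS set to a comma, "$*" expands to the arguments joined by commas
  local IFS=,
  echo "$*"
}

# Example use in a job script:
#   alphafold --fasta_paths="$(join_fasta_paths ./a.fasta ./b.fasta)" [other options]
```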

Example of job submission

sbatch sub.sh 

A method to accelerate your calculations

If you have many AlphaFold calculations, i.e., need to process many FASTA files, you can run them using an array job (see here).

In addition, you can greatly speed up your calculations by splitting each one into two jobs that run sequentially, first on the CPU nodes and then on the GPU nodes:

1. Run the first step of AlphaFold (MSA generation) on the CPU nodes in the batch partition.

2. Then run the second step of AlphaFold (structure modeling) with the --use_precomputed_msas=true option on the GPU nodes in the gpu_p partition. If the computation in this step takes less than 4 hours, you can run the job on the V100 scavenge GPU nodes via the batch partition (see here).

To do this, you need a way to stop each running element job (or element) of step 1 once it has completed the MSA generation.


With help from a GACRC user, we have prepared a shell script called check_and_stop_elements.sh:

1. This script checks for the presence of the "stop string", i.e., Running model model_1_multimer_v3_pred_0, in the .err file of each running element that you started in step 1.

2. You run this script in an interactive session (see here) or in an OOD X Desktop session (see here) to monitor your elements while they are running.

3. It checks your running elements every 5 minutes. If the "stop string" is found in the .err file of a running element, the script automatically cancels that element for you, so that you can start step 2 as early as possible.

4. You can stop the script (i.e., stop monitoring the elements in step 1) by pressing Ctrl+C.


As a quick demo, below is an example job submission script (sub_step1.sh) for running step 1 on batch:

#!/bin/bash
#SBATCH --job-name=alphaMSAs
#SBATCH --partition=batch
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32gb
#SBATCH --time=12:00:00
#SBATCH --output=%x.%j.out     
#SBATCH --error=%x.%j.err          
#SBATCH --array=1-200

cd $SLURM_SUBMIT_DIR

ml purge
ml AlphaFold/2.3.1-foss-2022a-CUDA-11.7.0
export ALPHAFOLD_DATA_DIR=/apps/db/AlphaFold/2.3.1

file=$(awk "NR==${SLURM_ARRAY_TASK_ID}" input.lst)

alphafold \
--run_relax=False \
--data_dir=$ALPHAFOLD_DATA_DIR \
--uniref90_database_path=$ALPHAFOLD_DATA_DIR/uniref90/uniref90.fasta \
--mgnify_database_path=$ALPHAFOLD_DATA_DIR/mgnify/mgy_clusters.fa \
--bfd_database_path=$ALPHAFOLD_DATA_DIR/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
--uniref30_database_path=$ALPHAFOLD_DATA_DIR/uniref30/UniRef30_2021_03 \
--pdb_seqres_database_path=$ALPHAFOLD_DATA_DIR/pdb_seqres/pdb_seqres.txt \
--template_mmcif_dir=$ALPHAFOLD_DATA_DIR/pdb_mmcif/mmcif_files \
--obsolete_pdbs_path=$ALPHAFOLD_DATA_DIR/pdb_mmcif/obsolete.dat \
--uniprot_database_path=$ALPHAFOLD_DATA_DIR/uniprot/uniprot.fasta \
--model_preset=multimer \
--num_multimer_predictions_per_model=1 \
--max_template_date=2023-10-01 \
--db_preset=full_dbs \
--output_dir=./outputs/$(basename $file .fa) \
--fasta_paths=./inputs/$file

The input.lst file used above is a single-column text file that lists the names of your FASTA files (please refer here), for example:

head -n 5 input.lst 

PFago.v8prot.zeyw.fa
PFago.v8prot.zeyx.fa
PFago.v8prot.zeyy.fa
PFago.v8prot.zeyz.fa
PFago.v8prot.zeza.fa

The FASTA files themselves are stored in a folder called inputs in your job working folder (--fasta_paths=./inputs/$file).
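
If the inputs folder already holds all of your FASTA files, input.lst can be generated rather than typed by hand. The following is a minimal sketch, assuming the files end in .fa as in the example above; make_input_list and nth_input are hypothetical helper names.

```shell
#!/bin/bash
# Hypothetical helper: write the names of all *.fa files in a folder to a list file,
# one per line, and print how many were found (the upper bound for --array=1-N).
make_input_list() {
  local dir="${1:-inputs}" list="${2:-input.lst}"
  find "$dir" -maxdepth 1 -type f -name '*.fa' -exec basename {} \; | sort > "$list"
  wc -l < "$list" | tr -d ' '
}

# Same lookup as in sub_step1.sh: print the line that matches the array index.
nth_input() {
  awk "NR==$1" "${2:-input.lst}"
}

# Example use:
#   n=$(make_input_list inputs input.lst)
#   sbatch --array=1-"$n" sub_step1.sh
```

Options given on the sbatch command line should take precedence over the #SBATCH lines in the script, so the --array value computed here would replace the fixed 1-200 range.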


Below is the shell script check_and_stop_elements.sh. You are welcome to copy it for your use:

#!/bin/bash
# Monitor the elements of an array job and cancel each one once its .err file
# contains the stop string, i.e., once it has finished MSA generation.
# Usage: ./check_and_stop_elements.sh <ARRAY_JOB_ID>

ArrayID=$1
stopString="Running model model_1_multimer_v3_pred_0"

while true; do
  numOfRunning=$(squeue -l -j "$ArrayID" | grep -c RUNNING)
  echo
  echo "Pending elements in array job $ArrayID:"
  squeue -l -j "$ArrayID" | grep PENDING
  echo
  echo "Number of running elements in array job $ArrayID:"
  echo "$numOfRunning"
  echo
  echo "Checking the stop string \"$stopString\" in alphaMSAs.*.err for each running element:"
  echo

  # $j is the element ID reported by squeue, e.g. 28026197_10
  for j in $(squeue -l -j "$ArrayID" | grep RUNNING | awk '{print $1}')
  do
    # The .err file is named after the element's own job ID (%j), not the element ID
    jobid=$(scontrol show job "$j" | grep JobId= | awk '{print $1}' | sed -nr 's|JobId=()|\1|p')
    echo -n "Checking alphaMSAs.${jobid}.err ... "
    # grep -q prints nothing; only its exit status is used
    if grep -q "$stopString" "alphaMSAs.${jobid}.err"; then
      echo -en "\e[31mthe stop string is found!\e[0m ... go to cancel job ${jobid} ($j) ... scancel $j ... "
      scancel "$j"
      echo -e "\e[32mjob canceled.\e[0m"
    else
      echo -e "the stop string is NOT found!"
    fi
  done
  sleep 300
  clear
done

Before you can run it, please make it executable with:

chmod 755 ./check_and_stop_elements.sh

Then you can open an interactive session or an OOD X Desktop session to run this script to monitor your elements when they are running on compute nodes by:

./check_and_stop_elements.sh <ARRAY_JOB_ID>

Please note that ARRAY_JOB_ID above is the ID of your array job. For example, if sq --me or sacct-gacrc -X reports a JOBID of 28026197_1, then 28026197 is the ID of your array job, and the command becomes:

./check_and_stop_elements.sh 28026197


Below is example output from running the above command line:

./check_and_stop_elements.sh 28026197

Pending elements in array job 28026197:

Number of running elements in array job 28026197:
6

Checking the stop string "Running model model_1_multimer_v3_pred_0" in alphaMSAs.*.err for each running element:

Checking alphaMSAs.28026197.err ... the stop string is found! ... go to cancel job 28026197 (28026197_10) ... scancel 28026197_10 ... job canceled.
Checking alphaMSAs.28026206.err ... the stop string is found! ... go to cancel job 28026206 (28026197_9) ... scancel 28026197_9 ... job canceled.
Checking alphaMSAs.28026205.err ... the stop string is NOT found!
Checking alphaMSAs.28026201.err ... the stop string is NOT found!
Checking alphaMSAs.28026200.err ... the stop string is NOT found!
Checking alphaMSAs.28026199.err ... the stop string is NOT found!
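
Before moving on to step 2, it can be useful to confirm which elements have not yet reached the stop string. The following is a minimal sketch that only reads the .err files and does not call Slurm; list_unfinished is a hypothetical helper name, and the file name pattern alphaMSAs.*.err assumes the job name and --error setting from sub_step1.sh.

```shell
#!/bin/bash
# Hypothetical helper: list the step 1 .err files that do not contain the stop string.
# Argument: directory holding the .err files (default: current directory).
STOP_STRING="Running model model_1_multimer_v3_pred_0"

list_unfinished() {
  local dir="${1:-.}" f
  for f in "$dir"/alphaMSAs.*.err; do
    [ -e "$f" ] || continue   # the pattern matched nothing
    grep -q "$STOP_STRING" "$f" || echo "$f"
  done
}

# Example use in the job working folder:
#   list_unfinished .
```

An empty result means every existing .err file contains the stop string. Elements that are still pending have no .err file yet, and elements that failed are not distinguished from ones still running, so the output is only a hint.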


Once all elements in step 1 have completed the MSA generation, you can go on to step 2. As mentioned above, if the computation in step 2 takes less than 4 hours, you can run the job on the V100 scavenge GPU nodes via the batch partition. Below is an example job submission script (sub_step2.sh) for running step 2 on the V100 scavenge GPU nodes:

#!/bin/bash
#SBATCH --job-name=alphafold
#SBATCH --partition=batch
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32gb
#SBATCH --gres=gpu:V100:1
#SBATCH --time=4:00:00
#SBATCH --output=%x.%j.out     
#SBATCH --error=%x.%j.err          
#SBATCH --array=1-200

cd $SLURM_SUBMIT_DIR

ml purge

ml AlphaFold/2.3.1-foss-2022a-CUDA-11.7.0
export ALPHAFOLD_DATA_DIR=/apps/db/AlphaFold/2.3.1
export TF_FORCE_UNIFIED_MEMORY=1
export XLA_PYTHON_CLIENT_MEM_FRACTION=8

file=$(awk "NR==${SLURM_ARRAY_TASK_ID}" input.lst)

alphafold \
--run_relax=False \
--data_dir=$ALPHAFOLD_DATA_DIR \
--uniref90_database_path=$ALPHAFOLD_DATA_DIR/uniref90/uniref90.fasta \
--mgnify_database_path=$ALPHAFOLD_DATA_DIR/mgnify/mgy_clusters.fa \
--bfd_database_path=$ALPHAFOLD_DATA_DIR/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
--uniref30_database_path=$ALPHAFOLD_DATA_DIR/uniref30/UniRef30_2021_03 \
--pdb_seqres_database_path=$ALPHAFOLD_DATA_DIR/pdb_seqres/pdb_seqres.txt \
--template_mmcif_dir=$ALPHAFOLD_DATA_DIR/pdb_mmcif/mmcif_files \
--obsolete_pdbs_path=$ALPHAFOLD_DATA_DIR/pdb_mmcif/obsolete.dat \
--uniprot_database_path=$ALPHAFOLD_DATA_DIR/uniprot/uniprot.fasta \
--model_preset=multimer \
--use_precomputed_msas=true \
--num_multimer_predictions_per_model=1 \
--max_template_date=2023-10-01 \
--db_preset=full_dbs \
--output_dir=./outputs/$(basename $file .fa) \
--fasta_paths=./inputs/$file

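Please note that the alphafold command above uses the --use_precomputed_msas=true option, so it reads the MSAs written in step 1 instead of running the MSA tools again. The MSA files are looked up in the output directory, so --output_dir must be the same in both steps.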
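
Since step 2 depends on the MSA files from step 1, a quick check before submitting can avoid wasted GPU jobs. The following is a minimal sketch, not a GACRC script: has_precomputed_msas is a hypothetical helper, and it assumes that AlphaFold writes the MSAs to a folder named msas under <output_dir>/<FASTA name>, which with the sample scripts would be ./outputs/<name>/<name>/msas. Please verify this layout against your own step 1 output.

```shell
#!/bin/bash
# Hypothetical helper: succeed if step 1 left MSA files for the given FASTA file.
# Assumes the layout <outputs>/<name>/<name>/msas described above (an assumption).
has_precomputed_msas() {
  local name msa_dir
  name=$(basename "$1" .fa)
  msa_dir="${2:-./outputs}/$name/$name/msas"
  # The directory must exist and contain at least one entry
  [ -d "$msa_dir" ] && [ -n "$(ls -A "$msa_dir")" ]
}

# Example use: report the inputs that are not ready for step 2
#   while read -r file; do
#     has_precomputed_msas "$file" || echo "No MSAs yet for $file"
#   done < input.lst
```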

Documentation

Details and references are at https://github.com/deepmind/alphafold.

Version 2.3.4: Short help options

ml AlphaFold/2.3.4-foss-2022a-CUDA-11.7.0-ColabFold

export ALPHAFOLD_DATA_DIR=/apps/db/AlphaFold/2.3.4

alphafold --helpshort

Full AlphaFold protein structure prediction script.
flags:

/apps/eb/AlphaFold/2.3.4-foss-2022a-CUDA-11.7.0-ColabFold/bin/alphafold:
  --[no]benchmark: Run multiple JAX model evaluations to obtain a timing that excludes the compilation time, which should be more indicative of the time
    required for inferencing many proteins.
    (default: 'false')
  --bfd_database_path: Path to the BFD database for use by HHblits.
    (default: '/apps/db/AlphaFold/2.3.4/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt')
  --data_dir: Path to directory of supporting data.
    (default: '/apps/db/AlphaFold/2.3.4')
  --db_preset: <full_dbs|reduced_dbs>: Choose preset MSA database configuration - smaller genetic database config (reduced_dbs) or full genetic database config
    (full_dbs)
    (default: 'full_dbs')
  --fasta_paths: Paths to FASTA files, each containing a prediction target that will be folded one after another. If a FASTA file contains multiple sequences,
    then it will be folded as a multimer. Paths should be separated by commas. All FASTA paths must have a unique basename as the basename is used to name the
    output directories for each prediction.
    (a comma separated list)
  --hhblits_binary_path: Path to the HHblits executable.
    (default: '/apps/eb/HH-suite/3.3.0-gompi-2022a/bin/hhblits')
  --hhsearch_binary_path: Path to the HHsearch executable.
    (default: '/apps/eb/HH-suite/3.3.0-gompi-2022a/bin/hhsearch')
  --hmmbuild_binary_path: Path to the hmmbuild executable.
    (default: '/apps/eb/HMMER/3.3.2-gompi-2022a/bin/hmmbuild')
  --hmmsearch_binary_path: Path to the hmmsearch executable.
    (default: '/apps/eb/HMMER/3.3.2-gompi-2022a/bin/hmmsearch')
  --jackhmmer_binary_path: Path to the JackHMMER executable.
    (default: '/apps/eb/HMMER/3.3.2-gompi-2022a/bin/jackhmmer')
  --kalign_binary_path: Path to the Kalign executable.
    (default: '/apps/eb/Kalign/3.3.5-GCCcore-11.3.0/bin/kalign')
  --max_template_date: Maximum template release date to consider. Important if folding historical test sets.
  --mgnify_database_path: Path to the MGnify database for use by JackHMMER.
    (default: '/apps/db/AlphaFold/2.3.4/mgnify/mgy_clusters_2022_05.fa')
  --model_preset: <monomer|monomer_casp14|monomer_ptm|multimer>: Choose preset model configuration - the monomer model, the monomer model with extra ensembling,
    monomer model with pTM head, or multimer model
    (default: 'monomer')
  --models_to_relax: <all|best|none>: The models to run the final relaxation step on. If `all`, all models are relaxed, which may be time consuming. If `best`,
    only the most confident model is relaxed. If `none`, relaxation is not run. Turning off relaxation might result in predictions with distracting
    stereochemical violations but might help in case you are having issues with the relaxation stage.
    (default: 'best')
  --num_multimer_predictions_per_model: How many predictions (each with a different random seed) will be generated per model. E.g. if this is 2 and there are 5
    models then there will be 10 predictions per input. Note: this FLAG only applies if model_preset=multimer
    (default: '5')
    (an integer)
  --obsolete_pdbs_path: Path to file containing a mapping from obsolete PDB IDs to the PDB IDs of their replacements.
    (default: '/apps/db/AlphaFold/2.3.4/pdb_mmcif/obsolete.dat')
  --output_dir: Path to a directory that will store the results.
  --pdb70_database_path: Path to the PDB70 database for use by HHsearch.
    (default: '/apps/db/AlphaFold/2.3.4/pdb70/pdb70')
  --pdb_seqres_database_path: Path to the PDB seqres database for use by hmmsearch.
  --random_seed: The random seed for the data pipeline. By default, this is randomly generated. Note that even if this is set, Alphafold may still not be
    deterministic, because processes like GPU inference are nondeterministic.
    (an integer)
  --small_bfd_database_path: Path to the small version of BFD used with the "reduced_dbs" preset.
  --template_mmcif_dir: Path to a directory with template mmCIF structures, each named <pdb_id>.cif
    (default: '/apps/db/AlphaFold/2.3.4/pdb_mmcif/mmcif_files')
  --uniprot_database_path: Path to the Uniprot database for use by JackHMMer.
  --uniref30_database_path: Path to the UniRef30 database for use by HHblits.
    (default: '/apps/db/AlphaFold/2.3.4/uniref30/UniRef30_2021_03')
  --uniref90_database_path: Path to the Uniref90 database for use by JackHMMER.
    (default: '/apps/db/AlphaFold/2.3.4/uniref90/uniref90.fasta')
  --[no]use_gpu_relax: Whether to relax on GPU. Relax on GPU can be much faster than CPU, so it is recommended to enable if possible. GPUs must be available if
    this setting is enabled.
    (default: 'true')
  --[no]use_precomputed_msas: Whether to read MSAs that have been written to disk instead of running the MSA tools. The MSA files are looked up in the output
    directory, so it must stay the same between multiple runs that are to reuse the MSAs. WARNING: This will not check if the sequence, database or
    configuration have changed.
    (default: 'false')

Try --helpfull to get a list of all flags.

Version 2.3.4: Full help options

ml AlphaFold/2.3.4-foss-2022a-CUDA-11.7.0-ColabFold

export ALPHAFOLD_DATA_DIR=/apps/db/AlphaFold/2.3.4

alphafold --helpfull

Full AlphaFold protein structure prediction script.
flags:

/apps/eb/AlphaFold/2.3.4-foss-2022a-CUDA-11.7.0-ColabFold/bin/alphafold:
  --[no]benchmark: Run multiple JAX model evaluations to obtain a timing that excludes the compilation time, which should be more indicative of the time
    required for inferencing many proteins.
    (default: 'false')
  --bfd_database_path: Path to the BFD database for use by HHblits.
    (default: '/apps/db/AlphaFold/2.3.4/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt')
  --data_dir: Path to directory of supporting data.
    (default: '/apps/db/AlphaFold/2.3.4')
  --db_preset: <full_dbs|reduced_dbs>: Choose preset MSA database configuration - smaller genetic database config (reduced_dbs) or full genetic database config
    (full_dbs)
    (default: 'full_dbs')
  --fasta_paths: Paths to FASTA files, each containing a prediction target that will be folded one after another. If a FASTA file contains multiple sequences,
    then it will be folded as a multimer. Paths should be separated by commas. All FASTA paths must have a unique basename as the basename is used to name the
    output directories for each prediction.
    (a comma separated list)
  --hhblits_binary_path: Path to the HHblits executable.
    (default: '/apps/eb/HH-suite/3.3.0-gompi-2022a/bin/hhblits')
  --hhsearch_binary_path: Path to the HHsearch executable.
    (default: '/apps/eb/HH-suite/3.3.0-gompi-2022a/bin/hhsearch')
  --hmmbuild_binary_path: Path to the hmmbuild executable.
    (default: '/apps/eb/HMMER/3.3.2-gompi-2022a/bin/hmmbuild')
  --hmmsearch_binary_path: Path to the hmmsearch executable.
    (default: '/apps/eb/HMMER/3.3.2-gompi-2022a/bin/hmmsearch')
  --jackhmmer_binary_path: Path to the JackHMMER executable.
    (default: '/apps/eb/HMMER/3.3.2-gompi-2022a/bin/jackhmmer')
  --kalign_binary_path: Path to the Kalign executable.
    (default: '/apps/eb/Kalign/3.3.5-GCCcore-11.3.0/bin/kalign')
  --max_template_date: Maximum template release date to consider. Important if folding historical test sets.
  --mgnify_database_path: Path to the MGnify database for use by JackHMMER.
    (default: '/apps/db/AlphaFold/2.3.4/mgnify/mgy_clusters_2022_05.fa')
  --model_preset: <monomer|monomer_casp14|monomer_ptm|multimer>: Choose preset model configuration - the monomer model, the monomer model with extra ensembling,
    monomer model with pTM head, or multimer model
    (default: 'monomer')
  --models_to_relax: <all|best|none>: The models to run the final relaxation step on. If `all`, all models are relaxed, which may be time consuming. If `best`,
    only the most confident model is relaxed. If `none`, relaxation is not run. Turning off relaxation might result in predictions with distracting
    stereochemical violations but might help in case you are having issues with the relaxation stage.
    (default: 'best')
  --num_multimer_predictions_per_model: How many predictions (each with a different random seed) will be generated per model. E.g. if this is 2 and there are 5
    models then there will be 10 predictions per input. Note: this FLAG only applies if model_preset=multimer
    (default: '5')
    (an integer)
  --obsolete_pdbs_path: Path to file containing a mapping from obsolete PDB IDs to the PDB IDs of their replacements.
    (default: '/apps/db/AlphaFold/2.3.4/pdb_mmcif/obsolete.dat')
  --output_dir: Path to a directory that will store the results.
  --pdb70_database_path: Path to the PDB70 database for use by HHsearch.
    (default: '/apps/db/AlphaFold/2.3.4/pdb70/pdb70')
  --pdb_seqres_database_path: Path to the PDB seqres database for use by hmmsearch.
  --random_seed: The random seed for the data pipeline. By default, this is randomly generated. Note that even if this is set, Alphafold may still not be
    deterministic, because processes like GPU inference are nondeterministic.
    (an integer)
  --small_bfd_database_path: Path to the small version of BFD used with the "reduced_dbs" preset.
  --template_mmcif_dir: Path to a directory with template mmCIF structures, each named <pdb_id>.cif
    (default: '/apps/db/AlphaFold/2.3.4/pdb_mmcif/mmcif_files')
  --uniprot_database_path: Path to the Uniprot database for use by JackHMMer.
  --uniref30_database_path: Path to the UniRef30 database for use by HHblits.
    (default: '/apps/db/AlphaFold/2.3.4/uniref30/UniRef30_2021_03')
  --uniref90_database_path: Path to the Uniref90 database for use by JackHMMER.
    (default: '/apps/db/AlphaFold/2.3.4/uniref90/uniref90.fasta')
  --[no]use_gpu_relax: Whether to relax on GPU. Relax on GPU can be much faster than CPU, so it is recommended to enable if possible. GPUs must be available if
    this setting is enabled.
    (default: 'true')
  --[no]use_precomputed_msas: Whether to read MSAs that have been written to disk instead of running the MSA tools. The MSA files are looked up in the output
    directory, so it must stay the same between multiple runs that are to reuse the MSAs. WARNING: This will not check if the sequence, database or
    configuration have changed.
    (default: 'false')

absl.app:
  -?,--[no]help: show this help
    (default: 'false')
  --[no]helpfull: show full help
    (default: 'false')
  --[no]helpshort: show this help
    (default: 'false')
  --[no]helpxml: like --helpfull, but generates XML output
    (default: 'false')
  --[no]only_check_args: Set to true to validate args and exit.
    (default: 'false')
  --[no]pdb: Alias for --pdb_post_mortem.
    (default: 'false')
  --[no]pdb_post_mortem: Set to true to handle uncaught exceptions with PDB post mortem.
    (default: 'false')
  --profile_file: Dump profile information to a file (for python -m pstats). Implies --run_with_profiling.
  --[no]run_with_pdb: Set to true for PDB debug mode
    (default: 'false')
  --[no]run_with_profiling: Set to true for profiling the script. Execution will be slower, and the output format might change over time.
    (default: 'false')
  --[no]use_cprofile_for_profiling: Use cProfile instead of the profile module for profiling. This has no effect unless --run_with_profiling is set.
    (default: 'true')

absl.logging:
  --[no]alsologtostderr: also log to stderr?
    (default: 'false')
  --log_dir: directory to write logfiles into
    (default: '')
  --logger_levels: Specify log level of loggers. The format is a CSV list of `name:level`. Where `name` is the logger name used with `logging.getLogger()`, and
    `level` is a level name  (INFO, DEBUG, etc). e.g. `myapp.foo:INFO,other.logger:DEBUG`
    (default: '')
  --[no]logtostderr: Should only log to stderr?
    (default: 'false')
  --[no]showprefixforinfo: If False, do not prepend prefix to info messages when it's logged to stderr, --verbosity is set to INFO level, and python logging is
    used.
    (default: 'true')
  --stderrthreshold: log messages at this level, or more severe, to stderr in addition to the logfile.  Possible values are 'debug', 'info', 'warning', 'error',
    and 'fatal'.  Obsoletes --alsologtostderr. Using --alsologtostderr cancels the effect of this flag. Please also note that this flag is subject to
    --verbosity and requires logfile not be stderr.
    (default: 'fatal')
  -v,--verbosity: Logging verbosity level. Messages logged at this level or lower will be included. Set to 1 for debug logging. If the flag was not set or
    supplied, the value will be changed from the default of -1 (warning) to 0 (info) after flags are parsed.
    (default: '-1')
    (an integer)

absl.testing.absltest:
  --test_random_seed: Random seed for testing. Some test frameworks may change the default value of this flag between runs, so it is not appropriate for seeding
    probabilistic tests.
    (default: '301')
    (an integer)
  --test_randomize_ordering_seed: If positive, use this as a seed to randomize the execution order for test cases. If "random", pick a random seed to use. If 0
    or not set, do not randomize test case execution order. This flag also overrides the TEST_RANDOMIZE_ORDERING_SEED environment variable.
    (default: '')
  --test_srcdir: Root of directory tree where source files live
    (default: '')
  --test_tmpdir: Directory for temporary testing files
    (default: '/tmp/absl_testing')
  --xml_output_file: File to store XML test results
    (default: '')

tensorflow.python.ops.parallel_for.pfor:
  --[no]op_conversion_fallback_to_while_loop: DEPRECATED: Flag is ignored.
    (default: 'true')

tensorflow.python.tpu.client.client:
  --[no]hbm_oom_exit: Exit the script when the TPU HBM is OOM.
    (default: 'true')
  --[no]runtime_oom_exit: Exit the script when the TPU runtime is OOM.
    (default: 'true')

tensorflow.python.tpu.tensor_tracer_flags:
  --delta_threshold: Log if history based diff crosses this threshold.
    (default: '0.5')
    (a number)
  --[no]tt_check_filter: Terminate early to check op name filtering.
    (default: 'false')
  --[no]tt_single_core_summaries: Report single core metric and avoid aggregation.
    (default: 'false')

absl.flags:
  --flagfile: Insert flag definitions from the given file into the command line.
    (default: '')
  --undefok: comma-separated list of flag names that it is okay to specify on the command line even if the program does not define a flag with that name.
    IMPORTANT: flags in this list that have arguments MUST use the --flag=value format.
    (default: '')
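
The --flagfile option listed above can shorten job scripts that repeat the same database flags. The following is a minimal sketch, untested on Sapelo2: write_db_flagfile is a hypothetical helper, the selection of flags is only illustrative, and it assumes the usual flag file layout of one --flag=value per line.

```shell
#!/bin/bash
# Hypothetical helper: write a few database flags shared by the sample scripts
# to a flag file, one --flag=value per line.
write_db_flagfile() {
  local data_dir="$1" out="$2"
  cat > "$out" <<EOF
--data_dir=$data_dir
--uniref90_database_path=$data_dir/uniref90/uniref90.fasta
--template_mmcif_dir=$data_dir/pdb_mmcif/mmcif_files
--obsolete_pdbs_path=$data_dir/pdb_mmcif/obsolete.dat
--db_preset=full_dbs
EOF
}

# Example use:
#   write_db_flagfile /apps/db/AlphaFold/2.3.4 db.flags
#   alphafold --flagfile=db.flags --fasta_paths=./input.fasta --output_dir=./output --max_template_date=2023-10-01
```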


Installation

  • Version 2.3.1: Installed using EasyBuild.
  • The database files are installed in /apps/db/AlphaFold/

System

64-bit Linux