ARCHIVED: Run AlphaFold 2 on Carbonate at IU
Overview
An implementation of the inference pipeline of AlphaFold v2.0, an application that predicts the 3D structure of arbitrary proteins, is available on Carbonate at Indiana University. AlphaFold 2 is the model that was entered in CASP14 and published in Nature.
The Indiana University research supercomputers use the Slurm workload manager for resource management and job scheduling; see Use Slurm to submit and manage jobs on IU's research computing systems.
In Slurm, compute resources are grouped into logical sets called partitions, which are essentially job queues. To view details about available partitions and nodes, use the sinfo command; for more about using sinfo, see the "View partition and node information" section of Use Slurm to submit and manage jobs on IU's research computing systems.
To take advantage of the package's GPU capabilities, you should run AlphaFold on Carbonate's GPU partition. For instructions on setting up a project that provides access to Carbonate's GPU partition, see Use RT Projects to request and manage access to specialized Research Technologies resources. For more about running GPU jobs, see Run GPU-accelerated jobs on Quartz or Big Red 200 at IU.
Set up your user environment
On the research supercomputers at Indiana University, the Modules environment management system provides a convenient method for dynamically customizing your software environment.
To use AlphaFold on Carbonate, you first must add the Anaconda and AlphaFold modules to your user environment; on the command line, enter:
module load anaconda/python3.8/2020.07 alphafold
Submit an AlphaFold job
To submit an AlphaFold batch job on Carbonate, create a Slurm submission script (for example, my_alphafold_job.script) that specifies the application you want to run and the resources required to run it, and then submit it with the sbatch command (for example, sbatch my_alphafold_job.script). A Slurm submission script for running a batch AlphaFold job on Carbonate may look similar to the following:
#!/bin/bash
#SBATCH -J alphafold_example
#SBATCH -p gpu
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=24
#SBATCH --gpus-per-node=1
#SBATCH --time=04:00:00
#SBATCH -A slurm-account-name
module load anaconda/python3.8/2020.07
module load alphafold
export TF_FORCE_UNIFIED_MEMORY='1'
export XLA_PYTHON_CLIENT_MEM_FRACTION='4.0'
python /N/soft/rhel7/alphafold/alphafold/run_alphafold.py \
--fasta_paths=T1050.fasta \
--output_dir=/N/slate/$USER \
--model_names=model_1,model_2,model_3,model_4,model_5 \
--max_template_date=2020-05-14 \
--preset=full_dbs \
--benchmark=False \
--logtostderr \
--flagfile=alphafold_flags
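Before submitting the script with sbatch, you may want to confirm that the input files it references are present in the submit directory. The following is a minimal sketch (plain shell, no Slurm commands needed); the file names are the ones used in the example script:

```shell
# Sanity check before submission: verify that the FASTA input and the
# flag file referenced by the example script exist in the current
# directory. File names match the example script above.
missing=0
for f in T1050.fasta alphafold_flags; do
    if [ -e "$f" ]; then
        echo "found: $f"
    else
        echo "missing: $f"
        missing=$((missing + 1))
    fi
done
echo "$missing file(s) missing"
```

If both files are present, submit the job with sbatch my_alphafold_job.script.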
In the above example:
- The first line indicates that the script should be read using the Bash command interpreter.
- The next lines are #SBATCH directives used to pass options to the sbatch command:
  - -J alphafold_example names the job alphafold_example.
  - -p gpu specifies that the job should run in the GPU partition.
  - --nodes=1 requests that a minimum of one node be allocated to this job.
  - --ntasks-per-node=1 specifies that one task should be launched per node.
  - --cpus-per-task=24 allots 24 processors to the task.
  - --gpus-per-node=1 allots one GPU to the task.
  - --time=04:00:00 allots a maximum of four hours for the job to run.
  - -A slurm-account-name indicates the Slurm Account Name to which resources used by this job should be charged. Users belonging to projects approved through RT Projects can find their allocation's Slurm Account Name on the "Home" page in RT Projects; look under "Submitting Slurm Jobs with your Project's Account". Alternatively, on the "Home" page, under "Allocations", select an allocation and look in the table under "Allocation Attributes".
- The final command calls the AlphaFold script:
  - --fasta_paths=T1050.fasta specifies the protein that should be folded.
  - --output_dir=/N/slate/$USER specifies where the folded protein should be placed. UITS Research Technologies recommends using a directory on a large-capacity drive, such as Slate, rather than a directory in your home directory space.
For descriptions of the other flags, refer to the AlphaFold documentation.
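The --fasta_paths input is a plain-text FASTA file: a header line beginning with > followed by the amino-acid sequence. As an illustration (the sequence below is a made-up placeholder, not an actual CASP target), a minimal single-sequence FASTA file can be created like this:

```shell
# Create a minimal FASTA file: a '>' header line followed by the
# amino-acid sequence. The sequence here is a hypothetical placeholder.
cat > example.fasta <<'EOF'
>example_protein hypothetical placeholder sequence
MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ
EOF
grep -c '^>' example.fasta   # prints the number of sequences: 1
```

Pass the file to AlphaFold with --fasta_paths=example.fasta.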
Notes
- AlphaFold uses large databases, which are over 2 TB in total size, to process the proteins. These databases are available on high-speed flash storage on Slate-Scratch and should not be downloaded separately by users. The locations of the databases are specified in the alphafold_flags file, which should look like this:
--jackhmmer_binary_path=/N/soft/rhel7/alphafold/conda/bin/jackhmmer
--uniref90_database_path=/N/scratch/afdb/nvme/13AUG2021/uniref90/uniref90.fasta
--mgnify_database_path=/N/scratch/afdb/nvme/13AUG2021/mgnify/mgy_clusters_2018_12.fa
--pdb70_database_path=/N/scratch/afdb/nvme/13AUG2021/pdb70/pdb70
--data_dir=/N/scratch/afdb/nvme/13AUG2021
--template_mmcif_dir=/N/scratch/afdb/nvme/13AUG2021/pdb_mmcif/mmcif_files
--obsolete_pdbs_path=/N/scratch/afdb/nvme/13AUG2021/pdb_mmcif/obsolete.dat
--uniclust30_database_path=/N/scratch/afdb/nvme/13AUG2021/uniclust30/uniclust30_2018_08/uniclust30_2018_08
--bfd_database_path=/N/scratch/afdb/nvme/13AUG2021/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt
--hhblits_binary_path=/N/soft/rhel7/alphafold/conda/bin/hhblits
--hhsearch_binary_path=/N/soft/rhel7/alphafold/conda/bin/hhsearch
--kalign_binary_path=/N/soft/rhel7/alphafold/conda/bin/kalign
- If you choose to run AlphaFold from your own directory, note that a quirk in the application requires stereo_chemical_props.txt to be available in your directory. You can copy it from the /N/soft/rhel7/alphafold/example/alphafold/common directory.
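The --flagfile option reads additional command-line flags from a file, one flag per line, so you can keep your own copy of the flag file and edit it as needed. A small sketch (the two paths are taken from the alphafold_flags listing above; a real file would include all of the flags shown there):

```shell
# Write a flag file with one flag per line, then count the flags.
# A complete file would contain every flag from the listing above.
cat > my_alphafold_flags <<'EOF'
--data_dir=/N/scratch/afdb/nvme/13AUG2021
--jackhmmer_binary_path=/N/soft/rhel7/alphafold/conda/bin/jackhmmer
EOF
grep -c '^--' my_alphafold_flags   # prints 2
```

You would then point run_alphafold.py at your copy with --flagfile=my_alphafold_flags.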
Get help
If you need help or have a question about using AlphaFold on Carbonate, contact the UITS Research Applications and Deep Learning team.
This is document bhwm in the Knowledge Base.
Last modified on 2023-12-17 07:04:55.