Biocluster Alphafold

From Carl R. Woese Institute for Genomic Biology - University of Illinois Urbana-Champaign
Latest revision as of 12:06, 26 February 2024

About

  • AlphaFold is a highly accurate protein structure prediction program from DeepMind.
  • More information at https://github.com/deepmind/alphafold/

How to Run

  • Load the alphafold module. This loads AlphaFold, Singularity, and the AlphaFold databases.
module load alphafold/2.3.2
  • Create a scratch folder. This is important so that temporary data goes to the node's local scratch disk instead of the system's /tmp folder. /tmp has limited space; if it fills up, the node becomes unresponsive and jobs fail.
mkdir /scratch/$SLURM_JOB_ID
export TMPDIR=/scratch/$SLURM_JOB_ID
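The two commands above can be wrapped with a cleanup trap so the scratch folder is removed even if the job fails partway. This is a sketch, not part of the cluster documentation: SCRATCH_BASE and JOB_ID are stand-ins for /scratch and $SLURM_JOB_ID so the pattern can be tried outside a Slurm job.

```shell
#!/bin/bash
# Stand-ins so this runs outside Slurm; inside a job use /scratch and $SLURM_JOB_ID.
SCRATCH_BASE="${SCRATCH_BASE:-/tmp}"
JOB_ID="${SLURM_JOB_ID:-$$}"

export TMPDIR="$SCRATCH_BASE/$JOB_ID"
mkdir -p "$TMPDIR"

# Remove the scratch folder whenever the script exits, even on failure.
trap 'rm -rf "$SCRATCH_BASE/$JOB_ID"' EXIT

# ... run alphafold here ...
```

With the trap in place, the explicit `rm -fr` at the end of the job script becomes optional.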
  • Run run_singularity.py to run AlphaFold. This wrapper script makes the AlphaFold Singularity container easier to run.
run_singularity.py --data-dir $BIODB --cpus $SLURM_NTASKS --use-gpu --output-dir example_output --fasta-paths example.fasta 
  • The --data-dir parameter should be set to $BIODB, which points to the location of the AlphaFold databases.
  • The --cpus parameter should be set to $SLURM_NTASKS, a variable equal to the number of processors you reserved.
  • --use-gpu enables the use of GPUs. Singularity will automatically use the GPUs you have reserved.
  • The --output-dir parameter specifies where the output files should go. Change this to a folder in your home directory.
  • The --fasta-paths parameter specifies your input FASTA files. Only one sequence per file is allowed. To run multiple sequences, put each sequence in its own file, then list the files like below:
--fasta-paths example.fasta,example2.fasta,example3.fasta
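If you start from a single multi-sequence FASTA, it can be split into one file per sequence with a short awk one-liner. This is a sketch: multi.fasta and the seq_N.fasta naming are made up for the example.

```shell
# Create a hypothetical two-sequence input for the example.
cat > multi.fasta <<'EOF'
>seqA
MKTAYIAKQR
>seqB
MSILVTRPSP
EOF

# Start a new numbered output file at each ">" header line,
# then write every line to the current output file.
awk '/^>/{n++; out=sprintf("seq_%d.fasta", n)} {print > out}' multi.fasta
```

The resulting seq_1.fasta, seq_2.fasta, ... files can then be passed to --fasta-paths as a comma-separated list.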

Example Job Script

#!/bin/bash
# ----------------SLURM Parameters----------------
#SBATCH -n 4
#SBATCH -N 1
#SBATCH -p gpu
#SBATCH --gres=gpu:1
#SBATCH --mem 70G

# ----------------Load Modules--------------------
module load alphafold/2.3.2
# ----------------Commands------------------------
mkdir /scratch/$SLURM_JOB_ID
export TMPDIR=/scratch/$SLURM_JOB_ID

run_singularity.py --data-dir $BIODB --cpus $SLURM_NTASKS --use-gpu --db-preset full_dbs --output-dir output \
--fasta-paths example.fasta

rm -fr /scratch/$SLURM_JOB_ID
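A CPU-only variant of the script simply drops the GPU options. This is a sketch: the partition name is a placeholder, since the non-GPU partitions on this cluster are not listed here (check `sinfo` for what is available).

```shell
#!/bin/bash
# ----------------SLURM Parameters----------------
#SBATCH -n 8
#SBATCH -N 1
#SBATCH --mem 70G
# Placeholder partition; pick a non-GPU partition from `sinfo`.
#SBATCH -p normal

# ----------------Load Modules--------------------
module load alphafold/2.3.2

# ----------------Commands------------------------
mkdir /scratch/$SLURM_JOB_ID
export TMPDIR=/scratch/$SLURM_JOB_ID

# Same wrapper invocation, but without --use-gpu.
run_singularity.py --data-dir $BIODB --cpus $SLURM_NTASKS --db-preset full_dbs \
--output-dir output --fasta-paths example.fasta

rm -fr /scratch/$SLURM_JOB_ID
```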

Submit Job

  • Submit the job to the cluster:
sbatch example.sh

Parameters

  • These are the parameters for run_singularity.py, as shown by running run_singularity.py --help:
    -h, --help            show this help message and exit
  --fasta-paths FASTA_PATHS [FASTA_PATHS ...], -f FASTA_PATHS [FASTA_PATHS ...]
                        Paths to FASTA files, each containing one sequence.
                        All FASTA paths must have a unique basename as the
                        basename is used to name the output directories for
                        each prediction.
  --max-template-date MAX_TEMPLATE_DATE, -t MAX_TEMPLATE_DATE
                        Maximum template release date to consider (ISO-8601
                        format - i.e. YYYY-MM-DD). Important if folding
                        historical test sets.
  --db-preset {reduced_dbs,full_dbs}
                        Choose preset model configuration - no ensembling with
                        uniref90 + bfd + uniclust30 (full_dbs), or 8 model
                        ensemblings with uniref90 + bfd + uniclust30 (casp14).
  --model-preset {monomer,monomer_casp14,monomer_ptm,multimer}
                        Choose preset model configuration - the monomer model,
                        the monomer model with extra ensembling, monomer model
                        with pTM head, or multimer model
  --num-multimer-predictions-per-model NUM_MULTIMER_PREDICTIONS_PER_MODEL
                        How many predictions (each with a different random
                        seed) will be generated per model. E.g. if this is 2
                        and there are 5 models then there will be 10
                        predictions per input. Note: this FLAG only applies if
                        model_preset=multimer
  --benchmark, -b       Run multiple JAX model evaluations to obtain a timing
                        that excludes the compilation time, which should be
                        more indicative of the time required for inferencing
                        many proteins.
  --use-precomputed-msas
                        Whether to read MSAs that have been written to disk
                        instead of running the MSA tools. The MSA files are
                        looked up in the output directory, so it must stay the
                        same between multiple runs that are to reuse the MSAs.
                        WARNING: This will not check if the sequence, database
                        or configuration have changed.
  --data-dir DATA_DIR, -d DATA_DIR
                        Path to directory with supporting data: AlphaFold
                        parameters and genetic and template databases. Set to
                        the target of download_all_databases.sh.
  --docker-image DOCKER_IMAGE
                        Alphafold docker image.
  --output-dir OUTPUT_DIR, -o OUTPUT_DIR
                        Output directory for results.
  --use-gpu             Enable NVIDIA runtime to run with GPUs.
  --models-to-relax MODELS_TO_RELAX
                        Whether to run the final relaxation step on the
                        predicted models. Turning relax off might result in
                        predictions with distracting stereochemical violations
                        but might help in case you are having issues with the
                        relaxation stage.
  --enable-gpu-relax    Run relax on GPU if GPU is enabled.
  --gpu-devices GPU_DEVICES
                        Comma separated list of devices to pass to
                        NVIDIA_VISIBLE_DEVICES.
  --cpus CPUS, -c CPUS  Number of CPUs to use.
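As an illustration of combining these flags, a multimer run might look like the following. This is a sketch, not a documented recipe for this cluster: complex.fasta is a made-up input, and per the upstream AlphaFold documentation a multimer FASTA contains one record per chain of the complex.

```shell
run_singularity.py --data-dir $BIODB --cpus $SLURM_NTASKS --use-gpu \
--model-preset multimer --num-multimer-predictions-per-model 2 \
--output-dir multimer_output --fasta-paths complex.fasta
```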

Issues

  • If you receive an error like
RuntimeError: HHSearch failed

you most likely need to increase the amount of memory you reserve in your job script.
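For example, raise the request in the #SBATCH header of the job script and resubmit. The value below is only illustrative; the memory actually needed depends on your sequences and database preset.

```shell
#SBATCH --mem 120G
```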

References

  • https://github.com/deepmind/alphafold
  • https://github.com/dialvarezs/alphafold
  • https://hub.docker.com/r/catgumag/alphafold