Biocluster Alphafold
Revision as of 08:08, 19 February 2024
About
- AlphaFold is a highly accurate protein structure prediction program.
- More information at https://github.com/deepmind/alphafold/
How to Run
- Load alphafold module. This loads alphafold, singularity, and the alphafold databases.
module load alphafold/2.3.2
- Create a scratch folder. This is important so that temporary data goes to the local scratch disk instead of the system's /tmp folder. /tmp has limited space; if it fills up, the node becomes unresponsive and jobs fail.
mkdir /scratch/$SLURM_JOB_ID
export TMPDIR=/scratch/$SLURM_JOB_ID
- Run alphafold with run_singularity.py, a wrapper script that makes the alphafold singularity container easier to run.
run_singularity.py --data-dir $BIODB --cpus $SLURM_NTASKS --use-gpu --output-dir example_output --fasta-paths example.fasta
- --data-dir should be set to $BIODB, which points to the location of the alphafold databases.
- --cpus should be set to $SLURM_NTASKS, a variable equal to the number of processors you reserved.
- --use-gpu enables the use of GPUs. Singularity will automatically use the number of GPUs you have reserved.
- --output-dir specifies where the output files should go. Change this parameter to a folder in your home folder.
- --fasta-paths specifies your input FASTA files. Only one sequence per file is allowed. To run on multiple sequences, put each sequence in its own file, then specify multiple files as below:
--fasta-paths example.fasta,example2.fasta,example3.fasta
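Because each input file may hold only one sequence, a multi-sequence FASTA file has to be split before it is passed to --fasta-paths. The following is a minimal sketch and not part of the alphafold module: split_fasta and join_fasta_paths are hypothetical helper names, and the seq_NNN.fasta naming is an assumption chosen so that every file gets a unique basename.

```shell
#!/bin/bash
# Hypothetical helpers (not provided by the alphafold module).

# split_fasta INPUT OUTDIR
# Writes each record of a multi-sequence FASTA file to OUTDIR/seq_NNN.fasta.
split_fasta() {
    mkdir -p "$2" || return 1
    awk -v dir="$2" '
        /^>/ {                      # a header line starts a new record
            if (out != "") close(out)
            n++
            out = sprintf("%s/seq_%03d.fasta", dir, n)
        }
        out != "" { print > out }   # copy header and sequence lines
    ' "$1"
}

# join_fasta_paths DIR
# Prints the .fasta files in DIR as one comma-separated list.
join_fasta_paths() {
    fasta_list=""
    for fasta_file in "$1"/*.fasta; do
        [ -e "$fasta_file" ] || continue
        fasta_list="${fasta_list:+$fasta_list,}$fasta_file"
    done
    printf '%s\n' "$fasta_list"
}
```

After running split_fasta on an input file, the output of join_fasta_paths for the same folder can be used as the value of --fasta-paths.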
Example Job Script
#!/bin/bash
# ----------------SLURM Parameters----------------
#SBATCH -n 4
#SBATCH -N 1
#SBATCH -p gpu
#SBATCH --gres=gpu:1
#SBATCH --mem 70G
# ----------------Load Modules--------------------
module load alphafold/2.3.2
# ----------------Commands------------------------
mkdir /scratch/$SLURM_JOB_ID
export TMPDIR=/scratch/$SLURM_JOB_ID
run_singularity.py --data-dir $BIODB --cpus $SLURM_NTASKS --use-gpu --db-preset full_dbs --output-dir output \
--fasta-paths example.fasta
rm -fr /scratch/$SLURM_JOB_ID
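The job script above removes the scratch folder with a final rm command, which is skipped if the script exits before reaching that line. One possible variation, shown here only as a sketch, registers the cleanup with the shell's trap builtin so that it runs when the script exits. setup_scratch is a hypothetical helper name, and the manual_ fallback used when $SLURM_JOB_ID is unset is an assumption added for testing outside of a job.

```shell
#!/bin/bash
# Hypothetical helper (not provided by the alphafold module).

# setup_scratch ROOT
# Creates ROOT/<job id>, points TMPDIR at it, and removes it on exit.
setup_scratch() {
    # Fall back to the shell PID when SLURM_JOB_ID is unset (assumption,
    # useful only when trying this outside of a Slurm job).
    scratch_dir="$1/${SLURM_JOB_ID:-manual_$$}"
    mkdir -p "$scratch_dir" || return 1
    export TMPDIR="$scratch_dir"
    # The EXIT trap runs when the shell exits, even on an early exit.
    trap 'rm -rf "$scratch_dir"' EXIT
}
```

In a job script, calling setup_scratch /scratch would take the place of the mkdir, export, and rm lines.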
Submit Job
- Submit job to the cluster
sbatch example.sh
Parameters
- These are all the parameters for run_singularity.py. The same list can be displayed by running run_singularity.py --help
--fasta-paths FASTA_PATHS [FASTA_PATHS ...], -f FASTA_PATHS [FASTA_PATHS ...]
    Paths to FASTA files, each containing one sequence. All FASTA paths must have a unique basename as the basename is used to name the output directories for each prediction.
--is-prokaryote-list IS_PROKARYOTE_LIST [IS_PROKARYOTE_LIST ...]
    Optional for multimer system, not used by the single chain system. This list should contain a boolean for each fasta specifying true where the target complex is from a prokaryote, and false where it is not, or where the origin is unknown. These values determine the pairing method for the MSA.
--max-template-date MAX_TEMPLATE_DATE, -t MAX_TEMPLATE_DATE
    Maximum template release date to consider (ISO-8601 format - i.e. YYYY-MM-DD). Important if folding historical test sets.
--db-preset {reduced_dbs,full_dbs}
    Choose preset model configuration - no ensembling with uniref90 + bfd + uniclust30 (full_dbs), or 8 model ensemblings with uniref90 + bfd + uniclust30 (casp14).
--model-preset {monomer,monomer_casp14,monomer_ptm,multimer}
    Choose preset model configuration - the monomer model, the monomer model with extra ensembling, monomer model with pTM head, or multimer model
--benchmark, -b
    Run multiple JAX model evaluations to obtain a timing that excludes the compilation time, which should be more indicative of the time required for inferencing many proteins.
--use-precomputed-msas
    Whether to read MSAs that have been written to disk. WARNING: This will not check if the sequence, database or configuration have changed.
--data-dir DATA_DIR, -d DATA_DIR
    Path to directory with supporting data: AlphaFold parameters and genetic and template databases. Set to the target of download_all_databases.sh.
--docker-image DOCKER_IMAGE
    Alphafold docker image.
--output-dir OUTPUT_DIR, -o OUTPUT_DIR
    Output directory for results.
--use-gpu
    Enable NVIDIA runtime to run with GPUs.
--gpu-devices GPU_DEVICES
    Comma separated list of devices to pass to NVIDIA_VISIBLE_DEVICES.
--cpus CPUS, -c CPUS
    Number of CPUs to use.
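The --fasta-paths description states that every FASTA path must have a unique basename, since the basename names the output directory. A quick pre-flight check along these lines may catch collisions before a job is submitted. This is only a sketch: check_unique_basenames is a hypothetical helper, it takes the comma-separated form shown under How to Run, and it assumes the file extension is not part of the output directory name.

```shell
#!/bin/bash
# Hypothetical helper (not provided by the alphafold module).

# check_unique_basenames PATHS
# PATHS is a comma-separated list of FASTA files. Returns 1 and prints the
# offending names if two files share a basename (extension ignored, which
# is an assumption about how output directories are named).
check_unique_basenames() {
    dupes=$(printf '%s\n' "$1" \
        | tr ',' '\n' \
        | sed 's|.*/||; s|\.[^.]*$||' \
        | sort \
        | uniq -d)
    if [ -n "$dupes" ]; then
        echo "Duplicate FASTA basenames: $dupes" >&2
        return 1
    fi
    return 0
}
```

For example, check_unique_basenames "runA/example.fasta,runB/example.fasta" would report example as a duplicate, because both files would map to the same output directory name.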
Issues
- If you receive an error like
RuntimeError: HHSearch failed
you most likely need to increase the amount of memory you are reserving in your job script (the --mem value in the SLURM parameters).