Difference between revisions of "Job Arrays"
Revision as of 09:40, 14 March 2018
Introduction
Job arrays allow you to run the same job on different datasets without having to create an individual job script for each dataset, so you can easily submit jobs for hundreds or even thousands of datasets. This is accomplished by setting the #SBATCH --array parameter in the SBATCH script and then using the $SLURM_ARRAY_TASK_ID environment variable to select which dataset to use. The resources you specify in the job script (number of processors, memory, etc.) will be identical for every task in the job array.
--array parameter
Let's say you want to run 10 jobs. Instead of submitting 10 different jobs, you can submit one job that uses the --array parameter and the $SLURM_ARRAY_TASK_ID variable. You can read more about the --array parameter at https://slurm.schedmd.com/job_array.html
#SBATCH --array 1-10
The --array parameter sets the range of the $SLURM_ARRAY_TASK_ID variable. So setting it to
#SBATCH --array 1-4
will cause SLURM to run the job script 4 times, each time with $SLURM_ARRAY_TASK_ID set to the next value from 1 to 4.
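As a rough local illustration (not a substitute for submitting with sbatch), the sketch below imitates what happens for --array 1-4 by setting SLURM_ARRAY_TASK_ID by hand for each value. The run_task and simulate_array functions are hypothetical names used only for this example; run_task stands in for the body of your job script.

```shell
#!/bin/bash
# Hypothetical stand-in for the body of a job script.
run_task() {
    echo "Task ID: $SLURM_ARRAY_TASK_ID"
}

# Imitate --array FIRST-LAST by running the body once per task ID.
# On the cluster SLURM sets this variable itself and runs the tasks
# as separate jobs, not in a loop.
simulate_array() {
    local first=$1 last=$2 id
    for id in $(seq "$first" "$last"); do
        ( export SLURM_ARRAY_TASK_ID=$id; run_task )
    done
}

simulate_array 1 4
```

Running this prints one line per task, from "Task ID: 1" through "Task ID: 4".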
Example Script
This script will submit 10 jobs. Each job will do the following:
- Wait for 10 seconds (sleep 10)
- Output the hostname of the node it ran on (echo "Hostname: `hostname`")
- Output the $SLURM_ARRAY_TASK_ID
The output file slurm-%A_%a.out will contain that information. %A is the SLURM job number and %a is the SLURM array task ID.
#!/bin/bash
# ----------------SBATCH Parameters----------------- #
#SBATCH -p normal
#SBATCH -n 1
#SBATCH -N 1
#SBATCH --mail-user youremail@illinois.edu
#SBATCH --mail-type BEGIN,END,FAIL
#SBATCH -J example_array
#SBATCH -D /home/a-m/USERNAME
#SBATCH -o /home/a-m/USERNAME/slurm-%A_%a.out
#SBATCH --array 1-10
# ----------------Your Commands------------------- #
sleep 10
echo "Hostname: `hostname`"
echo "Job Array Number: $SLURM_ARRAY_TASK_ID"
The output of each task will look like the following:
Hostname: compute-0-16
Job Array Number: 10
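job.pl (Example Perl script)
#!/usr/bin/env perl
# This script outputs the job array element that has been passed in
use strict;
use warnings;

# The array task ID is passed in as the first command-line argument
my $array_task_id = shift @ARGV;
die "Usage: job.pl ARRAY_TASK_ID\n" unless defined $array_task_id;
my $experimentID = $array_task_id;

# Read the line of job.conf whose line number matches the task ID
my $experimentName = `head -n $array_task_id job.conf | tail -n1`;
chomp $experimentName;

print "This is job number $array_task_id\n";
print "About to perform experimentID: $experimentID experimentName: $experimentName\n";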
job.conf (example configuration file)
dataset0
dataset1
dataset2
dataset3
dataset4
dataset5
..
dataset650
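The lookup that job.pl does with head and tail can also be done directly in the job script. The following is a minimal sketch, assuming one dataset name per line; dataset_for_line is a hypothetical helper name, and the temporary file stands in for a real job.conf.

```shell
#!/bin/bash
# Print the dataset name stored on a given line of a list file.
# Usage: dataset_for_line LINE_NUMBER LIST_FILE
dataset_for_line() {
    local line_number=$1 list_file=$2
    # sed -n 'Np' prints only line N; it prints nothing if N is past the end.
    sed -n "${line_number}p" "$list_file"
}

# Small demonstration with a temporary list file (hypothetical contents).
demo_list=$(mktemp)
printf 'dataset0\ndataset1\ndataset2\n' > "$demo_list"

# Fall back to task ID 1 when run outside of SLURM.
task_id=${SLURM_ARRAY_TASK_ID:-1}
echo "Task $task_id would process: $(dataset_for_line "$task_id" "$demo_list")"
```

Note that line numbers start at 1, so with a list that begins at dataset0, array task 1 picks up dataset0.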
Effectively Using job_array_index
You have 650 datasets you want to analyze, but you can only submit 80 jobs at a time. Instead of submitting 80 jobs and waiting for them to finish before submitting more, submit a single 80-element array job that can handle all of the datasets.
A simple formula for dividing and sending your datasets to your script is as follows:
data sets per job = ceiling( number of datasets / number of job elements )
data sets per job = ceiling( 650 / 80 ) = ceiling( 8.125 ) = 9
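Shell arithmetic only does integer division, so the ceiling has to be computed by hand. A minimal sketch follows; ceil_div is a hypothetical helper name, and the two counts are the ones from the example above.

```shell
#!/bin/bash
# Integer ceiling of a / b using only shell arithmetic.
ceil_div() {
    local a=$1 b=$2
    echo $(( (a + b - 1) / b ))
}

number_of_datasets=650
number_of_job_elements=80
datasets_per_job=$(ceil_div "$number_of_datasets" "$number_of_job_elements")
echo "data sets per job = $datasets_per_job"
```

Adding b - 1 before dividing rounds any non-zero remainder up to the next whole number, which is why 650 / 80 gives 9 rather than 8.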
So that means that each of your 80 jobs is responsible for handling up to 9 datasets. Each time your job script runs, it needs two values: its position in the list of datasets, which comes from $SLURM_ARRAY_TASK_ID, and the number of data sets per job (N). That way, the job can determine which datasets from the list it needs to process.
Here is some simple pseudo code for this situation, assuming the array indices start at 1:
data_sets_per_job = N
startLineNumber = ($SLURM_ARRAY_TASK_ID - 1) * data_sets_per_job + 1
endLineNumber = startLineNumber + data_sets_per_job - 1
open list of data:
    for lineNumber from startLineNumber to endLineNumber:
        if lineNumber is past the end of the list, stop
        get dataset on lineNumber
        do work with dataset
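The line-range calculation can be sketched as a small shell function. This assumes array indices start at 1 (as with --array 1-80) and that the list has one dataset per line; task_line_range is a hypothetical name used only for illustration.

```shell
#!/bin/bash
# Print "start end" (1-based, inclusive line numbers) for one array task.
# Prints nothing when the task has no lines left to process.
# Usage: task_line_range TASK_ID DATASETS_PER_JOB TOTAL_DATASETS
task_line_range() {
    local task_id=$1 per_job=$2 total=$3
    local start=$(( (task_id - 1) * per_job + 1 ))
    local end=$(( start + per_job - 1 ))
    # Clamp the last chunk to the end of the list.
    if (( end > total )); then end=$total; fi
    if (( start <= total )); then echo "$start $end"; fi
}

task_line_range 1 9 650
task_line_range 73 9 650
task_line_range 80 9 650
```

With 650 datasets and 9 per job, task 1 gets lines 1 to 9, task 73 gets the last two lines (649 to 650), and tasks 74 through 80 have nothing left to do, so the function prints nothing for them.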
Putting it all together (Example SBATCH submission with submission script, configuration file, and experiment script)
In order to use the following script, you will need to properly set:
- --array (the number of array elements you want)
- itemsToProcess (the number of items in the job.conf list to pass into your script)
- Your script, modules, and custom settings
#!/bin/bash
# ----------------SBATCH Parameters----------------- #
#SBATCH -p normal
#SBATCH -n 1
#SBATCH -N 1
#SBATCH --mail-user youremail@illinois.edu
#SBATCH --mail-type BEGIN,END,FAIL
#SBATCH -J array_of_jobs
#SBATCH --array 1-10
#SBATCH -D /home/a-m/USERNAME
# ----------------Load Modules-------------------- #
module load BLAST+/2.6.0-IGB-gcc-4.9.4
# ----------------Your Commands------------------- #
# --EDIT HERE
itemsToProcess=10
jobList="job.conf"

# No need to edit this
taskID=$SLURM_ARRAY_TASK_ID
# Array indices start at 1, so task 1 handles lines 1 to itemsToProcess
startLineNumber=$(( (taskID - 1) * itemsToProcess + 1 ))
endLineNumber=$(( startLineNumber + itemsToProcess - 1 ))

# Grab each experiment for this task from the job.conf file
for line in $(seq "$startLineNumber" "$endLineNumber")
do
    experiment=$( head -n "$line" "$jobList" | tail -n 1 )
    # --EDIT HERE (echo only prints the command; remove it to run the command)
    echo blastall -i "$experiment" -o "$experiment.blast"
done