Job Arrays
Revision as of 09:11, 21 March 2018
Introduction
Job Arrays allow you to run the same job on different datasets without having to create an individual job script for each dataset. This makes it easy to submit hundreds, or even thousands, of jobs, one for each dataset. A job array is requested with the #SBATCH --array parameter in the SBATCH script; within the script, the $SLURM_ARRAY_TASK_ID environment variable identifies which dataset the current task should use. The resources (number of processors, memory, etc.) you specify in the job script will be identical for each job in the array. More details on SLURM job arrays can be found at https://slurm.schedmd.com/job_array.html
--array parameter
- This will create 10 jobs with the $SLURM_ARRAY_TASK_ID iterating 1 through 10 (1,2,3,4,...,8,9,10)
#SBATCH --array 1-10
- This will create 10 jobs with the $SLURM_ARRAY_TASK_ID iterating 1 through 19 with step size 2 (1,3,5,7,...,17,19)
#SBATCH --array 1-20:2
- This will create 5 jobs with the $SLURM_ARRAY_TASK_ID set to the 5 specified values (1,5,10,15,20)
#SBATCH --array 1,5,10,15,20
- This will create 20 jobs with the $SLURM_ARRAY_TASK_ID iterating 1 through 20 (1,2,3,4,...,18,19,20) but only run 4 of the jobs at a time.
#SBATCH --array=1-20%4
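As a local sanity check (not part of any job script), you can preview which task IDs a first-last:step specification such as --array 1-20:2 will generate, since seq happens to use the same start/step/end semantics:

```shell
# Preview the task IDs that "--array 1-20:2" would produce:
# seq FIRST STEP LAST mirrors SLURM's first-last:step syntax
seq 1 2 20
```

This prints 1, 3, 5, ..., 19, one per line: the ten task IDs SLURM would assign.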
Example Script
This script will submit 10 jobs. Each job will do the following
- wait for 10 seconds (sleep 10)
- Output the hostname of the node it ran on (echo "Hostname: `hostname`")
- Output the $SLURM_ARRAY_TASK_ID
- The output file slurm-%A_%a.out will contain that information. %A is the SLURM job ID; %a is the value of $SLURM_ARRAY_TASK_ID
#!/bin/bash
# ----------------SBATCH Parameters----------------- #
#SBATCH -p normal
#SBATCH -n 1
#SBATCH -N 1
#SBATCH --mail-user youremail@illinois.edu
#SBATCH --mail-type BEGIN,END,FAIL
#SBATCH -J example_array
#SBATCH -D /home/a-m/USERNAME
#SBATCH -o /home/a-m/USERNAME/slurm-%A_%a.out
#SBATCH --array 1-10
# ----------------Your Commands------------------- #
sleep 10
echo "Hostname: `hostname`"
echo "Job Array Number: $SLURM_ARRAY_TASK_ID"
- 10 different output files will be created. The output in each one will look like below.
Hostname: compute-0-16
Job Array Number: 10
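As a local sketch of how %A and %a expand into the output filenames (the job ID 12345 here is hypothetical; SLURM substitutes the real job ID at run time):

```shell
# Hypothetical job ID; SLURM replaces %A with the real one
jobid=12345
# One output file is created per array task (%a)
for task in 1 2 3; do
    printf 'slurm-%s_%s.out\n' "$jobid" "$task"
done
```

Task 1 of job 12345 therefore writes to slurm-12345_1.out, task 2 to slurm-12345_2.out, and so on.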
Example - Ordered List
- Say you have a directory with input data files that are numbered sequentially like below.
- You want to run the program FastQC against these files.
- The input data is located in the input_data directory
- The results will be placed in the results directory
yeast_1_50K.fastq
yeast_2_50K.fastq
yeast_3_50K.fastq
yeast_4_50K.fastq
yeast_5_50K.fastq
yeast_6_50K.fastq
- The job array script would be like below.
#!/bin/bash
# ----------------SBATCH Parameters----------------- #
#SBATCH -p normal
#SBATCH -n 1
#SBATCH -N 1
#SBATCH --mail-user youremail@illinois.edu
#SBATCH --mail-type BEGIN,END,FAIL
#SBATCH -J example_array
#SBATCH -D /home/a-m/USERNAME
#SBATCH -o /home/a-m/USERNAME/slurm-%A_%a.out
#SBATCH --array 1-6
# ----------------Load Modules-------------------- #
module load FastQC/0.11.5-IGB-gcc-4.9.4-Java-1.8.0_152
# ----------------Your Commands------------------- #
echo "Starting FastQC Job"
fastqc -o results/ input_data/yeast_${SLURM_ARRAY_TASK_ID}_50K.fastq
echo "Finishing FastQC Job"
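To see how each array task maps to an input file, the filename construction can be simulated locally without SLURM; each value of $SLURM_ARRAY_TASK_ID selects exactly one fastq file:

```shell
# Simulate the filenames that tasks 1-6 would each operate on
for SLURM_ARRAY_TASK_ID in $(seq 1 6); do
    echo "input_data/yeast_${SLURM_ARRAY_TASK_ID}_50K.fastq"
done
```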
Example - Unordered List
- This is for a list of data files that do not have sequential numbers in the filename.
- The input data is located in the directory input_data
- The results will be located in the directory results
- This grabs the list of files from a directory, and then the $SLURM_ARRAY_TASK_ID is used to reference each file.
#!/bin/bash
# ----------------SBATCH Parameters----------------- #
#SBATCH -p normal
#SBATCH -n 1
#SBATCH -N 1
#SBATCH --mail-user youremail@illinois.edu
#SBATCH --mail-type BEGIN,END,FAIL
#SBATCH -J example_array
#SBATCH -D /home/a-m/USERNAME
#SBATCH -o /home/a-m/USERNAME/slurm-%A_%a.out
#SBATCH --array 1-6
# ----------------Load Modules-------------------- #
module load FastQC/0.11.5-IGB-gcc-4.9.4-Java-1.8.0_152
# ----------------Your Commands------------------- #
echo "Starting FastQC Job"
file=$(ls input_data/ | sed -n ${SLURM_ARRAY_TASK_ID}p)
fastqc -o results input_data/${file}
echo "Finishing FastQC Job"
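A note of caution: parsing ls output can mishandle unusual filenames. A sketch of an alternative that indexes a sorted bash glob instead (the demo directory and filenames below are made up purely for illustration; a real job would glob input_data/ and use the task ID SLURM provides):

```shell
# Demo setup only: in a real job the files already exist in input_data/
mkdir -p demo_input
touch demo_input/brain.fastq demo_input/heart.fastq demo_input/liver.fastq

# Demo value; in a real array task SLURM sets this variable for you
SLURM_ARRAY_TASK_ID=2

# Glob expansion is sorted, so each task ID maps to a stable filename
files=(demo_input/*.fastq)
file=${files[$((SLURM_ARRAY_TASK_ID - 1))]}   # bash arrays are 0-based; task IDs start at 1
echo "$file"
```

With three files sorted alphabetically, task 2 selects the second one, demo_input/heart.fastq.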
Effectively Using job_array_index
Say you have 650 datasets to analyze, but you can only submit 80 jobs at a time. Instead of submitting 80 jobs and waiting for them to finish, submit a single 80-element array job that can handle all of the datasets.
A simple formula for dividing and sending your datasets to your script is as follows:
data sets per job = ceiling ( number of datasets / number of job elements )
data sets per job = ceiling ( 650 / 80 ) = ceiling(8.125) = 9
So each of your 80 jobs is responsible for handling 9 datasets. Each time your job script runs, it needs to know its position in the list of datasets, which is $SLURM_ARRAY_TASK_ID, and the number of datasets per job (N). With those two values, the job can determine which datasets from the list it needs to process.
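The ceiling division above can be computed directly in bash's integer arithmetic by adding (divisor - 1) before dividing, for example in a submission helper:

```shell
datasets=650
elements=80
# ceiling(a / b) == (a + b - 1) / b in integer arithmetic
per_job=$(( (datasets + elements - 1) / elements ))
echo "$per_job"   # prints 9
```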
Here is some simple pseudocode for this situation. Note that task IDs start at 1, so we subtract 1 before multiplying; task 1 then covers lines 1 through N, task 2 covers lines N+1 through 2N, and so on.
data_sets_per_job = N
startLineNumber = (SLURM_ARRAY_TASK_ID - 1) * data_sets_per_job + 1
endLineNumber = SLURM_ARRAY_TASK_ID * data_sets_per_job
open list of data:
    go to startLineNumber
    while lineNumber <= endLineNumber:
        get dataset on current line
        do work with dataset
        go to next line
Putting it all together (example SBATCH submission with submission script, configuration file, and experiment script)
In order to use the following script, you will need to properly set
- --array (the number of array elements you want)
- itemsToProcess (the number of items in the job.conf list to pass into your script)
- Your script, modules, and custom settings
#!/bin/bash
# ----------------SBATCH Parameters----------------- #
#SBATCH -p normal
#SBATCH -n 1
#SBATCH -N 1
#SBATCH --mail-user youremail@illinois.edu
#SBATCH --mail-type BEGIN,END,FAIL
#SBATCH -J array_of_jobs
#SBATCH --array 1-10
#SBATCH -D /home/a-m/USERNAME
# ----------------Load Modules-------------------- #
module load BLAST+/2.6.0-IGB-gcc-4.9.4
# ----------------Your Commands------------------- #

# --EDIT HERE
itemsToProcess=10
jobList="job.conf"

#No need to edit this
taskID=$SLURM_ARRAY_TASK_ID
# Task IDs start at 1, so task 1 handles lines 1 through itemsToProcess
startLineNumber=$(( ($taskID - 1) * $itemsToProcess + 1 ))
endLineNumber=$(( $taskID * $itemsToProcess ))

#Grab an experiment from the job.conf file
for line in `seq $startLineNumber $endLineNumber`
do
    experiment=$( head -n $line $jobList | tail -n 1 )
    # --EDIT HERE
    echo blastall -i $experiment -o $experiment\.blast
done
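One way to build the job.conf list the script reads, one dataset path per line, and then submit the array job once (the demo directory, file extension, and submission script name here are hypothetical):

```shell
# Demo setup: create a couple of placeholder data files
mkdir -p demo_data
touch demo_data/exp1.fasta demo_data/exp2.fasta

# One dataset path per line; the job script indexes this list with head/tail
ls demo_data/*.fasta > job.conf
wc -l < job.conf

# Then submit the whole array with a single command:
#   sbatch example_array.sh
```

Each line number in job.conf corresponds to one dataset, which is exactly how `head -n $line $jobList | tail -n 1` picks out a single entry.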