Job Arrays

From Carl R. Woese Institute for Genomic Biology - University of Illinois Urbana-Champaign

Introduction

Job Arrays allow you to run the same job on different datasets without having to create an individual job script for each dataset. This makes it easy to submit jobs for hundreds, even thousands, of different datasets. It is accomplished by adding the #SBATCH --array parameter to the SBATCH script and then using the $SLURM_ARRAY_TASK_ID environment variable to select which dataset each task should use. The resources (number of processors, memory, etc.) you specify in the job script will be identical for each job in the array. More details on SLURM job arrays can be found at https://slurm.schedmd.com/job_array.html
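
A job array is submitted with a single sbatch call, and SLURM expands it into one queued task per array index. A minimal sketch of submitting and checking an array (the script name and job ID below are illustrative):
sbatch array_job.sh         # the script contains "#SBATCH --array 1-10"
squeue -u $USER             # each task appears in the queue as JOBID_TASKID
#   12345_1    normal  array_job ...
#   12345_2    normal  array_job ...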

--array parameter

  • This will create 10 jobs with $SLURM_ARRAY_TASK_ID iterating from 1 through 10 (1,2,3,...,9,10)
#SBATCH --array 1-10
  • This will create 10 jobs with $SLURM_ARRAY_TASK_ID iterating from 1 through 19 with a step size of 2 (1,3,5,7,...,17,19)
#SBATCH --array 1-20:2
  • This will create 5 jobs with the $SLURM_ARRAY_TASK_ID set to the 5 specified values (1,5,10,15,20)
#SBATCH --array 1,5,10,15,20
  • This will create 20 jobs with $SLURM_ARRAY_TASK_ID iterating from 1 through 20 (1,2,3,...,19,20), but only run 4 of the jobs at a time.
#SBATCH --array=1-20%4
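  • The array range can also be supplied on the sbatch command line instead of inside the script, which is handy for re-running a subset of tasks. A minimal sketch (the script name is illustrative).
sbatch --array=1-10 myscript.sh
sbatch --array=3,7 myscript.sh    # re-run only tasks 3 and 7, e.g. after failures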

Example Script

This script will submit 10 jobs. Each job will do the following:

  • wait for 10 seconds (sleep 10)
  • Output the hostname of the node it ran on (echo "Hostname: `hostname`")
  • Output the $SLURM_ARRAY_TASK_ID
  • The output file slurm-%A_%a.out will have that information. %A is the SLURM job number. %a is the value of $SLURM_ARRAY_TASK_ID
#!/bin/bash
# ----------------SBATCH Parameters----------------- #
#SBATCH -p normal
#SBATCH -n 1
#SBATCH -N 1
#SBATCH --mail-user youremail@illinois.edu
#SBATCH --mail-type BEGIN,END,FAIL 
#SBATCH -J example_array
#SBATCH -D /home/a-m/USERNAME
#SBATCH -o /home/a-m/USERNAME/slurm-%A_%a.out
#SBATCH --array 1-10

# ----------------Your Commands------------------- #

sleep 10
echo "Hostname: `hostname`"
echo "Job Array Number: $SLURM_ARRAY_TASK_ID"

  • 10 different output files will be created. The output in each one will look like the example below.
Hostname: compute-0-16
Job Array Number: 10
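
  • Individual array tasks can be managed by their JOBID_TASKID name. A minimal sketch (the job ID is illustrative).
scancel 12345_3    # cancel only task 3 of the array
scancel 12345      # cancel every task in the array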

Example - Ordered List

  • Say you have a directory of input data files that are numbered sequentially, as shown below.
  • You want to run the program FastQC against these files.
  • The input data is located in the directory input_data
  • The results will be placed in the directory results
yeast_1_50K.fastq
yeast_2_50K.fastq
yeast_3_50K.fastq
yeast_4_50K.fastq
yeast_5_50K.fastq
yeast_6_50K.fastq
  • The job array script would look like the one below.
#!/bin/bash
# ----------------SBATCH Parameters----------------- #
#SBATCH -p normal
#SBATCH -n 1
#SBATCH -N 1
#SBATCH --mail-user youremail@illinois.edu
#SBATCH --mail-type BEGIN,END,FAIL 
#SBATCH -J example_array
#SBATCH -D /home/a-m/USERNAME
#SBATCH -o /home/a-m/USERNAME/slurm-%A_%a.out
#SBATCH --array 1-6
# ----------------Load Modules--------------------
module load FastQC/0.11.5-IGB-gcc-4.9.4-Java-1.8.0_152
# ----------------Your Commands------------------- #

echo "Starting FastQC Job"
fastqc -o results/ input_data/yeast_${SLURM_ARRAY_TASK_ID}_50K.fastq
echo "Finishing FastQC Job"

Example - Unordered List

  • This is for a list of data files that do not have sequential numbers in the filename.
  • The input data is located in the directory input_data
  • The results will be located in the directory results
  • This grabs a file from the input_data directory (ls input_data/ | sed -n ${SLURM_ARRAY_TASK_ID}p) and stores it in the variable $file.
  • Then it runs fastqc on the file named by the $file variable.
#!/bin/bash
# ----------------SBATCH Parameters----------------- #
#SBATCH -p normal
#SBATCH -n 1
#SBATCH -N 1
#SBATCH --mail-user youremail@illinois.edu
#SBATCH --mail-type BEGIN,END,FAIL 
#SBATCH -J example_array
#SBATCH -D /home/a-m/USERNAME
#SBATCH -o /home/a-m/USERNAME/slurm-%A_%a.out
#SBATCH --array 1-6
# ----------------Load Modules--------------------
module load FastQC/0.11.5-IGB-gcc-4.9.4-Java-1.8.0_152
# ----------------Your Commands------------------- #
echo "Starting FastQC Job"
file=$(ls input_data/ | sed -n ${SLURM_ARRAY_TASK_ID}p)
fastqc -o results input_data/${file}
echo "Finishing FastQC Job"

Example - Combining jobs

  • Suppose you have 650 datasets to analyze, but you can only submit 80 jobs at a time. Instead of submitting 80 jobs and waiting for them to finish before submitting the next batch, you can submit a single job array that handles all of the datasets.
  • You need to divide your datasets into groups. Below is example bash code that does this.
  • First, all of your input datasets need to be in a folder called input_data
  • The variable $ITEMS_TO_PROCESS specifies how many datasets each array task should process
  • The line JOBLIST=$(ls input_data/) stores the list of files in the variable $JOBLIST
  • The script then uses $SLURM_ARRAY_TASK_ID and $ITEMS_TO_PROCESS to calculate $START_LINE and $END_LINE, the range of lines in that list this task will handle. Task IDs start at 1, so the first task gets lines 1 through $ITEMS_TO_PROCESS.
ITEMS_TO_PROCESS=10
JOBLIST=$(ls input_data/)
START_LINE=$(( (${SLURM_ARRAY_TASK_ID} - 1) * ${ITEMS_TO_PROCESS} + 1 ))
END_LINE=$(( ${START_LINE} + ${ITEMS_TO_PROCESS} - 1 ))
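
  • To check the arithmetic, this quick loop (runnable in any shell, no SLURM needed) prints the line range each task would own with ITEMS_TO_PROCESS=10.
for SLURM_ARRAY_TASK_ID in 1 2 3; do
    START_LINE=$(( (SLURM_ARRAY_TASK_ID - 1) * 10 + 1 ))
    END_LINE=$(( START_LINE + 10 - 1 ))
    echo "task ${SLURM_ARRAY_TASK_ID}: lines ${START_LINE}-${END_LINE}"
done
# task 1: lines 1-10
# task 2: lines 11-20
# task 3: lines 21-30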

  • Below is the full script.
  • In order to use the following script, you will need to properly set:
  • '--array' (the number of array tasks you need: ceiling(number of datasets / ITEMS_TO_PROCESS), so 650 datasets at 10 per task needs --array 1-65)
  • ITEMS_TO_PROCESS (how many datasets each task processes)
#!/bin/bash
# ----------------SBATCH Parameters----------------- #
#SBATCH -p normal
#SBATCH -n 1
#SBATCH -N 1
#SBATCH --mail-user youremail@illinois.edu
#SBATCH --mail-type BEGIN,END,FAIL
#SBATCH -J array_of_jobs
#SBATCH --array 1-65
#SBATCH -D /home/a-m/USERNAME

# ----------------Load Modules-------------------- #
module load FastQC/0.11.5-IGB-gcc-4.9.4-Java-1.8.0_152
# ----------------Your Commands------------------- #

ITEMS_TO_PROCESS=10
JOBLIST=$(ls input_data/)

# First line this task is responsible for (task IDs start at 1)
START_LINE=$(( (${SLURM_ARRAY_TASK_ID} - 1) * ${ITEMS_TO_PROCESS} + 1 ))
# Last line this task is responsible for
END_LINE=$(( ${START_LINE} + ${ITEMS_TO_PROCESS} - 1 ))


# Iterate over the lines assigned to this task
for line in `seq ${START_LINE} ${END_LINE}`
do
    DATA_FILE=$( echo "${JOBLIST}" | sed -n "${line}p" )
    fastqc -o results input_data/${DATA_FILE}
done
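
  • If the number of datasets is not an exact multiple of ITEMS_TO_PROCESS, the last task's END_LINE will point past the end of the list. sed simply returns nothing for those lines, but an explicit clamp before the loop makes that case obvious. A minimal sketch:
# Clamp END_LINE so the last task stops at the final file
TOTAL_FILES=$(ls input_data/ | wc -l)
if [ "${END_LINE}" -gt "${TOTAL_FILES}" ]; then
    END_LINE=${TOTAL_FILES}
fi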

Resources

https://slurm.schedmd.com/job_array.html