Job Arrays

From Carl R. Woese Institute for Genomic Biology - University of Illinois Urbana-Champaign

Latest revision as of 12:00, 2 October 2019

Introduction

Job arrays allow you to run the same job on different datasets without having to create an individual job script for each dataset. This makes it easy to submit hundreds or even thousands of jobs, one per dataset. This is accomplished by using the #SBATCH --array parameter in the SBATCH script and then using the $SLURM_ARRAY_TASK_ID environment variable to select which dataset each job uses. The resources (number of processors, memory, etc.) you specify in the job script will be identical for each job in the array. More details on SLURM job arrays can be found at https://slurm.schedmd.com/job_array.html

--array parameter

  • This will create 10 jobs with the $SLURM_ARRAY_TASK_ID iterating 1 through 10 (1,2,3,4,...,8,9,10)
#SBATCH --array 1-10
  • This will create 10 jobs with the $SLURM_ARRAY_TASK_ID iterating 1 through 19 with step size 2 (1,3,5,7,...,17,19)
#SBATCH --array 1-20:2
  • This will create 5 jobs with the $SLURM_ARRAY_TASK_ID set to the 5 specified values (1,5,10,15,20)
#SBATCH --array 1,5,10,15,20
  • This will create 20 jobs with the $SLURM_ARRAY_TASK_ID iterating 1 through 20 (1,2,3,4,...,18,19,20) but only run 4 of the jobs at a time.
#SBATCH --array=1-20%4
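
As a rough way to check a range expression before submitting, the sketch below expands the simple forms shown above (ranges, steps, comma lists, and the %N limit) into the task IDs they would produce. The function name expand_array_spec is hypothetical and is not part of SLURM; it only mirrors the examples on this page.

```shell
#!/bin/bash
# expand_array_spec SPEC: hypothetical helper that prints one task ID per line
# for a simple --array value such as 1-10, 1-20:2, 1,5,10,15,20 or 1-20%4.
expand_array_spec() {
    local spec="${1%%\%*}"      # drop an optional "%N" concurrency limit
    local parts part range step
    IFS=',' read -r -a parts <<< "$spec"
    for part in "${parts[@]}"; do
        step=1
        range="$part"
        if [[ "$part" == *:* ]]; then
            step="${part##*:}"   # text after the colon is the step size
            range="${part%%:*}"
        fi
        if [[ "$range" == *-* ]]; then
            seq "${range%%-*}" "$step" "${range##*-}"
        else
            echo "$range"        # a single value, e.g. the 5 in 1,5,10
        fi
    done
}

expand_array_spec "1-20:2" | xargs   # prints: 1 3 5 7 9 11 13 15 17 19
```

The %N suffix only limits how many jobs run at once, so it does not change the list of task IDs.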

Example Script

This script will submit 10 jobs. Each job will do the following:

  • Wait for 10 seconds (sleep 10)
  • Output the hostname of the node it ran on (echo "Hostname: `hostname`")
  • Output the $SLURM_ARRAY_TASK_ID
  • The output file slurm-%A_%a.out will have that information. %A is the SLURM job ID of the array as a whole. %a is the value of $SLURM_ARRAY_TASK_ID
#!/bin/bash
# ----------------SBATCH Parameters----------------- #
#SBATCH -p normal
#SBATCH -n 1
#SBATCH -N 1
#SBATCH --mail-user youremail@illinois.edu
#SBATCH --mail-type BEGIN,END,FAIL 
#SBATCH -J example_array
#SBATCH -D /home/a-m/USERNAME
#SBATCH -o /home/a-m/USERNAME/slurm-%A_%a.out
#SBATCH --array 1-10

# ----------------Your Commands------------------- #

sleep 10
echo "Hostname: `hostname`"
echo "Job Array Number: $SLURM_ARRAY_TASK_ID"

  • 10 different output files will be created, one per array job. The output in each one will look like the example below, which is from the job with $SLURM_ARRAY_TASK_ID 10
Hostname: compute-0-16
Job Array Number: 10
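
To find the outputs afterwards, it can help to know the file names in advance. The sketch below lists them for the -o pattern slurm-%A_%a.out, assuming the array was given the made-up job ID 12345 (SLURM reports the real ID when you run sbatch). The function name expected_outputs is hypothetical.

```shell
#!/bin/bash
# expected_outputs JOB_ID FIRST LAST: print the output file names that the
# pattern slurm-%A_%a.out would produce for array tasks FIRST..LAST.
expected_outputs() {
    local job_id="$1" first="$2" last="$3" task_id
    for task_id in $(seq "$first" "$last"); do
        echo "slurm-${job_id}_${task_id}.out"
    done
}

# 12345 is a placeholder job ID, not a real one
expected_outputs 12345 1 10
```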

Example - Ordered List

  • Say you have a directory with input data files that are numbered sequentially like below.
  • You want to run the program FastQC against these files.
  • The input data is located in the directory input_data
  • The results will be placed in the directory results
yeast_1_50K.fastq
yeast_2_50K.fastq
yeast_3_50K.fastq
yeast_4_50K.fastq
yeast_5_50K.fastq
yeast_6_50K.fastq
  • The job array script would be like below.
#!/bin/bash
# ----------------SBATCH Parameters----------------- #
#SBATCH -p normal
#SBATCH -n 1
#SBATCH -N 1
#SBATCH --mail-user youremail@illinois.edu
#SBATCH --mail-type BEGIN,END,FAIL 
#SBATCH -J example_array
#SBATCH -D /home/a-m/USERNAME
#SBATCH -o /home/a-m/USERNAME/slurm-%A_%a.out
#SBATCH --array 1-6
# ----------------Load Modules--------------------
module load FastQC/0.11.5-IGB-gcc-4.9.4-Java-1.8.0_152
# ----------------Your Commands------------------- #

echo "Starting FastQC Job"
fastqc -o results/ input_data/yeast_${SLURM_ARRAY_TASK_ID}_50K.fastq
echo "Finishing FastQC Job"
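
Before submitting, it may be worth confirming that every task ID in the --array range maps to a file that exists. The sketch below is one way to do that check on the login node; the function name check_inputs is hypothetical, and the file name pattern is the yeast_N_50K.fastq pattern from this example.

```shell
#!/bin/bash
# check_inputs DIR FIRST LAST: report any yeast_<id>_50K.fastq file that is
# missing for task IDs FIRST..LAST. Returns 0 only if all files are present.
check_inputs() {
    local dir="$1" first="$2" last="$3" id file status=0
    for id in $(seq "$first" "$last"); do
        file="${dir}/yeast_${id}_50K.fastq"
        if [ ! -f "$file" ]; then
            echo "missing: $file"
            status=1
        fi
    done
    return "$status"
}

# Usage: run from the directory that holds input_data before calling sbatch
check_inputs input_data 1 6 || echo "fix the inputs or the --array range first"
```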

Example - Unordered List

  • This is for a list of data files that do not have sequential numbers in the filename.
  • The input data is located in the directory input_data
  • The results will be located in the directory results
  • The command ls input_data/ | sed -n ${SLURM_ARRAY_TASK_ID}p picks the Nth file name from the listing of the input_data directory, where N is the array task ID. The script stores that name in the variable $file.
  • The script then runs fastqc on the file named in $file
#!/bin/bash
# ----------------SBATCH Parameters----------------- #
#SBATCH -p normal
#SBATCH -n 1
#SBATCH -N 1
#SBATCH --mail-user youremail@illinois.edu
#SBATCH --mail-type BEGIN,END,FAIL 
#SBATCH -J example_array
#SBATCH -D /home/a-m/USERNAME
#SBATCH -o /home/a-m/USERNAME/slurm-%A_%a.out
#SBATCH --array 1-6
# ----------------Load Modules--------------------
module load FastQC/0.11.5-IGB-gcc-4.9.4-Java-1.8.0_152
# ----------------Your Commands------------------- #
echo "Starting FastQC Job"
file=$(ls input_data/ | sed -n ${SLURM_ARRAY_TASK_ID}p)
fastqc -o results input_data/${file}
echo "Finishing FastQC Job"
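
Because the task ID is only a position in the directory listing, it can be useful to preview which file each task will receive and to confirm that the --array range matches the number of files. A minimal sketch, using the same ls and sed approach as the script above; the function name list_task_files is hypothetical.

```shell
#!/bin/bash
# list_task_files DIR: print "task ID -> file name" in the order that
# ls DIR | sed -n <ID>p would select the files inside the job array.
list_task_files() {
    local dir="$1" count id
    count=$(( $(ls "$dir" | wc -l) ))   # number of files = last task ID needed
    for id in $(seq 1 "$count"); do
        echo "${id} -> $(ls "$dir" | sed -n "${id}p")"
    done
}

# Example (run where input_data exists): list_task_files input_data
```

The mapping is only stable while the contents of the directory stay the same, so avoid adding or removing files in input_data while the array is queued or running.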

Example - Combining jobs

  • What if you have 650 datasets you want to analyze, but you can only submit 80 jobs at a time? Instead of submitting 80 jobs and waiting for them to finish before submitting the next batch, submit a single job array in which each array job handles a group of datasets.
  • You need to divide your datasets into groups. Below is example bash code that does this.
  • First, all of your input datasets need to be in a folder called input_data
  • The variable $ITEMS_TO_PROCESS specifies how many datasets are in each group
  • The line JOBLIST=$(ls input_data/) stores the list of files in the variable $JOBLIST
  • Then $SLURM_ARRAY_TASK_ID and $ITEMS_TO_PROCESS are used to calculate $START_LINE and $END_LINE, the first and last lines of $JOBLIST that belong to this array job's group
ITEMS_TO_PROCESS=10
JOBLIST=$(ls input_data/)
# Array task IDs start at 1, so task 1 covers lines 1-10, task 2 covers lines 11-20, and so on
START_LINE=$(( (SLURM_ARRAY_TASK_ID - 1) * ITEMS_TO_PROCESS + 1 ))
END_LINE=$(( START_LINE + ITEMS_TO_PROCESS - 1 ))
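
The grouping arithmetic can be checked on its own before it goes into a job script. The sketch below assumes that array task IDs start at 1 and that the first group should begin at line 1 of the file list, so task N covers lines (N - 1) * ITEMS_TO_PROCESS + 1 through N * ITEMS_TO_PROCESS. The function name group_range is hypothetical.

```shell
#!/bin/bash
# group_range TASK_ID ITEMS: print the first and last line numbers of the
# file list handled by array task TASK_ID when each group holds ITEMS files.
group_range() {
    local task_id="$1" items="$2"
    local start=$(( (task_id - 1) * items + 1 ))
    local end=$(( start + items - 1 ))
    echo "${start} ${end}"
}

for id in 1 2 65; do
    echo "task ${id}: lines $(group_range "$id" 10)"
done
# prints:
# task 1: lines 1 10
# task 2: lines 11 20
# task 65: lines 641 650
```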

  • Below is the full script
  • In order to use the following script, you will need to properly set:
  • --array (the number of array elements you want; the number of elements times ITEMS_TO_PROCESS must be at least the number of input files)
  • ITEMS_TO_PROCESS (the size of each group of jobs)
#!/bin/bash
# ----------------SBATCH Parameters----------------- #
#SBATCH -p normal
#SBATCH -n 1
#SBATCH -N 1
#SBATCH --mail-user youremail@illinois.edu
#SBATCH --mail-type BEGIN,END,FAIL
#SBATCH -J array_of_jobs
#SBATCH --array 1-10
#SBATCH -D /home/a-m/USERNAME

# ----------------Load Modules-------------------- #
module load FastQC/0.11.5-IGB-gcc-4.9.4-Java-1.8.0_152
# ----------------Your Commands------------------- #

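ITEMS_TO_PROCESS=10
JOBLIST=$(ls input_data/)

# Array task IDs start at 1, so task 1 covers lines 1-10, task 2 covers lines 11-20, and so on
START_LINE=$(( (SLURM_ARRAY_TASK_ID - 1) * ITEMS_TO_PROCESS + 1 ))
END_LINE=$(( START_LINE + ITEMS_TO_PROCESS - 1 ))

# Iterate through START_LINE to END_LINE
for line in $(seq ${START_LINE} ${END_LINE})
do
    # Pick line number $line out of the file list
    DATA_FILE=$(echo "${JOBLIST}" | sed -n "${line}p")
    # The last group may have fewer files than ITEMS_TO_PROCESS
    if [ -z "${DATA_FILE}" ]; then
        break
    fi
    fastqc -o results input_data/${DATA_FILE}
done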
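
To choose the --array range for a given number of datasets, divide the number of files by the group size and round up. A minimal sketch of that calculation for the 650-dataset example, with a hypothetical function name array_size:

```shell
#!/bin/bash
# array_size NUM_FILES ITEMS: number of array elements needed so that groups
# of ITEMS files cover NUM_FILES files (integer division, rounded up).
array_size() {
    local num_files="$1" items="$2"
    echo $(( (num_files + items - 1) / items ))
}

echo "#SBATCH --array 1-$(array_size 650 10)"   # prints: #SBATCH --array 1-65
```

With 650 files and ITEMS_TO_PROCESS=10 this gives 65 array elements, which stays under the 80-job limit mentioned above. For a real directory, the file count could be taken from ls input_data/ | wc -l.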

Resources

https://slurm.schedmd.com/job_array.html