Functions and GNU parallel for effective cluster load management

April 24, 2015

I've been a fan of GNU parallel for a long time. Initially I was sceptical about using it, preferring to write huge for loops, but over time I've grown to love it. The beauty of GNU parallel is that it runs a specified number of jobs in parallel and starts a new one whenever a running job finishes. This keeps the CPUs busy without overloading the system. There are many excuses for not using it, but perhaps the only valid one is that you already have Sun Grid Engine or another job scheduler in place.

GNU parallel is particularly useful when used with functions. Functions are subroutines that may be repeated many times to complete a piece of work. Here is a simple bash example that declares a function consisting of a chain of piped commands and then runs it on every file matching *files.txt, up to four jobs at a time.


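#!/bin/bash

# Run one input file through a chain of piped commands
my_func() {
    INPUT=$1
    VAR1=bar
    cmd1 "$INPUT" $VAR1 | cmd2 | cmd3 > "${INPUT}.out"
}

# Export the function so the shells started by parallel can see it
export -f my_func
parallel -j4 my_func ::: *files.txt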
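
The export -f line is easy to forget. GNU parallel starts each job in a new shell, and a bash function is only visible there if it has been exported. The following is a minimal sketch of that behaviour using plain bash child processes rather than parallel itself; shout is a hypothetical toy function standing in for a real pipeline.

```shell
#!/bin/bash

# Hypothetical toy function standing in for a real pipeline
shout() {
    echo "$1" | tr '[:lower:]' '[:upper:]'
}

# Before export -f, a child bash process does not know the function
bash -c 'shout hello' 2>/dev/null || echo "shout is not defined in the child shell"

# After export -f, the definition is passed on through the environment
export -f shout
bash -c 'shout hello'
```

The first call fails and prints the fallback message. The second call succeeds because the child shell inherits the exported definition, which is the same mechanism parallel relies on.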

Nice, but now for something relevant to bioinformatics. Here is a bwa mem wrapper that picks up all fastq files in the current working directory (including gzip and bzip2 compressed ones) and processes them in parallel, four at a time. Because each bwa job uses up to 4 threads, the alignment step will use at most about 16 cores (4 jobs x 4 threads).


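#!/bin/bash

# Align one fastq file (plain, gzip or bzip2) with bwa mem,
# then write a coordinate-sorted, indexed BAM file.
bwamem() {
    BWA=/usr/local/bin/bwa
    REF=Arabidopsis_thaliana.TAIR10.23.dna_sm.genome.fa
    FQZ=$1
    # 1 if the file name ends in .gz or .bz2, otherwise 0
    GZ=`echo "$FQZ" | grep -c '\.gz$'`
    BZ2=`echo "$FQZ" | grep -c '\.bz2$'`
    CPU=4    # threads per bwa job

    # Note: "samtools sort - PREFIX" is the older samtools syntax and
    # writes PREFIX.bam; newer releases expect an explicit -o option.
    if [ "$GZ" -eq "1" ] ; then
        FQ=`echo "$FQZ" | sed 's/\.gz$//'`
        pigz -dc "$FQZ" \
        | $BWA mem -t $CPU $REF - \
        | samtools view -uSh - \
        | samtools sort - "${FQ}_bwamem.sort"
    elif [ "$BZ2" -eq "1" ] ; then
        FQ=`echo "$FQZ" | sed 's/\.bz2$//'`
        pbzip2 -dc "$FQZ" \
        | $BWA mem -t $CPU $REF - \
        | samtools view -uSh - \
        | samtools sort - "${FQ}_bwamem.sort"
    else
        echo input file not compressed
        # Check whether the first line looks like a fasta or fastq header
        IS_FA=`head -1 "$FQZ" | grep -c '^>'`
        IS_FQ=`head -1 "$FQZ" | grep -c '^@'`
        if [ "$IS_FQ" -eq "1" -o "$IS_FA" -eq "1" ] ; then
            FQ=$FQZ
            $BWA mem -t $CPU $REF "$FQ" \
            | samtools view -uSh - \
            | samtools sort - "${FQ}_bwamem.sort"
        else
            echo Error. Unknown file format. Exiting.
            # Stop here so that samtools index is not run on a missing file
            return 1
        fi
    fi

    samtools index "${FQ}_bwamem.sort.bam"
}

export -f bwamem

parallel -j4 bwamem ::: *fastq.gz *fastq.bz2 *.fq *fastq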
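
The -j4 above is hard-coded for a machine with 16 cores. On other machines the same arithmetic can be done in the script. This is only a sketch: jobs_for is a hypothetical helper, and it assumes nproc (GNU coreutils) is available to report the core count, falling back to 1 if it is not.

```shell
#!/bin/bash

# jobs_for TOTAL_CORES THREADS_PER_JOB
# Hypothetical helper: how many jobs fit into the core budget at once
jobs_for() {
    local total=$1 threads=$2
    local jobs=$(( total / threads ))
    # Always allow at least one job, even with fewer cores than threads
    if [ "$jobs" -lt 1 ]; then jobs=1; fi
    echo "$jobs"
}

CPU=4                                  # threads per bwa job, as in the wrapper
TOTAL=$(nproc 2>/dev/null || echo 1)   # assumption: nproc exists; else use 1 core
JOBS=$(jobs_for "$TOTAL" "$CPU")
echo "running $JOBS jobs x $CPU threads on $TOTAL cores"

# The value could then replace the hard-coded -j4:
# parallel -j"$JOBS" bwamem ::: *fastq.gz *fastq.bz2 *.fq *fastq
```

Integer division rounds down, so a 10-core machine would run two 4-thread jobs and leave two cores free for decompression and sorting.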

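If you need to do something more sophisticated, GNU parallel can also spread jobs across a cluster, transferring environment variables and functions along with them. If you have servers that accept ssh logins without a password, list them with -S and parallel will send jobs to them. Exported functions are not copied to the remote side automatically, so name the function with --env as well. Note that -j4 then applies to each server. This assumes that the fastq files, the reference and the tools (bwa, samtools, pigz) are available at the same paths on every server, for example through a shared filesystem.


parallel -j4 --env bwamem -S server1,server2,server3 bwamem ::: *fastq.gz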

Further watching/reading:
https://www.youtube.com/watch?v=OpaiGYxkSuQ
http://plindenbaum.blogspot.com.au/2013/10/gnu-parallel-for-bioinformatics-my.html
https://www.gnu.org/software/parallel/man.html
http://en.wikipedia.org/wiki/Oracle_Grid_Engine
http://gridscheduler.sourceforge.net/htmlman/htmlman1/qsub.html
