Demultiplexing Illumina Sequence data

November 08, 2012

Demultiplexing is a key step in many sequencing-based applications, but it isn't always necessary, as newer versions of the Illumina pipeline software provide demultiplexed data as standard. If you do need to do it yourself, here is an example using fastx_toolkit, designed for sequence data with a 6 nt barcode (Illumina barcode sequences 1-12). After a run, the Genome Analyzer software provides sequence files like this for read 1 (the insert sequence):

FC123_1_1_sequence.txt

And for the barcode/index read:

FC123_1_2_sequence.txt
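
The script further down pairs these two files line by line, so it can be worth confirming first that they have the same number of lines. Here is a minimal sketch of such a check. The helper name check_pair is my own invention for illustration and is not part of fastx_toolkit, and it assumes plain four-line FASTQ records.

```shell
# check_pair is a hypothetical helper, not part of fastx_toolkit.
# It compares the line counts of a read 1 file and its index read file.
check_pair() {
    # tr strips the padding that some wc implementations add
    n1=$(wc -l < "$1" | tr -d ' ')
    n2=$(wc -l < "$2" | tr -d ' ')
    if [ "$n1" -ne "$n2" ]; then
        echo "Line count mismatch: $1 has $n1 lines, $2 has $n2 lines" >&2
        return 1
    fi
    # FASTQ records are 4 lines each, so the total should divide by 4
    if [ $((n1 % 4)) -ne 0 ]; then
        echo "$1 has $n1 lines, which is not a multiple of 4" >&2
        return 1
    fi
    return 0
}

# Example call using the file names above:
# check_pair FC123_1_1_sequence.txt FC123_1_2_sequence.txt
```

The function returns non-zero on a mismatch, so it can be used to skip a lane rather than demultiplex it with misaligned barcodes.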

So here goes:

#Enter dataset parameters

FC='FC123 FC124'

LANES='1 2 3 4 5 6 7 8'

#Create the bcfile
echo 'BC01_ATCACG ATCACG
BC02_CGATGT CGATGT
BC03_TTAGGC TTAGGC
BC04_TGACCA TGACCA
BC05_ACAGTG ACAGTG
BC06_GCCAAT GCCAAT
BC07_CAGATC CAGATC
BC08_ACTTGA ACTTGA
BC09_GATCAG GATCAG
BC10_TAGCTT TAGCTT
BC11_GGCTAC GGCTAC
BC12_CTTGTA CTTGTA' > bcfile.txt

for flowcell in $FC
do
for lane in $LANES
do
paste ${flowcell}_${lane}_1_sequence.txt ${flowcell}_${lane}_2_sequence.txt \
| tr -d '\t' \
| fastx_barcode_splitter.pl --bcfile bcfile.txt --prefix ${flowcell}_${lane}_ --suffix .txt --eol &

done

done
wait

So you can see that we start by pasting read 1 and the index read side by side, strip the tab that paste inserts, and pipe the result straight into the fastx_barcode_splitter.pl script, which demultiplexes the data by the 12 barcodes specified in the bcfile. If there are any lines missing from either read 1 or the index read, the two files will no longer line up and barcodes will be attached to the wrong reads. The script runs each lane in parallel, so be sure to use a computer with plenty of processors. For example, in the above script I've specified all 8 lanes on 2 flow cells, so I will be using 16 processors. OK, so we've demultiplexed, and now we need to trim off the 6 nt barcode.
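
To make the paste step concrete, here is a small sketch that runs it on a single made-up record. The read names, sequences, qualities and the toy_*.txt file names are all invented for illustration. It shows where the index ends up once the tab is removed.

```shell
# Toy example: one invented FASTQ record per file (not real data).
workdir=$(mktemp -d)
printf '@read1/1\nACGTACGTAC\n+\nhhhhhhhhhh\n' > "$workdir/toy_1_sequence.txt"
printf '@read1/2\nATCACG\n+\nhhhhhh\n' > "$workdir/toy_2_sequence.txt"

# Same paste and tr step as in the loop above, without the splitter
merged=$(paste "$workdir/toy_1_sequence.txt" "$workdir/toy_2_sequence.txt" | tr -d '\t')
printf '%s\n' "$merged"

# Line 2 of the merged record is the insert followed by the index read
seq_line=$(printf '%s\n' "$merged" | sed -n '2p')
# Pull out the last 6 characters of that line
barcode=$(printf '%s\n' "$seq_line" | sed 's/.*\(.\{6\}\)$/\1/')
echo "merged sequence: $seq_line"
echo "last 6 nt: $barcode"

rm -r "$workdir"
```

The merged sequence line is the 10 nt insert followed directly by ATCACG, which is why the splitter is told to look for the barcode at the end of the line with --eol.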

for dataset in `ls FC*BC*.txt | sed 's/\.txt$//'`
do
fastx_trimmer -t 6 -i ${dataset}.txt -o ${dataset}_trim.txt &
done
wait

fastx_trimmer removes the 6 nt barcode from the end of each sequence and writes the result to a file with the suffix "_trim.txt". The loop trims every file whose name starts with FC, contains BC and ends with .txt, which covers all the unambiguously demultiplexed datasets. Use caution with "&": it sends every job into the background at once, so if you're not working on a big server, you might crash the computer.
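
If you are on a smaller machine, one option is to cap the number of simultaneous jobs rather than backgrounding all of them. Here is a rough sketch, assuming an xargs that supports the -P option (the GNU and BSD versions do, but it is not part of POSIX). MAX_JOBS and trim_limited are hypothetical names chosen for this example.

```shell
# MAX_JOBS is a hypothetical setting; pick a value at or below your core count.
MAX_JOBS=4

# trim_limited is a hypothetical helper. It trims every demultiplexed file in
# the current directory, running at most $1 fastx_trimmer processes at a time.
trim_limited() {
    for f in FC*BC*.txt
    do
        # Skip the literal pattern if the glob matched nothing
        [ -e "$f" ] || continue
        # Skip output left over from an earlier trimming run
        case "$f" in *_trim.txt) continue ;; esac
        # Print the file name without its .txt extension
        echo "${f%.txt}"
    done | xargs -P "$1" -I{} fastx_trimmer -t 6 -i {}.txt -o {}_trim.txt
}

# Example call:
# trim_limited "$MAX_JOBS"
```

Unlike the "&" loop, xargs only returns once all jobs have finished, so no separate wait is needed.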
