Monday, 19 January 2015

SRA toolkit tips and workarounds

The Sequence Read Archive (SRA, formerly the Short Read Archive) is the main repository for raw next-generation sequencing (NGS) data. Considering the sheer and still accelerating rate at which NGS data is generated, the team at NCBI should be congratulated for providing this service to the scientific community. Take a look at the growth of SRA:
Growth of SRA data (http://www.ncbi.nlm.nih.gov/Traces/sra/i/g.png)

However, SRA doesn't directly provide the fastq files that we commonly work with; it prefers the .sra archive format, which requires specialised software (sra-toolkit) to extract. Sra-toolkit has been described as buggy and painful, and I've had my frustrations with it. In this post, I'll share the best sra-toolkit tips that I've found.

Get the right version of the software and configure it

When downloading, make sure you get the newest version from the NCBI website (link). Don't install it from GitHub or from the Ubuntu software centre (or apt-get), as it will probably be an older version. In the binary directory (which looks like /path/to/sratoolkit.2.4.3-ubuntu64/bin) there will be a file called sratoolkit.jar. On Linux, use "java -jar sratoolkit.jar" to open the graphical interface. In the preferences menu, enable the local repository and select a path for it. By doing this, you can then use sra-toolkit to "stream" fastq data (see below).

EDIT: if you are seeing an error like this one:

/data/app/sratoolkit.2.4.3-ubuntu64/bin/fastq-dump --split-files -A ERR366438

2015-02-15T21:47:01 fastq-dump.2.4.3 err: binary large object corrupt while reading binary large object within virtual database module - failed ERR366438

=============================================================
An error occurred during processing.
A report was generated into the file '/data/home/ncbi_error_report.xml'.
If the problem persists, you may consider sending the file
to 'sra@ncbi.nlm.nih.gov' for assistance.
=============================================================

Then grab the new sra-toolkit version 2.4.4, which seems to fix problems with SRA archives that use reference-based compression (when submitters provide data in aligned BAM format).


Try streaming the data

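You can convert sra to fastq on-the-fly by doing either of the following. The first writes the fastq into an output directory (-O names a directory, not a file); the second writes to standard output (-Z), which you can redirect to a file:

fastq-dump -A SRR1722641 -O SRR1722641_fastq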

fastq-dump -A SRR900186 -Z > SRR900186.fastq

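Streaming paired-end data can be problematic. Use the following to save forward and reverse reads as separate files. Here -X 10 limits the dump to the first 10 spots for a quick test, so drop it to process the whole run. Thanks to the folks at Biostars for this idea.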

SRR=SRR1041311 ; fastq-dump -X 10 --split-files -I -Z $SRR \
| tee >(grep '@.*\.1\s' -A3 --no-group-separator \
> ${SRR}_1.fastq) >(grep '@.*\.2\s' -A3 --no-group-separator \
> ${SRR}_2.fastq) >/dev/null
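
The one-liner above relies on bash process substitution and GNU grep. As a rough sketch of the same idea in plain POSIX sh, the hypothetical helper below (split_reads is my own name, not part of sra-toolkit) uses awk instead of grep. It assumes each record is exactly four lines and that the first field of each header ends in .1 or .2, which is what the -I option should produce.

```shell
# split_reads PREFIX: read 4-line FASTQ records on stdin and route each
# record to PREFIX_1.fastq or PREFIX_2.fastq, based on the read-number
# suffix (.1 or .2) at the end of the first header field.
# Hypothetical helper; assumes strict 4-line records.
split_reads() {
    prefix=$1
    # Start from empty output files.
    : > "${prefix}_1.fastq"
    : > "${prefix}_2.fastq"
    awk -v p="$prefix" '
        NR % 4 == 1 {
            # Header line: pick the destination for this whole record.
            out = ""
            if ($1 ~ /\.1$/) out = p "_1.fastq"
            else if ($1 ~ /\.2$/) out = p "_2.fastq"
        }
        out != "" { print > out }
    '
}

# Intended usage (not run here, needs sra-toolkit and a network connection):
#   fastq-dump --split-files -I -Z SRR1041311 | split_reads SRR1041311
```

Because awk decides the destination once per record, a quality line that happens to start with @ cannot be mistaken for a header.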

Use a download accelerator

The SRA team actually recommends using Aspera Connect to speed up the download of SRA files. If the stream isn't working for you, give Aspera a try using this script. If you struggle to get Aspera configured, you can try a download accelerator such as axel or aria2c. Here's an example with axel.

axel -n5 ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX709/SRX709649/SRR1585277/SRR1585277.sra

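After downloading the SRA archive, dump the fastq from the local file. With --split-files and without -Z, the forward and reverse reads are written to separate _1 and _2 files:

fastq-dump --split-files SRR1585277.sra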

Via the browser

Here are two useful approaches suggested by SeqAnswers. You can download each fastq.gz file individually from your web browser (not the command-line interface) by replacing the digits after SRR in this link:

http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=dload&run_list=SRR515925&format=fastq

or batch download with a link like:

http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=dload&run_list=SRR294514,SRR352727,SRR364895&format=fastq
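
If you have a long list of runs, typing the comma-separated link by hand gets tedious. Here is a small sketch that assembles it for you; sra_batch_url is a hypothetical helper name of mine, and it simply assumes the sra.cgi query string shown above.

```shell
# sra_batch_url RUN [RUN...]: print a download link covering every run
# given on the command line, using the query-string pattern quoted above.
# Hypothetical helper; it only builds the text of the link.
sra_batch_url() {
    # printf repeats its format for each argument: "A,B,C,"
    list=$(printf '%s,' "$@")
    list=${list%,}   # drop the trailing comma
    printf 'http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=dload&run_list=%s&format=fastq\n' "$list"
}

# Example: sra_batch_url SRR294514 SRR352727 SRR364895
```

Paste the printed link into your browser as before.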

Alternatively, find your study accession number (e.g. SRP013698) and go to the SRA run selector:

http://trace.ncbi.nlm.nih.gov/Traces/study/?go=home

Search with your SRP number, then click on the "Run" link. Click on the "Reads" tab, then click "Filtered Download", change the format to "FASTQ" and hit "Download".

SRA mirrors

Most of the data on SRA is mirrored at ENA or DNAnexus.

You can download the compressed fastq files from ENA for forward and reverse reads:

ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR504/SRR504687/SRR504687_1.fastq.gz
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR504/SRR504687/SRR504687_2.fastq.gz
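
The ENA path is built from the first six characters of the run accession followed by the full accession. A minimal sketch of that pattern is below. The helper name ena_fastq_url is hypothetical, and I've deliberately restricted it to accessions shaped like the example (three letters plus six digits), since longer accessions may sit under a different directory layout.

```shell
# ena_fastq_url RUN MATE: print the ENA FTP URL for one mate (1 or 2) of a
# paired-end run. Hypothetical helper; only accepts accessions made of
# three letters plus six digits, like SRR504687.
ena_fastq_url() {
    run=$1
    mate=$2
    case $run in
        [A-Z][A-Z][A-Z][0-9][0-9][0-9][0-9][0-9][0-9]) ;;
        *) echo "unsupported accession: $run" >&2; return 1 ;;
    esac
    # Directory name is the first six characters, e.g. SRR504.
    dir=$(printf '%s' "$run" | cut -c1-6)
    printf 'ftp://ftp.sra.ebi.ac.uk/vol1/fastq/%s/%s/%s_%s.fastq.gz\n' \
        "$dir" "$run" "$run" "$mate"
}

# Example: ena_fastq_url SRR504687 1
```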

You can download the SRA archive from DNAnexus too. The archive itself sits on the NCBI FTP server at a path like this:

ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR504/SRR504687/SRR504687.sra
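
That FTP path also follows a regular pattern: the three-letter prefix, then the first six characters, then the full accession. Here is a sketch that derives it from a run accession so it can be fed to a downloader. The name sra_byrun_url is hypothetical, and I'm assuming every run follows the same layout as the example above.

```shell
# sra_byrun_url RUN: print the ByRun FTP URL of the .sra archive for a run,
# following the directory pattern of the SRR504687 example.
# Hypothetical helper; assumes the layout is the same for every run.
sra_byrun_url() {
    run=$1
    kind=$(printf '%s' "$run" | cut -c1-3)   # e.g. SRR
    dir=$(printf '%s' "$run" | cut -c1-6)    # e.g. SRR504
    printf 'ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/%s/%s/%s/%s.sra\n' \
        "$kind" "$dir" "$run" "$run"
}

# Intended usage with a download accelerator (not run here):
#   axel -n5 "$(sra_byrun_url SRR504687)"
```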

Stream directly into your analysis pipeline

You can send the data straight through your QC and alignment pipeline without saving intermediate files. Here is an example using SRA toolkit for Olego alignment:

fastq-dump -A SRR764858 -Z \
| fastq_quality_trimmer -l 25 -t 20 -Q33 \
| olego -t 8 Athaliana.TAIR10.23.dna_sm.genome.fa - \
| samtools view -uSh - \
| samtools sort - SRR764858_sra.sort

And another using curl from EBI:

curl ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR764/SRR764858/SRR764858.fastq.gz \
| pigz -d | fastq_quality_trimmer -t 20 -l 25 -Q33 \
| olego -t 8 Athaliana.TAIR10.23.dna_sm.genome.fa - \
| samtools view -uSh - \
| samtools sort - SRR764858_ebi.sort

Dump color-space sequence

Occasionally you'll come across data in color-space format. After downloading the SRA archive, do the following.

abi-dump -A SRR1657115.sra

That will dump the sequence in color-space fasta format (SRR1657115.sra.csfasta) and the quality string (SRR1657115.sra.qual) in separate files. Then I use solid-trimmer.py to do quality trimming. Here's an example:

solid-trimmer.py -c SRR1657115.sra.csfasta -q SRR1657115.sra.qual --moving-average 7:12 --min-read-length 25 > SRR1657115.fasta

Happy data mining!