By the time I started my first postdoc, technology had moved on a little. We still did Sanger sequencing but the radioactive label had been replaced with coloured dyes and a laser scanner, which allowed automated reading of the sequence. During my second postdoc, this same technology was being applied to the shotgun sequencing of complete bacterial genomes. Assembling the sequence reads into contigs was pretty straightforward: there were a few software packages around, but most people used a pipeline of Phred (to call base qualities), Phrap (to assemble the reads) and Consed (for manual editing and gap-filling strategy).
The last time I worked directly on a project with sequencing data was around 2005. Jump forward 5 years to the BioStar bioinformatics Q&A forum and you’ll find many questions related to sequencing. But not sequencing as I knew it. No, this is so-called next-generation sequencing, or NGS. Suddenly, I realised that I am no longer a sequencing expert. In fact:
I am a relic from the Sanger era
I resolved to improve this state of affairs. There is plenty of publicly-available NGS data, some of it relevant to my current work, and my organisation is predicting a move away from microarrays and towards NGS in 2012. So I figured: what better way to educate myself than to blog about it as I go along?
This is part 1 of a 4-part series and in this installment, we’ll look at how to get hold of public NGS data.
For these blog posts, I thought we’d stay with the theme of #arseniclife. A draft genome sequence for the organism in question, Halomonas sp. GFAJ-1, was recently assembled from NGS data and made publicly available.
My starting point is this page at the NCBI Short Read Archive (SRA), a database saved from closure this year when the curators secured extra funding. There’s a lot of useful information on this page, some of which is hidden under “more” links.
First, details of the DNA library:
Strategy: WGS
Source: GENOMIC
Selection: RANDOM
Layout: PAIRED, Orientation: , Nominal length: 150, Nominal Std Dev: 30
Second, details of the sequencing platform:
Instrument model: Illumina HiSeq 2000
Spot descriptor: -> 1 forward 102 reverse <-
Total: 1 run, 65.1M spots, 13.2G bases
Download reads for this experiment in sra (7.0G) or sra-lite (7.0G) formats

#    Run        # of Spots    # of Bases
1.   SRR385952  65,107,412    13.2G
Data are available in 2 formats: sra and sra-lite. There isn’t much to choose between them, except that sra contains some extra information such as intensity scores. Note that both are quite large files: after processing they’ll become even larger. So the first rule of NGS work: you need powerful hardware with plenty of storage and memory. I’m currently working on a shared server with access to 2 TB of storage, 96 GB RAM and 24 CPUs – and it’s just about enough.
There are software packages which handle SRA format, but most people convert to a widely-used NGS format called FASTQ. To do this we use the SRA toolkit provided by NCBI. There’s no pre-compiled 64-bit version, but it was easy enough to compile on my machine, just by following the instructions in README-build:
cd
wget http://trace.ncbi.nlm.nih.gov/Traces/sra/static/sra_sdk-2.1.8.tar.gz
tar zxvf sra_sdk-2.1.8.tar.gz
cd sra_sdk-2.1.8
OUTDIR=/home/neil/sra
make OUTDIR="$OUTDIR" out
make static
make GCC
make debug
make
This provides the program ~/sra/bin64/fastq-dump. We get the SRA file from the NCBI FTP site:
mkdir -p ~/projects/gfaj1
cd ~/projects/gfaj1
wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX109/SRX109792/SRR385952/SRR385952.sra
Now the fun begins. First, don’t do this – run fastq-dump with no options:

~/sra/bin64/fastq-dump SRR385952.sra
If you do, then on examining the resulting file SRR385952.fastq you’ll see sequences with length = 202. However, the sequencing platform details above indicate that there should be forward and reverse reads, each with length = 101. These are joined in the SRA file and need to be split, which you can do in one of two ways:
# 1 file
~/sra/bin64/fastq-dump --split-spot SRR385952.sra

# 2 files
~/sra/bin64/fastq-dump --split-files SRR385952.sra
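Either way, it’s worth confirming the read length afterwards. A minimal sketch, using a tiny mock FASTQ (hypothetical file name) so it runs anywhere; substitute SRR385952.fastq to check the real output:

```shell
# Build a small mock FASTQ with two reads of different lengths.
printf '@r1\nACGT\n+\nIIII\n@r2\nACGTACGT\n+\nIIIIIIII\n' > mock.fastq

# Line 2 of every 4-line record is the sequence; print its length.
awk 'NR % 4 == 2 { print length($0) }' mock.fastq
```

On the unsplit SRR385952 dump this reports 202 for every read; after --split-spot or --split-files it should report 101.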
Note: the documentation for --split-files is confusing. It reads “Dump each read into separate file.” What that actually means is: dump the forward and reverse reads into separate files. Note also: very poor documentation is a hallmark of NGS software; we’ll see plenty more of it later.
I ran fastq-dump with the --split-spot option, which generates a single file in which the reverse read of each pair follows the forward read. Whichever way you do it, there’s a problem:
head -8 SRR385952.fastq
@SRR385952.1 HWI-ST484_0123:8:1101:1154:2066 length=101
ATNCGTCCTTTATTCTGCCAGGGAATTGGCCCGTTCCGCTGGGCAGCGCTTTCCGGCGACCCGGAAGATATTTACAAAACCGACCAGAAGGTCAAAGAGCT
+SRR385952.1 HWI-ST484_0123:8:1101:1154:2066 length=101
BC#4ADDFHHHHHJHIJJJJJJJIJJIJJJJJJJJHGGHIJJJJIJJJIJJHHHHFEDDDDDDDDDDDDDDEECDCCCBDBD>DDDDDDDDCCCCCDDDDA
@SRR385952.1 HWI-ST484_0123:8:1101:1154:2066 length=101
GGTCGTCGGGGATCAGCTCTTTGACCTTCTGGTCGGTTTTGTAAATATCTTCCGGGTCGCCGGAAAGCGCTGCCCAGCGGAACGGGCCAATTCCCTGGCAG
+SRR385952.1 HWI-ST484_0123:8:1101:1154:2066 length=101
CCBFFFFFHHHHHJJJJJJJJJJJJIIJJJJIEHIJHIJJIHHIJJIIGIIIHHHFDDDDDDDDDDDDDDBDDDDDDDDDBDDDDDDDDBDDCDDDDBDDD
Each sequence in a FASTQ file is described using 4 lines: a header, the sequence, a header for quality scores and the scores themselves. So by using head -8 we see the first pair of forward + reverse reads. Unfortunately, they have the same ID in the header line.
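Because every record occupies exactly four lines, counting the reads in a FASTQ file is just a line count divided by four. A small self-contained sketch (mock file and name are hypothetical):

```shell
# Two 4-line records = 8 lines = 2 reads.
printf '@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n' > two_reads.fastq
echo $(( $(wc -l < two_reads.fastq) / 4 ))   # prints 2
```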
One convention for identifying paired reads is to suffix the ID using either /1 or /2. We can do that using sed:
sed -i '1~8s/^@SRR385952\.[0-9]*/&\/1/' SRR385952.fastq
sed -i '5~8s/^@SRR385952\.[0-9]*/&\/2/' SRR385952.fastq
To explain that: headers for forward reads are on lines 1, 9, 17… and those for reverse reads on lines 5, 13, 21… The first~step address is a GNU sed extension which selects every 8th line starting from line 1 (or 5), so we append “/1” to the ID on lines 1, 9, 17… and “/2” to the ID on lines 5, 13, 21… Result:
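A quick, self-contained way to convince yourself which lines the 1~8 and 5~8 addresses select is to run them over numbered input:

```shell
# Print only the matching lines from a numbered stream of 16 lines.
seq 16 | sed -n '1~8p'   # lines 1 and 9 (forward-read headers)
seq 16 | sed -n '5~8p'   # lines 5 and 13 (reverse-read headers)
```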
head -8 SRR385952.fastq
@SRR385952.1/1 HWI-ST484_0123:8:1101:1154:2066 length=101
ATNCGTCCTTTATTCTGCCAGGGAATTGGCCCGTTCCGCTGGGCAGCGCTTTCCGGCGACCCGGAAGATATTTACAAAACCGACCAGAAGGTCAAAGAGCT
+SRR385952.1 HWI-ST484_0123:8:1101:1154:2066 length=101
BC#4ADDFHHHHHJHIJJJJJJJIJJIJJJJJJJJHGGHIJJJJIJJJIJJHHHHFEDDDDDDDDDDDDDDEECDCCCBDBD>DDDDDDDDCCCCCDDDDA
@SRR385952.1/2 HWI-ST484_0123:8:1101:1154:2066 length=101
GGTCGTCGGGGATCAGCTCTTTGACCTTCTGGTCGGTTTTGTAAATATCTTCCGGGTCGCCGGAAAGCGCTGCCCAGCGGAACGGGCCAATTCCCTGGCAG
+SRR385952.1 HWI-ST484_0123:8:1101:1154:2066 length=101
CCBFFFFFHHHHHJJJJJJJJJJJJIIJJJJIEHIJHIJJIHHIJJIIGIIIHHHFDDDDDDDDDDDDDDBDDDDDDDDDBDDDDDDDDBDDCDDDDBDDD
There’s one more thing to note about fastq-dump. By default, the quality score offset is 33. What does that mean? Let’s look at the first five characters in the quality line for the first sequence:
character   ASCII code (decimal)   ASCII code - 33
B           66                     33
C           67                     34
#           35                     2
4           52                     19
A           65                     32
So to get the Phred quality score, just subtract 33 from the decimal ASCII code of the character. We can check our results by looking at the read back at the SRA website; see image, left.
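The table above can be reproduced in the shell: bash’s printf '%d' "'c" trick yields the ASCII code of a character, from which we subtract the offset.

```shell
# Decode Phred+33 quality characters: ASCII code minus 33.
for c in B C '#' 4 A; do
  printf '%s %d\n' "$c" $(( $(printf '%d' "'$c") - 33 ))
done
```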
Schemes for FASTQ encoding of quality scores are, apparently, a bit of a mess. We’ll see in later posts that some software tools expect certain encodings (and even have completely undocumented features to deal with other encoding schemes). For now though, it’s enough to know that our FASTQ file is in agreement with the SRA data and move on.
So what have we achieved?
ls -l SRR385952.fastq
-rw-r--r-- 1 neil neil 43719072816 2011-12-22 14:56 SRR385952.fastq
We’ve generated a large (~ 44 GB) FASTQ file, containing paired reads with unique header IDs. It was quite a bit of work simply to fetch and format the raw data.
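A back-of-envelope check that ~44 GB is the right ballpark, under some stated assumptions: the 13.2G bases are stored twice (once as sequence, once as quality characters), the 130.2M reads (65.1M spots × 2) each carry two header lines of roughly 55 characters, and each record contributes four newlines. The ~55-character header length is my estimate, not a figure from the data.

```shell
# Rough size estimate in bytes; the 55-character header length is an assumption.
echo $(( 2 * 13200000000 + 130200000 * (2 * 55 + 4) ))   # ~41 GB, close to the 43.7 GB observed
```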
In the next installment: manipulation of FASTQ files and sequence quality assessment.