Sequencing for relics from the Sanger era part 1: getting the raw data

Sequencing in the good old days

In another life, way back in the mists of time, I did a Ph.D. Part of my project was to sequence a bacterial gene which encoded an enzyme involved in nitrite metabolism. It took the best part of a year to obtain ~ 2 000 bp of DNA sequence: partly because I was rubbish at sequencing, but also because of the technology at the time. It was an elegant biochemical technique called the dideoxy chain termination method, or “Sanger sequencing” after its inventor. Sequence was visualized by exposing radioactively-labelled DNA to X-ray film, resulting in images like the one at left, from my thesis. Yes, that photograph is glued in place. The sequence was read manually, by placing the developed film on a light box, moving a ruler and writing down the bases.

By the time I started my first postdoc, technology had moved on a little. We still did Sanger sequencing but the radioactive label had been replaced with coloured dyes and a laser scanner, which allowed automated reading of the sequence. During my second postdoc, this same technology was being applied to the shotgun sequencing of complete bacterial genomes. Assembling the sequence reads into contigs was pretty straightforward: there were a few software packages around, but most people used a pipeline of Phred (to call base qualities), Phrap (to assemble the reads) and Consed (for manual editing and gap-filling strategy).

The last time I worked directly on a project with sequencing data was around 2005. Jump forward 5 years to the BioStar bioinformatics Q&A forum and you’ll find many questions related to sequencing. But not sequencing as I knew it. No, this is so-called next-generation sequencing, or NGS. Suddenly, I realised that I am no longer a sequencing expert. In fact:

I am a relic from the Sanger era

I resolved to improve this state of affairs. There is plenty of publicly-available NGS data, some of it relevant to my current work and my organisation is predicting a move away from microarrays and towards NGS in 2012. So I figured: what better way to educate myself than to blog about it as I go along?

This is part 1 of a 4-part series and in this installment, we’ll look at how to get hold of public NGS data.

For these blog posts, I thought we’d stay with the theme of #arseniclife. A draft genome sequence for the organism in question, Halomonas sp. GFAJ-1, was recently assembled from NGS data and made publicly available.

My starting point is this page at the NCBI Sequence Read Archive (SRA), a database saved from closure this year when the curators secured extra funding. There’s a lot of useful information on this page, some of which is hidden under “more” links.

First, details of the DNA library:

Strategy:  WGS
Source:    GENOMIC
Selection: RANDOM
Layout:    PAIRED, Orientation: , Nominal length: 150, Nominal Std Dev: 30

Second, details of the sequencing platform:

Instrument model: Illumina HiSeq 2000
Spot descriptor:  -> 1  forward 102  reverse <-
Total:            1 run, 65.1M spots, 13.2G bases
    Download reads for this experiment in sra(7.0G) or sra-lite(7.0G) formats 
#       Run          # of Spots    # of Bases
1.      SRR385952    65,107,412    13.2G

Data are available in 2 formats: sra and sra-lite. There isn’t much to choose between them, except that sra contains some extra information such as intensity scores. Note that both are quite large files: after processing they’ll become even larger. So the first rule of NGS work: you need powerful hardware with plenty of storage and memory. I’m currently working on a shared server with access to 2 TB of storage, 96 GB RAM and 24 CPUs – and it’s just about enough.

There are software packages which handle SRA format, but most people convert to a widely-used NGS format called FASTQ. To do this we use the SRA toolkit provided by NCBI. There’s no pre-compiled 64-bit version, but it was easy enough to compile on my machine, just by following the instructions in README-build:

cd
wget http://trace.ncbi.nlm.nih.gov/Traces/sra/static/sra_sdk-2.1.8.tar.gz
tar zxvf sra_sdk-2.1.8.tar.gz
cd sra_sdk-2.1.8
OUTDIR=/home/neil/sra
make OUTDIR="$OUTDIR" out
make static
make GCC debug
make

This provides the program ~/sra/bin64/fastq-dump. We get the SRA file from the NCBI FTP site:

mkdir -p ~/projects/gfaj1
cd ~/projects/gfaj1
wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX109/SRX109792/SRR385952/SRR385952.sra

Now the fun begins. First, don’t do this:

~/sra/bin64/fastq-dump SRR385952.sra

If you do and then examine the resulting file SRR385952.fastq, you’ll see sequences with length = 202. However, you’ll see from the sequencing platform details earlier that there should be forward and reverse sequences, each with length = 101. These are joined in the SRA file and need to be split, which you can do in one of two ways:

# 1 file
~/sra/bin64/fastq-dump --split-spot SRR385952.sra
# 2 files
~/sra/bin64/fastq-dump --split-files SRR385952.sra
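Either way, it’s worth a quick sanity check that the split actually happened. A small helper of my own (not part of the SRA toolkit) tallies the read lengths; a correct split shows 101 for every read, not 202:

```shell
# Tally read lengths in a FASTQ file; the sequence is line 2 of
# every 4-line record.
fastq_length_histogram() {
  awk 'NR % 4 == 2 { n[length($0)]++ } END { for (l in n) print l, n[l] }' "$1"
}

# e.g. fastq_length_histogram SRR385952.fastq
```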

Note: the documentation for --split-files is confusing. It reads “Dump each read into separate file.” What that actually means is: dump into forward and reverse files. Note also: very poor documentation is a hallmark of NGS software; we’ll see plenty more of it later.

I ran fastq-dump with the --split-spot option, which generates a single file with the reverse read of each pair below the forward read for that pair. Whichever way you do it, there’s a problem:

head -8 SRR385952.fastq
@SRR385952.1 HWI-ST484_0123:8:1101:1154:2066 length=101
ATNCGTCCTTTATTCTGCCAGGGAATTGGCCCGTTCCGCTGGGCAGCGCTTTCCGGCGACCCGGAAGATATTTACAAAACCGACCAGAAGGTCAAAGAGCT
+SRR385952.1 HWI-ST484_0123:8:1101:1154:2066 length=101
BC#4ADDFHHHHHJHIJJJJJJJIJJIJJJJJJJJHGGHIJJJJIJJJIJJHHHHFEDDDDDDDDDDDDDDEECDCCCBDBD>DDDDDDDDCCCCCDDDDA
@SRR385952.1 HWI-ST484_0123:8:1101:1154:2066 length=101
GGTCGTCGGGGATCAGCTCTTTGACCTTCTGGTCGGTTTTGTAAATATCTTCCGGGTCGCCGGAAAGCGCTGCCCAGCGGAACGGGCCAATTCCCTGGCAG
+SRR385952.1 HWI-ST484_0123:8:1101:1154:2066 length=101
CCBFFFFFHHHHHJJJJJJJJJJJJIIJJJJIEHIJHIJJIHHIJJIIGIIIHHHFDDDDDDDDDDDDDDBDDDDDDDDDBDDDDDDDDBDDCDDDDBDDD

Each sequence in a FASTQ file is described using 4 lines: a header, the sequence, a header for quality scores and the scores themselves. So by using head -8 we see the first pair of forward + reverse reads. Unfortunately, they have the same ID in the header line.
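Since each read occupies exactly four lines, counting the reads in a file is just a line count divided by four. A throwaway helper of my own:

```shell
# Count reads in a FASTQ file: one read per 4-line record
count_fastq_reads() {
  echo $(( $(wc -l < "$1") / 4 ))
}

# e.g. count_fastq_reads SRR385952.fastq
```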

One convention for identifying paired reads is to suffix the ID using either /1 or /2. We can do that using sed:

sed -i '1~8s/^@SRR385952\.[0-9]*/&\/1/' SRR385952.fastq
sed -i '5~8s/^@SRR385952\.[0-9]*/&\/2/' SRR385952.fastq

To explain that: headers for forward reads are on lines 1, 9, 17… and those for reverse on lines 5, 13, 21… So using sed, we append “/1” to the ID on lines 1, 9, 17… and “/2” to the ID on lines 5, 13, 21… Result:
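To check that the sed edits did what we intended, we can walk the same 8-line pattern with awk. This is my own quick check, assuming the interleaved forward/reverse layout described above:

```shell
# Forward headers sit on lines 1, 9, 17, ...; reverse on 5, 13, 21, ...
# Count headers whose first field lacks the expected /1 or /2 suffix.
check_pair_suffixes() {
  awk 'NR % 8 == 1 && $1 !~ /\/1$/ { bad++ }
       NR % 8 == 5 && $1 !~ /\/2$/ { bad++ }
       END { print bad + 0, "malformed headers" }' "$1"
}

# e.g. check_pair_suffixes SRR385952.fastq
```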

head -8 SRR385952.fastq
@SRR385952.1/1 HWI-ST484_0123:8:1101:1154:2066 length=101
ATNCGTCCTTTATTCTGCCAGGGAATTGGCCCGTTCCGCTGGGCAGCGCTTTCCGGCGACCCGGAAGATATTTACAAAACCGACCAGAAGGTCAAAGAGCT
+SRR385952.1 HWI-ST484_0123:8:1101:1154:2066 length=101
BC#4ADDFHHHHHJHIJJJJJJJIJJIJJJJJJJJHGGHIJJJJIJJJIJJHHHHFEDDDDDDDDDDDDDDEECDCCCBDBD>DDDDDDDDCCCCCDDDDA
@SRR385952.1/2 HWI-ST484_0123:8:1101:1154:2066 length=101
GGTCGTCGGGGATCAGCTCTTTGACCTTCTGGTCGGTTTTGTAAATATCTTCCGGGTCGCCGGAAAGCGCTGCCCAGCGGAACGGGCCAATTCCCTGGCAG
+SRR385952.1 HWI-ST484_0123:8:1101:1154:2066 length=101
CCBFFFFFHHHHHJJJJJJJJJJJJIIJJJJIEHIJHIJJIHHIJJIIGIIIHHHFDDDDDDDDDDDDDDBDDDDDDDDDBDDDDDDDDBDDCDDDDBDDD

There’s one more thing to note about fastq-dump. By default, the quality score offset is 33. What does that mean? Let’s look at the first five characters in the quality line for the first sequence:

character    ASCII code (decimal)    ASCII code - 33
B            66                      33
C            67                      34
#            35                       2
4            52                      19
A            65                      32

SRA sequence view with quality scores

So to get the quality score, just subtract 33 from the decimal ASCII code of the character. We can check our results by looking at the read back at the SRA website; see image, left.
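The same arithmetic can be done in the shell, using printf to get a character’s ASCII code. A throwaway helper, not part of any toolkit:

```shell
# Phred score of a quality character at offset 33.
# printf '%d' "'X" prints the ASCII code of the character X.
qchar_to_phred() {
  echo $(( $(printf '%d' "'$1") - 33 ))
}

qchar_to_phred B   # 33
qchar_to_phred '#' # 2
```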

Schemes for FASTQ encoding of quality scores are, apparently, a bit of a mess. We’ll see in later posts that some software tools expect certain encodings (and even have completely undocumented features to deal with other encoding schemes). For now though, it’s enough to know that our FASTQ file is in agreement with the SRA data and move on.
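One rough heuristic, sketched as a helper of my own: look at the range of quality characters in the file. A minimum code near 33 indicates a +33 (Sanger) offset, whereas a minimum of 64 or above suggests one of the older +64 Illumina encodings:

```shell
# Print the minimum and maximum ASCII codes seen on quality lines
# (line 4 of every 4-line record) of a FASTQ file.
quality_char_range() {
  awk 'NR % 4 == 0 {
         for (i = 1; i <= length($0); i++) seen[substr($0, i, 1)] = 1
       }
       END {
         # build a char-to-code lookup for printable ASCII
         for (c = 33; c <= 126; c++) code[sprintf("%c", c)] = c
         min = 127; max = 0
         for (ch in seen) {
           if (code[ch] < min) min = code[ch]
           if (code[ch] > max) max = code[ch]
         }
         print "min:", min, "max:", max
       }' "$1"
}

# e.g. quality_char_range SRR385952.fastq
```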

So what have we achieved?

ls -l SRR385952.fastq
-rw-r--r-- 1 neil neil 43719072816 2011-12-22 14:56 SRR385952.fastq

We’ve generated a large (~ 44 GB) FASTQ file, containing paired reads with unique header IDs. It was quite a bit of work simply to fetch and format the raw data.

In the next installment: manipulation of FASTQ files and sequence quality assessment.

9 thoughts on “Sequencing for relics from the Sanger era part 1: getting the raw data”

  1. Nice to see that picture again…
    An excellent blog post, I’d not really thought through the implications of NGS in this way. Thanks Neil, I’m now going to have to rewrite one of my lectures :)

    • Always glad to help :) And secretly, I enjoyed seeing and using that image again too. First time I’ve opened the thesis since…well, I wrote it.

  2. Hi Neil,
    This is an excellent way for people like me who are not into NGS to learn about this. I have few (silly?) questions:
    1. Why is the offset score set to 33?
2. What do the letters in lines 4 and 8 represent, other than quality scores?
    Looking forward to the other three posts…
    Thanks

    • Good questions, which I hope I can answer – I’m an NGS beginner too!

The offset of 33 gives a Phred quality score. ASCII codes from 33 to 126 are used, allowing scores from 0 to 93. However, other encodings can be used. I guess the idea of using ASCII codes is to compress the data (single letters = 1 byte, double digits = 2 bytes). This is all covered on the Wikipedia page for FASTQ format.

So far as I know, the letters in lines 4 and 8 (and 12, 16, 20…) have no meaning other than quality scores. However, the symbols in the header lines (1 and 3) can carry meaning about the platform used – again, see the Wikipedia FASTQ entry.

  3. Hey Neil,

    Nice post. I second the use of Aspera. Illumina has a nasty habit of changing ASCII offsets between platform/pipeline upgrades. Are you going to expand on the meaning/utility of paired-end reads? The definition of paired-end reads differs between Illumina and 454. In terms of sequence quality metrics for Illumina data something that is often overlooked/missed is s.d. of the paired-end spacer distances. A wide or bimodal spacer distance distribution can cause downstream assembly issues.

  4. No problems. I’m very interested in this series. Particular interests/flags are 1) the high (170X) coverage 2) the use of the MIRA assembler, which is of an overlap-layout-consensus paradigm 3) the small number of contigs (NCBI does not allow submission of scaffolds). When I have a chance, and to complement what you’re doing, I’ll assemble the reads with my pet assembler (Ray). I’ve also been developing an in-house annotation system and could compare its results vs the NCBI automated annotation system.

    • Excellent! I’ve been mucking around with a few assemblers using the GFAJ-1 data and was hoping someone else might be too. Haven’t got very far yet since I’m (a) still learning how best to use existing assemblers and (b) yet to filter the reads effectively but so far, Abyss was the most promising.

Comments are closed.