By the time I started my first postdoc, technology had moved on a little. We still did Sanger sequencing but the radioactive label had been replaced with coloured dyes and a laser scanner, which allowed automated reading of the sequence. During my second postdoc, this same technology was being applied to the shotgun sequencing of complete bacterial genomes. Assembling the sequence reads into contigs was pretty straightforward: there were a few software packages around, but most people used a pipeline of Phred (to call base qualities), Phrap (to assemble the reads) and Consed (for manual editing and gap-filling strategy).
The last time I worked directly on a project with sequencing data was around 2005. Jump forward 5 years to the BioStar bioinformatics Q&A forum and you’ll find many questions related to sequencing. But not sequencing as I knew it. No, this is so-called next-generation sequencing, or NGS. Suddenly, I realised that I am no longer a sequencing expert. In fact:
I am a relic from the Sanger era
I resolved to improve this state of affairs. There is plenty of publicly available NGS data, some of it relevant to my current work, and my organisation is predicting a move away from microarrays and towards NGS in 2012. So I figured: what better way to educate myself than to blog about it as I go along?
This is part 1 of a 4-part series and in this installment, we’ll look at how to get hold of public NGS data.
For these blog posts, I thought we’d stay with the theme of #arseniclife. A draft genome sequence for the organism in question, Halomonas sp. GFAJ-1, was recently assembled from NGS data and made publicly available.
My starting point is this page at the NCBI Sequence Read Archive (SRA), a database saved from closure this year when the curators secured extra funding. There’s a lot of useful information on this page, some of which is hidden under “more” links.
First, details of the DNA library:
Strategy: WGS
Source: GENOMIC
Selection: RANDOM
Layout: PAIRED, Orientation: , Nominal length: 150, Nominal Std Dev: 30
Second, details of the sequencing platform:
Instrument model: Illumina HiSeq 2000
Spot descriptor: -> 1 forward 102 reverse <-
Total: 1 run, 65.1M spots, 13.2G bases
Download reads for this experiment in sra (7.0G) or sra-lite (7.0G) formats

#   Run         # of Spots    # of Bases
1.  SRR385952   65,107,412    13.2G
Data are available in 2 formats: sra and sra-lite. There isn’t much to choose between them, except that sra contains some extra information such as intensity scores. Note that both are quite large files: after processing they’ll become even larger. So the first rule of NGS work: you need powerful hardware with plenty of storage and memory. I’m currently working on a shared server with access to 2 TB of storage, 96 GB RAM and 24 CPUs – and it’s just about enough.
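Incidentally, before downloading anything it’s worth confirming that you actually have the room. A quick look at free space on the filesystem where you plan to work:

df -h ~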
There are software packages which handle SRA format, but most people convert to a widely-used NGS format called FASTQ. To do this we use the SRA toolkit provided by NCBI. There’s no pre-compiled 64-bit version, but it was easy enough to compile on my machine, just by following the instructions in README-build:
cd
wget http://trace.ncbi.nlm.nih.gov/Traces/sra/static/sra_sdk-2.1.8.tar.gz
tar zxvf sra_sdk-2.1.8.tar.gz
cd sra_sdk-2.1.8
OUTDIR=/home/neil/sra
make OUTDIR="$OUTDIR" out
make static
make GCC debug
make
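Assuming the build completes without errors, the compiled binaries should land under the output directory; a quick listing (my own check) confirms that fastq-dump is among them:

ls ~/sra/bin64/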
This provides the program ~/sra/bin64/fastq-dump. We get the SRA file from the NCBI FTP site:
mkdir -p ~/projects/gfaj1
cd ~/projects/gfaj1
wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX109/SRX109792/SRR385952/SRR385952.sra
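Once the download finishes, it’s worth checking that the file size matches the ~7.0 GB quoted on the SRA page:

ls -lh SRR385952.sra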
Now the fun begins. First, don’t do this:
~/sra/bin64/fastq-dump SRR385952.sra
If you do and then examine the resulting file SRR385952.fastq, you’ll see sequences with length = 202. However, the sequencing platform details above tell us to expect forward and reverse sequences, each with length = 101. These are joined in the SRA file and need to be split, which you can do in one of two ways:
# 1 file
~/sra/bin64/fastq-dump --split-spot SRR385952.sra
# 2 files
~/sra/bin64/fastq-dump --split-files SRR385952.sra
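Either way, a quick sanity check (my own one-liner, run here on the --split-spot output) confirms that reads are now 101 bases rather than 202. It samples the first 1,000 records and prints the distinct sequence lengths; the only output should be 101:

head -4000 SRR385952.fastq | awk 'NR % 4 == 2 { print length($0) }' | sort -u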
Note: the documentation for --split-files is confusing. It reads “Dump each read into separate file.” What that actually means is: dump into forward and reverse files. Note also: very poor documentation is a hallmark of NGS software; we’ll see plenty more of it later.
I ran fastq-dump with the --split-spot option, which generates a single file with the reverse read of each pair below the forward read for that pair. Whichever way you do it, there’s a problem:
head -8 SRR385952.fastq
@SRR385952.1 HWI-ST484_0123:8:1101:1154:2066 length=101
ATNCGTCCTTTATTCTGCCAGGGAATTGGCCCGTTCCGCTGGGCAGCGCTTTCCGGCGACCCGGAAGATATTTACAAAACCGACCAGAAGGTCAAAGAGCT
+SRR385952.1 HWI-ST484_0123:8:1101:1154:2066 length=101
BC#4ADDFHHHHHJHIJJJJJJJIJJIJJJJJJJJHGGHIJJJJIJJJIJJHHHHFEDDDDDDDDDDDDDDEECDCCCBDBD>DDDDDDDDCCCCCDDDDA
@SRR385952.1 HWI-ST484_0123:8:1101:1154:2066 length=101
GGTCGTCGGGGATCAGCTCTTTGACCTTCTGGTCGGTTTTGTAAATATCTTCCGGGTCGCCGGAAAGCGCTGCCCAGCGGAACGGGCCAATTCCCTGGCAG
+SRR385952.1 HWI-ST484_0123:8:1101:1154:2066 length=101
CCBFFFFFHHHHHJJJJJJJJJJJJIIJJJJIEHIJHIJJIHHIJJIIGIIIHHHFDDDDDDDDDDDDDDBDDDDDDDDDBDDDDDDDDBDDCDDDDBDDD
Each sequence in a FASTQ file is described using 4 lines: a header, the sequence, a header for quality scores and the scores themselves. So by using head -8 we see the first pair of forward + reverse reads. Unfortunately, they have the same ID in the header line.
One convention for identifying paired reads is to suffix the ID using either /1 or /2. We can do that using sed:
sed -i '1~8s/^@SRR385952\.[0-9]*/&\/1/' SRR385952.fastq
sed -i '5~8s/^@SRR385952\.[0-9]*/&\/2/' SRR385952.fastq
To explain that: headers for forward reads are on lines 1, 9, 17… and those for reverse on lines 5, 13, 21… So using sed, we append “/1” to the ID on lines 1, 9, 17… and “/2” to the ID on lines 5, 13, 21… Result:
head -8 SRR385952.fastq
@SRR385952.1/1 HWI-ST484_0123:8:1101:1154:2066 length=101
ATNCGTCCTTTATTCTGCCAGGGAATTGGCCCGTTCCGCTGGGCAGCGCTTTCCGGCGACCCGGAAGATATTTACAAAACCGACCAGAAGGTCAAAGAGCT
+SRR385952.1 HWI-ST484_0123:8:1101:1154:2066 length=101
BC#4ADDFHHHHHJHIJJJJJJJIJJIJJJJJJJJHGGHIJJJJIJJJIJJHHHHFEDDDDDDDDDDDDDDEECDCCCBDBD>DDDDDDDDCCCCCDDDDA
@SRR385952.1/2 HWI-ST484_0123:8:1101:1154:2066 length=101
GGTCGTCGGGGATCAGCTCTTTGACCTTCTGGTCGGTTTTGTAAATATCTTCCGGGTCGCCGGAAAGCGCTGCCCAGCGGAACGGGCCAATTCCCTGGCAG
+SRR385952.1 HWI-ST484_0123:8:1101:1154:2066 length=101
CCBFFFFFHHHHHJJJJJJJJJJJJIIJJJJIEHIJHIJJIHHIJJIIGIIIHHHFDDDDDDDDDDDDDDBDDDDDDDDDBDDDDDDDDBDDCDDDDBDDD
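A side note on portability: the 1~8 and 5~8 step addresses are a GNU sed extension. If your sed doesn’t support them, a rough awk equivalent (my own sketch; unlike sed -i, it writes to a new file rather than editing in place) would be:

# append /1 to headers on lines 1, 9, 17... and /2 to headers on lines 5, 13, 21...
awk 'NR % 8 == 1 { sub(/^@SRR385952\.[0-9]+/, "&/1") }
     NR % 8 == 5 { sub(/^@SRR385952\.[0-9]+/, "&/2") }
     { print }' SRR385952.fastq > SRR385952.ids.fastq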
There’s one more thing to note about fastq-dump. By default, the quality score offset is 33. What does that mean? Let’s look at the first five characters in the quality line for the first sequence:
character    ASCII code (decimal)    ASCII code - 33
B            66                      33
C            67                      34
#            35                      2
4            52                      19
A            65                      32
So to get the quality score, just subtract 33 from the decimal ASCII code of the character. We can check our results by looking at the read back at the SRA website; see image, left.
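We can also check this at the command line. This one-liner (my own addition) uses od to print the decimal ASCII codes of those first five quality characters:

head -4 SRR385952.fastq | tail -1 | cut -c1-5 | tr -d '\n' | od -An -tu1

The output should be 66 67 35 52 65; subtract 33 from each and you recover the scores in the table above.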
Schemes for FASTQ encoding of quality scores are, apparently, a bit of a mess. We’ll see in later posts that some software tools expect certain encodings (and even have completely undocumented features to deal with other encoding schemes). For now though, it’s enough to know that our FASTQ file is in agreement with the SRA data and move on.
So what have we achieved?
ls -l SRR385952.fastq
-rw-r--r-- 1 neil neil 43719072816 2011-12-22 14:56 SRR385952.fastq
We’ve generated a large (~ 44 GB) FASTQ file, containing paired reads with unique header IDs. It was quite a bit of work simply to fetch and format the raw data.
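One last sanity check (my own addition): each FASTQ record occupies 4 lines, so the line count divided by 4 should equal twice the spot count, since each of the 65,107,412 spots yields a forward and a reverse read:

echo $(( $(wc -l < SRR385952.fastq) / 4 ))
# expect 130214824 (2 x 65,107,412)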
In the next installment: manipulation of FASTQ files and sequence quality assessment.
Nice to see that picture again…
An excellent blog post, I’d not really thought through the implications of NGS in this way. Thanks Neil, I’m now going to have to rewrite one of my lectures :)
Always glad to help :) And secretly, I enjoyed seeing and using that image again too. First time I’ve opened the thesis since…well, I wrote it.
Hi Neil,
This is an excellent way for people like me who are not into NGS to learn about it. I have a few (silly?) questions:
1. Why is the quality score offset set to 33?
2. What do the letters in lines 4 and 8 represent, other than quality scores?
Looking forward to the other three posts…
Thanks
Good questions, which I hope I can answer – I’m an NGS beginner too!
The offset of 33 gives a Phred quality score. ASCII codes from 33-126 are used, allowing scores from 0 – 93. However, other encodings can be used. I guess the idea of using ASCII codes is to compress the data (single letters = 1 byte, double digits = 2 bytes). This is all covered on the Wikipedia page for FASTQ format.
So far as I know, the letters in lines 4 and 8 (and 12, 16, 20…) have no meaning other than as quality scores. However, the symbols in the header lines (1 and 3) can carry meaning about the platform used – again, see the Wikipedia FASTQ entry.
Also consider Aspera for faster downloads from SRA/NCBI. I did a short post on bulk downloading that can help speed things up.
Hey Neil,
Nice post. I second the use of Aspera. Illumina has a nasty habit of changing ASCII offsets between platform/pipeline upgrades. Are you going to expand on the meaning/utility of paired-end reads? The definition of paired-end reads differs between Illumina and 454. In terms of sequence quality metrics for Illumina data, something that is often overlooked is the s.d. of the paired-end spacer distances. A wide or bimodal spacer distance distribution can cause downstream assembly issues.
Thanks for the comments: I’ll try to address paired-ends in the next post, when I understand them better myself.
No problems. I’m very interested in this series. Particular points of interest are 1) the high (170X) coverage, 2) the use of the MIRA assembler, which follows the overlap-layout-consensus paradigm, and 3) the small number of contigs (NCBI does not allow submission of scaffolds). When I have a chance, and to complement what you’re doing, I’ll assemble the reads with my pet assembler (Ray). I’ve also been developing an in-house annotation system and could compare its results with the NCBI automated annotation system.
Excellent! I’ve been mucking around with a few assemblers using the GFAJ-1 data and was hoping someone else might be too. Haven’t got very far yet, since I’m (a) still learning how best to use existing assemblers and (b) yet to filter the reads effectively, but so far ABySS has been the most promising.