Sequencing for relics from the Sanger era part 1: getting the raw data

Sequencing in the good old days

In another life, way back in the mists of time, I did a Ph.D. Part of my project was to sequence a bacterial gene which encoded an enzyme involved in nitrite metabolism. It took the best part of a year to obtain ~2,000 bp of DNA sequence: partly because I was rubbish at sequencing, but also because of the technology at the time. It was an elegant biochemical technique called the dideoxy chain termination method, or “Sanger sequencing” after its inventor. The sequence was visualized by exposing radioactively-labelled DNA to X-ray film, resulting in images like the one at left, from my thesis. Yes, that photograph is glued in place. The sequence was read manually, by placing the developed film on a light box, moving a ruler and writing down the bases.

By the time I started my first postdoc, technology had moved on a little. We still did Sanger sequencing, but the radioactive label had been replaced with coloured dyes and a laser scanner, which allowed automated reading of the sequence. During my second postdoc, this same technology was being applied to the shotgun sequencing of complete bacterial genomes. Assembling the sequence reads into contigs was pretty straightforward: there were a few software packages around, but most people used a pipeline of Phred (to call bases and assign quality scores), Phrap (to assemble the reads) and Consed (for manual editing and gap-filling strategy).
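For anyone who never saw that pipeline in action, here is a minimal sketch of how it might be driven from Python. Everything here is for illustration only: the directory and file names are hypothetical, and the phred/phrap options are recalled from the classic documentation, so check them against a local install rather than taking them on trust.

```python
import subprocess
from pathlib import Path

# Hypothetical layout: ABI chromatograms in chromat_dir/, outputs written alongside.
chromat_dir = Path("chromat_dir")
phd_dir = Path("phd_dir")
phd_dir.mkdir(exist_ok=True)

# Step 1: phred reads the trace files, calls bases and quality values,
# and appends everything to a single FASTA file plus a matching .qual file.
# (Flags recalled from the old phred documentation -- treat as an assumption.)
subprocess.run(
    ["phred", "-id", str(chromat_dir), "-pd", str(phd_dir),
     "-sa", "reads.fasta", "-qa", "reads.fasta.qual"],
    check=True,
)

# Step 2: phrap assembles the reads into contigs; -new_ace writes an
# ace file that consed can open for manual editing and finishing.
subprocess.run(["phrap", "reads.fasta", "-new_ace"], check=True)
```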

The last time I worked directly on a project with sequencing data was around 2005. Jump forward 5 years to the BioStar bioinformatics Q&A forum and you’ll find many questions related to sequencing. But not sequencing as I knew it. No, this is so-called next-generation sequencing, or NGS. Suddenly, I realised that I am no longer a sequencing expert. In fact:

I am a relic from the Sanger era

I resolved to improve this state of affairs. There is plenty of publicly-available NGS data, some of it relevant to my current work, and my organisation is predicting a move away from microarrays and towards NGS in 2012. So I figured: what better way to educate myself than to blog about it as I go along?

This is part 1 of a 4-part series, and in this installment we’ll look at how to get hold of public NGS data.
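As a taste of what’s to come, here is a rough Python sketch of one way to find public NGS data programmatically: query NCBI’s SRA database through the E-utilities, then pull out run accessions that can be downloaded with the SRA Toolkit. The query term is just an example and the XML element names reflect my reading of current SRA records, so treat the details as assumptions rather than a recipe.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def search_sra(term, retmax=5):
    """Search the SRA database via ESearch; return a list of NCBI UIDs."""
    params = urllib.parse.urlencode({"db": "sra", "term": term, "retmax": retmax})
    with urllib.request.urlopen(f"{EUTILS}/esearch.fcgi?{params}") as response:
        root = ET.fromstring(response.read())
    return [elem.text for elem in root.findall(".//IdList/Id")]

def run_accessions(uid):
    """Fetch one SRA record via EFetch and extract its run accessions (SRR...)."""
    params = urllib.parse.urlencode({"db": "sra", "id": uid})
    with urllib.request.urlopen(f"{EUTILS}/efetch.fcgi?{params}") as response:
        root = ET.fromstring(response.read())
    return [run.get("accession") for run in root.iter("RUN")]

if __name__ == "__main__":
    # Example query; any SRA search term will do.
    for uid in search_sra("Escherichia coli[Organism]"):
        print(uid, run_accessions(uid))
    # The reads themselves can then be downloaded with the SRA Toolkit, e.g.
    #   fastq-dump SRR000001
```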

Draft genomes: pros and cons

2x genomes—Does depth matter?

Cat joins exclusive genome club – I’m waiting for RPM to respond to the phrase “have its DNA decoded”.

Interesting commentary in Genome Research by Phil Green, of Phrap fame. He argues that whilst useful information can be gleaned from low-coverage genome sequence, we are surely missing the most interesting regions:

More seriously, since very few features of any appreciable size (e.g., genes) will be completely covered, analyses requiring complete features cannot be carried out. In addition, as was noted above, whole-genome assemblies (of any depth) often fail to incorporate a significant fraction of the repetitive sequence in the genome. This is often considered to be a relatively minor deficiency, which may be true so long as the primary research focus is on broadly shared biological features. However, it is now apparent that repetitive sequence is a key agent of evolutionary change: Segmental duplications are likely the primary source of new genes…and recent evidence strongly suggests that transposable elements are important mechanisms of regulatory innovation…As researchers’ attention turns toward understanding differences among organisms rather than similarities, inadequate coverage of these types of sequences will become increasingly problematic.

The article concludes: “A major function of the 2x genomes will no doubt prove to be whetting users’ appetites for more complete sequences.”
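To put some rough numbers on Green’s point, here is a quick back-of-the-envelope calculation using the standard Lander-Waterman approximation. This is my addition, not from the article, and the genome and read sizes are assumed purely for illustration.

```python
import math

genome_size = 2.5e9   # a mammal-sized genome, assumed for illustration
read_length = 700     # a typical Sanger read length, assumed

for c in (2, 6, 8):
    p_covered = 1 - math.exp(-c)            # Lander-Waterman: P(a base is covered)
    missed_mb = genome_size * math.exp(-c) / 1e6
    n_reads = c * genome_size / read_length
    n_gaps = n_reads * math.exp(-c)         # expected number of coverage gaps
    print(f"{c}x coverage: {p_covered:.1%} of bases covered, "
          f"~{missed_mb:.0f} Mb missed across ~{n_gaps/1e6:.2f}M gaps")
```

At 2x, roughly 87% of bases are covered but the misses are scattered across something like a million gaps in a mammal-sized genome, so a feature spanning even a few kilobases is very unlikely to escape them entirely, which is exactly the problem with analyses that require complete features.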