January 27, 2012

Reproducible research: three links that made me think

I’m constantly amazed, bemused and troubled by how little published scientific research is genuinely reproducible, in that you or I (or even the original authors) could go back and check the results. Three examples from around the Web converged in my mind this week.
Read the rest…

January 1, 2012

2011 blog stats courtesy of WordPress.com

The kind people at WordPress.com have prepared a 2011 annual report for this blog.


Click here to see the complete report.

December 22, 2011

Sequencing for relics from the Sanger era part 1: getting the raw data

Sequencing in the good old days

In another life, way back in the mists of time, I did a Ph.D. Part of my project was to sequence a gene from a bacterium, which encoded an enzyme involved in nitrate metabolism. It took the best part of a year to obtain ~ 2 000 bp of DNA sequence: partly because I was rubbish at sequencing, but also because of the technology at the time. It was an elegant biochemical technique called the dideoxy chain termination method, or “Sanger sequencing” after its inventor. Sequence was visualized by exposing radioactively-labelled DNA to X-ray film, resulting in images like the one at left, from my thesis. Yes, that photograph is glued in place. The sequence was read manually, by placing the developed film on a light box, moving a ruler and writing down the bases.

By the time I started my first postdoc, technology had moved on a little. We still did Sanger sequencing but the radioactive label had been replaced with coloured dyes and a laser scanner, which allowed automated reading of the sequence. During my second postdoc, this same technology was being applied to the shotgun sequencing of complete bacterial genomes. Assembling the sequence reads into contigs was pretty straightforward: there were a few software packages around, but most people used a pipeline of Phred (to call base qualities), Phrap (to assemble the reads) and Consed (for manual editing and gap-filling strategy).

The last time I worked directly on a project with sequencing data was around 2005. Jump forward 5 years to the BioStar bioinformatics Q&A forum and you’ll find many questions related to sequencing. But not sequencing as I knew it. No, this is so-called next-generation sequencing, or NGS. Suddenly, I realised that I am no longer a sequencing expert. In fact:

I am a relic from the Sanger era

I resolved to improve this state of affairs. There is plenty of publicly-available NGS data, some of it relevant to my current work and my organisation is predicting a move away from microarrays and towards NGS in 2012. So I figured: what better way to educate myself than to blog about it as I go along?

This is part 1 of a 4-part series and in this installment, we’ll look at how to get hold of public NGS data.
Read the rest…

December 14, 2011

#arseniclife: the genome

It’s about one year since the science story dubbed #arseniclife hit the headlines. November 30th saw the release of a draft genome sequence for Halomonas sp. GFAJ-1, the bacterium behind the furore.

As Iddo pointed out on Twitter, sequencing the DNA from GFAJ-1 is itself strong evidence against arsenate in the DNA backbone, since the sequencing chemistry would be highly unlikely to work in that case. However, if like me you think that a new microbial genome provides the most fun to be had in bioinformatics [*], you’ll be excited by the availability of the data.

In this post then: where to get it, some very preliminary analysis and some things that you might like to to with it. Projects for your students, perhaps.

[*] note to self: why, then, am I working on colorectal cancer?
Read the rest…

December 2, 2011

A Friday round-up

Just a brief selection of items that caught my eye this week. Note that this is a Friday as opposed to Friday, lest you mistake this for a new, regular feature.

1. R/statistics

  • ggbio
  • A new Bioconductor package which builds on the excellent ggplot graphics library, for the visualization of biological data.

  • R development master class
  • Hadley Wickham recently presented this course on R package development for my organisation. I was on parental leave at the time, otherwise I would have attended for sure.

2. Bioinformatics in the media
DNA Sequencing Caught in Deluge of Data

I described this NYT article as a “surprisingly-good intro article“. Michael Eisen described it as “kind of silly“.

I think we’re both right. Michael’s perspective is that of an expert in high-throughput sequencing data; I’m just pleased to see an introduction to bioinformatics for non-specialists in a mainstream newspaper. And I note that they have corrected the figure caption which offended Michael.

As to the “deluge”: yes, there are other sciences that generate more data and yes, we probably don’t need to archive/analyse a lot of the raw data. However, I’d contend that the basic premise of the article is correct: we are sequencing faster than we can analyse. The solution, obviously, is more bioinformaticians.

November 24, 2011

Boring, monotonous day-to-day tasks? That’s synonymous with bioinformatics.

In response to this question, I can only point out that J.C.R. Licklider figured it out over 50 years ago:

Despite the fact that there is a voluminous literature on thinking and problem solving, including intensive case-history studies of the process of invention, I could find nothing comparable to a time-and-motion-study analysis of the mental work of a person engaged in a scientific or technical enterprise. In the spring and summer of 1957, therefore, I tried to keep track of what one moderately technical person actually did during the hours he regarded as devoted to work. Although I was aware of the inadequacy of the sampling, I served as my own subject.

It soon became apparent that the main thing I did was to keep records, and the project would have become an infinite regress if the keeping of records had been carried through in the detail envisaged in the initial plan. It was not. Nevertheless, I obtained a picture of my activities that gave me pause. Perhaps my spectrum is not typical–I hope it is not, but I fear it is.

About 85 per cent of my “thinking” time was spent getting into a position to think, to make a decision, to learn something I needed to know. Much more time went into finding or obtaining information than into digesting it. Hours went into the plotting of graphs, and other hours into instructing an assistant how to plot. When the graphs were finished, the relations were obvious at once, but the plotting had to be done in order to make them so. At one point, it was necessary to compare six experimental determinations of a function relating speech-intelligibility to speech-to-noise ratio. No two experimenters had used the same definition or measure of speech-to-noise ratio. Several hours of calculating were required to get the data into comparable form. When they were in comparable form, it took only a few seconds to determine what I needed to know.

Throughout the period I examined, in short, my “thinking” time was devoted mainly to activities that were essentially clerical or mechanical: searching, calculating, plotting, transforming, determining the logical or dynamic consequences of a set of assumptions or hypotheses, preparing the way for a decision or an insight. Moreover, my choices of what to attempt and what not to attempt were determined to an embarrassingly great extent by considerations of clerical feasibility, not intellectual capability.

September 8, 2011

Interacting with bioinformatics webservers using R

In an ideal world, all bioinformatics tools would be made available via the Web as a web service with an API, as well as a standalone package to download for local use. This is rarely the case and sometimes, even where one or the other is available, factors such as cost come into play. So we resort to web scraping; writing code to interact with the code that lies behind a web server so as to submit queries, retrieve and parse results.

Normally, I’d use something like Ruby’s Mechanize library for this purpose. However, where the purpose is to retrieve delimited data for analysis using R, I figured it was time to try and achieve the entire process within R. So here’s how I used the RCurl and XML packages to interact with the WHAT IF server, which provides tools for the analysis of protein structure.
Read the rest…

August 23, 2011

Popular topics at the BioStar Q&A site

Which topics are the most popular at the BioStar bioinformatics Q&A site?

One source of data is the tags used for questions. Tags are somewhat arbitrary of course, but fortunately BioStar has quite an active community, so “bad” tags are usually edited to improve them. Hint: if your question is “How to find SNPs”, then tagging it with “how, to, find, snps” won’t win you any admirers.

OK: we’re going to grab the tags then use a bunch of R packages (XML, wordcloud and ggplot2) to take a quick look.

Read the rest…

August 16, 2011

Monitoring PubMed retractions: updates

chart

PubMed cumulative retractions 1977-present

There’s been a recent flurry of interest in retractions. See for example: Scientific Retractions: A Growth Industry?; summarised also by GenomeWeb in Take That Back; articles in the WSJ and the Pharmalot blog; and academic articles in the Journal of Medical Ethics and Infection & Immunity.

Several of these sources cite data from my humble web application, PMRetract. So now seems like a good time to mention that:

  • The application is still going strong and is updated regularly
  • I’ve added a few enhancements to the UI; you can follow development at GitHub
  • I’ve also added a long-overdue about page with some extra information, including the fact that I wrote it :)

Now I just need to fix up my Git repositories. Currently there’s one which pushes to GitHub and a second, with a copy of the Sinatra code for pushing to Heroku, which isn’t too smart.

August 5, 2011

BioRuby development: feedback on using Git

Everyone likes constructive feedback. I received a couple of great comments on my previous post, which warrant a brief discussion.

@vlandham points out that when the main BioRuby repository updates, you’ll want to update your local repository. Using git, you do that by adding a remote which points to the original repository, from which you can fetch updates and merge with your local version:

git remote add upstream https://github.com/bioruby/bioruby.git
# fetch/merge only when main repo updates
git fetch upstream
git merge upstream master

This is described at the GitHub help page Fork A Repo.

Michael points to an article titled A successful Git branching model. It suggests that when developing new features you create a feature branch (also called topic branch). This can help with the management of new features and creates a more complete commit history if/when the new feature is merged back into your development repository. The article also suggests a main branch for development named develop, rather than the default master.

I haven’t quite got my head around all the ins-and-outs of the article yet, but it’s well worth a read.

Follow

Get every new post delivered to your Inbox.

Join 1,292 other followers