Algorithms running day and night

Warning: contains murky, somewhat unstructured thoughts on large-scale biological data analysis

Picture this. It’s based on a true story, with names and details altered.

Alice, a biomedical researcher, performs an experiment to determine how gene expression in cells from a particular tissue is altered when the cells are exposed to an organic compound, substance Y. She collates a list of the most differentially expressed genes and notes, in passing, that the expression of Gene X is much lower in the presence of substance Y.

Bob, a bioinformatician in the same organisation but in a different city to Alice, is analysing a public dataset. This experiment looks at gene expression in the same tissue but under different conditions: normal compared with a disease state, Z syndrome. He also notes that Gene X appears in his list – its expression is much higher in the diseased tissue.

Alice and Bob attend the annual meeting of their organisation, where they compare notes and realise the potential significance of substance Y in suppressing the expression of Gene X and so perhaps relieving the symptoms of Z syndrome. On hearing this, the head of the organisation, Charlie, marvels at the serendipitous nature of the discovery. Surely, he muses, given the amount of publicly available experimental data, there must be a way to automate this kind of discovery by somehow “cross-correlating” everything with everything else until patterns emerge. What we need, states Charlie, is:

Algorithms running day and night, crunching all of that data
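
To make Charlie's idea concrete, here is a minimal sketch of the comparison that Alice and Bob made by hand. Everything in it is an assumption for illustration: the gene names other than Gene X, the log2 fold-change values, the threshold, and the function name `opposing_genes` are all invented. It also assumes that both experiments have already been reduced to comparable gene lists.

```python
# Hypothetical log2 fold changes; all values are invented for illustration.
# Alice: cells exposed to substance Y versus unexposed cells.
alice = {"GeneX": -2.4, "GeneA": 1.8, "GeneB": -0.2}
# Bob: Z syndrome tissue versus normal tissue.
bob = {"GeneX": 3.1, "GeneA": 1.5, "GeneC": -2.0}


def opposing_genes(treatment, disease, threshold=1.0):
    """Return genes that change strongly in both experiments, in opposite directions."""
    hits = []
    # Only genes measured in both experiments can be compared.
    for gene in sorted(treatment.keys() & disease.keys()):
        t, d = treatment[gene], disease[gene]
        strong = abs(t) >= threshold and abs(d) >= threshold
        # A negative product means the two changes have opposite signs.
        if strong and t * d < 0:
            hits.append((gene, t, d))
    return hits


print(opposing_genes(alice, bob))  # [('GeneX', -2.4, 3.1)]
```

With these made-up numbers only Gene X survives: it is down with substance Y and up in Z syndrome. Gene A changes strongly in both lists, but in the same direction, so it is dropped.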

What’s Charlie missing?

Science in the petabyte era

Just a brief note: the title of this post is taken from the cover of today’s Nature. The issue contains several very good feature articles on the challenges of dealing with datasets of petabyte size and larger, grouped under the heading “Big data”.

Nature contents Sep 4 2008.
Nature News Big Data special.

By far the best of the articles is “The future of biocuration”: it offers practical recommendations, as opposed to the “gee whizz, what a lot of data” approach. Not least among them: “curators, researchers, academic institutions and funding agencies should, in the next ten years, increase the visibility and support of scientific curation as a professional career.”
Almost as good are “Wikiomics”, which tackles the problem of low participation, and “Welcome to the petacentre”, in which Boing Boing’s Cory Doctorow explores, amongst other places, the Sanger Institute data centre.

So far as I can tell from the Nature News link, these articles are freely available.