Data-driven research

Conferences are good in that they get you thinking about research. Today, I was dwelling on a phrase that’s been going around a lot lately: “data-driven science versus hypothesis-driven science”.

My understanding of it goes like this. In the pre-bioinformatics days, you gathered discrete pieces of experimental evidence over long time periods. For example, you might be studying protein X in order to figure out its function. You might do a biochemical assay for protein X, or knock out the gene encoding protein X and observe the results. These allow you to form a hypothesis – “I think that protein X is involved with process Y”. More experiments confirm or refute this idea. Along the way, you start noticing that where there is protein X, there’s also protein Z. More hypotheses: “protein X binds protein Z”, “protein Z is a substrate for protein X” and so on. More experiments.

Nowadays, we have a large amount of information stored in freely accessible databases: sequence, protein domains, subcellular localisation, GO terms and so on. In effect, the databases contain implicit predictions and hypotheses. For instance, knowing that two proteins occur in the same compartment and interact in one organism, we’d predict the same in another organism given the same two proteins and compartment.

The problem is that these predictions are not formalised. They just occur to us as we’re browsing through the databases. Many of them are just hunches with little experimental evidence, but they could be confirmed or refuted by a simple experiment. For instance, I’m convinced that I’ve found a key enzyme for biogenesis of a methanogenic enzyme, but I’ll never know until someone does the experiment. So here are my ideas:

  • An online repository of data-driven hypotheses. If you’ve observed something interesting in a database, you announce it to the world (thus staking your claim) and interested collaborators, with the means and desire to confirm your idea experimentally, can contact you and arrange collaboration.
  • A semi-automated way of making biologically-interesting predictions from databases. I’m not sure how this would work – some text mining might be involved as well as a lot of “prior knowledge” about biological systems. I guess the simplest case would be “prediction by analogy” – knowing that organism A has process B that uses components X, Y and Z, you’d predict process B in organism C if that also has components X, Y and Z.
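To make the “prediction by analogy” case a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the function name, the data structures and the placeholder organisms, process and components are made up, and a real system would need proper orthology mapping rather than simple name matching.

```python
from typing import Dict, List, Set, Tuple


def predict_by_analogy(
    process_components: Dict[str, Set[str]],
    organism_components: Dict[str, Set[str]],
    already_known: Set[Tuple[str, str]],
) -> List[Tuple[str, str]]:
    """Predict (organism, process) pairs where the organism has every
    component that the process is known to use elsewhere.

    All three inputs are hypothetical structures, not a real database schema.
    """
    predictions = []
    for process, required in process_components.items():
        if not required:
            continue  # a process with no listed components predicts nothing
        for organism, present in organism_components.items():
            if (organism, process) in already_known:
                continue  # not a new prediction
            if required <= present:  # every required component is present
                predictions.append((organism, process))
    return sorted(predictions)


# Placeholder data mirroring the example in the text.
process_components = {"process B": {"X", "Y", "Z"}}
organism_components = {
    "organism A": {"W", "X", "Y", "Z"},
    "organism C": {"X", "Y", "Z"},
    "organism D": {"X", "Y"},  # lacks Z, so no prediction
}
already_known = {("organism A", "process B")}

print(predict_by_analogy(process_components, organism_components, already_known))
# [('organism C', 'process B')]
```

Each pair that comes out is only a hypothesis to test at the bench, not a result.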

To my knowledge, neither of these ideas has been put into practice yet – correct me if I’m wrong.

2 thoughts on “Data-driven research”

  1. I think that these ideas are being put into practice on some science blogs and wikis. In the bio world, Rosie Redfield’s blog comes to mind as one where the author is explicit about her ideas and hypotheses.

    However, for the ability to stake your claim, I think that hosted wikis are best because, among other things, they offer third party time stamps.

    I also agree with you that automation and semi-automation are going to lead to some very interesting ways to move science forward. We are trying to implement this in chemistry with the UsefulChem project. I just posted a brief screencast of a talk I gave this week on that topic that might be of interest. There is also a reference on one of the first few slides to Ross King’s Robot Scientist, which is capable of generating hypotheses and executing experiments on the yeast genome.

  2. There are functional prediction servers like STRING and bioPIXIE that give you the probability that two proteins are functionally related. The predictions come from the statistical integration of several different sources of data. They do it by weighting each data source on its ability to predict functional association and then integrating them all into one probabilistic score. I guess this is what comes closest to semi-automatic biological predictions. I agree this is a good way forward. Now they have to come up with more specific predictions: the two cellular components are functionally related in what way? That is a matter of getting good predictors for the different hypotheses that could be interesting: binding, post-translational modification, localization, etc.
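As a rough illustration of the weighting-and-integration idea in the comment above, here is a small Python sketch. It uses a weighted “noisy-OR” combination, which assumes the evidence sources are independent. This is an assumption made for illustration and not necessarily how STRING or bioPIXIE compute their scores; the source names, weights and probabilities are invented.

```python
from typing import Dict


def combined_score(source_probs: Dict[str, float], weights: Dict[str, float]) -> float:
    """Combine per-source association probabilities into one score.

    Each probability is scaled by a reliability weight in [0, 1], then the
    sources are merged as 1 - product(1 - w * p). This treats the sources
    as independent, which is a simplifying assumption.
    """
    remaining = 1.0  # probability that no source supports the association
    for source, p in source_probs.items():
        w = weights.get(source, 0.0)  # sources without a weight are ignored
        if not (0.0 <= p <= 1.0 and 0.0 <= w <= 1.0):
            raise ValueError(f"probability and weight must be in [0, 1]: {source}")
        remaining *= 1.0 - w * p
    return 1.0 - remaining


# Invented evidence for one hypothetical protein pair.
evidence = {"coexpression": 0.5, "text_mining": 0.4}
reliability = {"coexpression": 1.0, "text_mining": 0.5}

print(round(combined_score(evidence, reliability), 3))
# 0.6
```

A score like this only says that two proteins are related, not how. The more specific predictions asked for above would need one predictor of this kind per hypothesis type.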
