Conferences are good in that they get you thinking about research. Today, I was dwelling on a phrase that’s been going around a lot lately: “data-driven science versus hypothesis-driven science”.
My understanding of it goes like this. In the pre-bioinformatics days, you gathered discrete pieces of experimental evidence over long time periods. For example, you might be studying protein X in order to figure out its function. You might do a biochemical assay for protein X, or knock out the gene encoding protein X and observe the results. These allow you to form a hypothesis – “I think that protein X is involved with process Y”. More experiments confirm or refute this idea. Along the way, you start noticing that where there is protein X, there’s also protein Z. More hypotheses: “protein X binds protein Z”, “protein Z is a substrate for protein X” and so on. More experiments.
Nowadays, we have a large amount of information stored in freely-accessible databases: sequence, protein domains, subcellular localisation, GO terms and so on. In effect, the databases contain implicit predictions and hypotheses. For instance, knowing that two proteins occur in the same compartment and interact in one organism, we’d predict the same interaction in another organism that has the same two proteins in the same compartment.
The problem is that these predictions are not formalised. They just occur to us as we’re browsing through the databases. Many of them are just hunches with little experimental evidence, but they could be confirmed or refuted by a simple experiment. For instance, I’m convinced that I’ve found a key enzyme for biogenesis of a methanogenic enzyme, but I’ll never know until someone does the experiment. So here are my ideas:
- An online repository of data-driven hypotheses. If you’ve observed something interesting in a database, you announce it to the world (thus staking your claim) and interested collaborators, with the means and desire to confirm your idea experimentally, can contact you and arrange collaboration.
- A semi-automated way of making biologically-interesting predictions from databases. I’m not sure how this would work – some text mining might be involved as well as a lot of “prior knowledge” about biological systems. I guess the simplest case would be “prediction by analogy” – knowing that organism A has process B that uses components X, Y and Z, you’d predict process B in organism C if that also has components X, Y and Z.
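To make the “prediction by analogy” case concrete, here is a minimal Python sketch. It assumes we already have, for each organism, the set of components it contains, plus some prior knowledge of which components a process uses. Everything here is hypothetical: the toy data, the dictionary layout and the function name `predict_by_analogy` are invented for illustration and don’t correspond to any real database or tool.

```python
# Hypothetical toy data: organism -> set of components found in it.
components = {
    "organism_A": {"X", "Y", "Z", "W"},
    "organism_C": {"X", "Y", "Z"},
    "organism_D": {"X", "Y"},
}

# Hypothetical prior knowledge: process -> components it is known to use.
process_requirements = {"process_B": {"X", "Y", "Z"}}

# Organisms in which each process has already been observed.
known = {"process_B": {"organism_A"}}


def predict_by_analogy(components, process_requirements, known):
    """Return (organism, process) pairs that are predicted but not yet observed."""
    predictions = []
    for process, required in sorted(process_requirements.items()):
        for organism, present in sorted(components.items()):
            if organism in known.get(process, set()):
                continue  # already observed, so there is nothing to predict
            if required <= present:  # every required component is present
                predictions.append((organism, process))
    return predictions


print(predict_by_analogy(components, process_requirements, known))
```

With this toy data the only prediction is process B in organism C; organism D is skipped because it lacks component Z. A real version would need much fuzzier matching (orthologues rather than identical names, for a start), which is where the “prior knowledge” and text mining would come in.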
To my knowledge, neither of these ideas has been put into practice yet – correct me if I’m wrong.