In my opinion, yes. Let me elaborate.
My current job is very much focused on “data integration”. What this means is that we have a large amount of diverse data from different “-omics” experiments: microarrays, protein mass spectrometry, DNA sequencing – really, whatever you like, but it’s all aimed at answering the same question. Namely: which of these biological entities (transcripts, proteins, metabolites) are markers for various human disease states?
Somehow, we have to pull all of these data into a common framework so that they can be analysed using statistics. The problem: whilst a lot of effort has gone into designing schemas and ontologies that describe the individual data types, less effort has been applied to the question: what do all these things have in common?
Let’s think for a moment about what that means, practically. If I want to integrate sequence data (say, chromosomes, transcripts and microarray probes), I might go to the UCSC genome database or to Ensembl, from where I could fetch XML via DAS, parse it and store the data in a local database. I might then go to GEO or ArrayExpress, fetch microarray assay data and add that to my local store. Similarly, there are XML formats for proteomics, and ontologies such as the Sequence Ontology, which I might incorporate.
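To make that concrete, here’s a minimal sketch in Python of the kind of DAS fetch-and-parse step I mean. The server URL, genome build and segment are illustrative only; a real pipeline would need error handling and a proper data store.

```python
# Minimal sketch: fetch features from a DAS server and parse the XML.
# The URL, build and segment below are illustrative, not a recommendation.
import urllib.request
import xml.etree.ElementTree as ET

DAS_URL = ("http://genome.ucsc.edu/cgi-bin/das/hg19/features"
           "?segment=chr1:100000,200000")

with urllib.request.urlopen(DAS_URL) as response:
    tree = ET.parse(response)

# DAS feature responses wrap FEATURE elements inside SEGMENT elements;
# each FEATURE carries its coordinates and type as child elements
for feature in tree.iter("FEATURE"):
    label = feature.get("label") or feature.get("id")
    start = feature.findtext("START")
    end = feature.findtext("END")
    ftype = feature.findtext("TYPE")
    print(label, ftype, start, end)
```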
That’s a lot of work and a lot of things that I have to know. I have to understand each schema/ontology, write parsers to process them and figure out what the relationships in my local datastore should look like. All that before I even fetch local records, pipe them into local code and generate useful analyses. It’s all too easy to become bogged down in the details of things like database table design and parser code.
I’ve recently had either an epiphany or a trivial, dumb idea. It’s this: data integration is so problematic because we have made life too complicated. We’ve become so wrapped up in the minutiae of individual types of biological data that we’ve forgotten how it all fits together – or the point of collecting it all in the first place.
So here’s my idea. In order to integrate data, we need to devise a really simple description of an experiment – and it should apply to almost any experiment. My current model is called the feature-probe-value approach. It goes like this:
- The feature is the biological entity that we want to measure
- The probe is an entity that maps to the feature and measures it indirectly
- The value is the measurement reported by the probe
A microarray is an obvious example: feature = transcript, probe = labelled oligonucleotide, value = intensity. It works for proteomics too: feature = protein, probe = peptide, value = m/z. In fact, a lot of biological data involves the indirect measurement of a “real-world” entity (transcript, protein, metabolite) using a probe, which returns a value.
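To illustrate, here’s one way the model might look as a relational schema, sketched in Python with SQLite. The table and column names are my own invention for this post, not a finished design.

```python
# Sketch of the feature-probe-value model as three SQLite tables.
# Names and columns are illustrative; a real design would need more
# (cross-references, experiment metadata, probe-to-feature mapping quality).
import sqlite3

conn = sqlite3.connect("fpv.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS feature (
    id   INTEGER PRIMARY KEY,
    type TEXT NOT NULL,          -- e.g. 'transcript', 'protein', 'metabolite'
    name TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS probe (
    id         INTEGER PRIMARY KEY,
    feature_id INTEGER NOT NULL REFERENCES feature(id),
    type       TEXT NOT NULL,    -- e.g. 'oligonucleotide', 'peptide'
    name       TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS measurement (
    id        INTEGER PRIMARY KEY,
    probe_id  INTEGER NOT NULL REFERENCES probe(id),
    condition TEXT NOT NULL,     -- the experimental condition or assay
    value     REAL NOT NULL      -- intensity, m/z, count, ...
);
""")
conn.commit()
```

The appeal, for me, is how little there is: three tables cover microarrays, proteomics and anything else that fits the probe metaphor.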
My hope is that storing data using this simple model should make it much easier to mine and analyse. For example, if the level of transcript X shows an interesting or significant change under condition A, you want to be able to ask: how about condition B? Or C? Or any other experiment involving that transcript (= feature)? Or experiments involving the product of that transcript? It should be possible to retrieve values across related features and rapidly assess whether they go up, go down or are unchanged under a given set of conditions.
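Assuming the hypothetical schema sketched above, that kind of question becomes a single three-table join; the feature name here is just an example.

```python
# Given a feature name, pull every measurement for it across all
# conditions in the local store (schema as in the earlier sketch).
import sqlite3

conn = sqlite3.connect("fpv.db")
query = """
SELECT f.name, m.condition, m.value
FROM   feature f
JOIN   probe p       ON p.feature_id = f.id
JOIN   measurement m ON m.probe_id   = p.id
WHERE  f.name = ?
ORDER  BY m.condition
"""
for name, condition, value in conn.execute(query, ("BRCA1",)):
    print(name, condition, value)
```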
My next challenge is to implement some of this and see how it works out. I’ll keep you posted.
Update: a lively and enjoyable FriendFeed thread ensues