Has our quest for completeness made things too complicated?

In my opinion, yes. Let me elaborate.

My current job is very much focused on “data integration”. What this means is that we have a large amount of diverse data from different “-omics” experiments: microarrays, protein mass spectrometry, DNA sequencing – really, whatever you like, but it’s all aimed at answering the same question. Namely: which of these biological entities (transcripts, proteins, metabolites) are markers for various human disease states?

Somehow, we have to pull all of these data into a common framework so that they can be analysed using statistics. The problem: whilst a lot of effort has gone into designing schemas and ontologies that describe the individual data types, less effort has been applied to the question: what do all these things have in common?

Let’s think for a moment about what that means, practically. If I want to integrate sequence data – say, chromosomes, transcripts and microarray probes – I might go to the UCSC genome database or to Ensembl, from where I could fetch XML via DAS, parse it and store the data in a local database. I might then go to GEO or ArrayExpress, fetch microarray assay data and add that to my local store. Similarly, there are XML formats for proteomics, and ontologies such as the Sequence Ontology, that I might incorporate.

That’s a lot of work and a lot of things that I have to know. I have to understand each schema/ontology, write parsers to process them and figure out what the relationships in my local datastore should look like. All that before I even fetch local records, pipe them into local code and generate useful analyses. It’s all too easy to become bogged down in the details of things like database table design and parser code.

I’ve recently had either an epiphany or a trivial, dumb idea. Which is this: data integration is so problematic because we have made life too complicated. We’ve become so wrapped up in the minutiae of individual types of biological data that we’ve forgotten how it all fits together – or the point of collecting it all in the first place.

So here’s my idea. In order to integrate data, we need to devise a really simple description of an experiment – and it should apply to almost any experiment. My current model is called the feature-probe-value approach. It goes like this:

  • The feature is the biological entity that we want to measure
  • The probe is an entity that maps to the feature and measures it indirectly
  • The value is the measurement reported by the probe

A microarray is an obvious example: feature = transcript, probe = labelled oligonucleotide, value = intensity. It works for proteomics too: feature = protein, probe = peptide, value = m/z. In fact, a lot of biological data involves the indirect measurement of a “real-world” entity: transcript, protein, metabolite – using a probe, which returns a value.
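
To make that concrete, here’s a minimal sketch of what a record might look like under this model. It’s only an illustration (in Python); the class and field names are mine, not part of any existing schema, and I’ve added a condition field because conditions are what you ultimately want to compare across.

```python
from dataclasses import dataclass

@dataclass
class Measurement:
    """One observation in the feature-probe-value model (illustrative only)."""
    feature: str    # the biological entity we want to measure, e.g. a transcript ID
    probe: str      # the entity that maps to the feature and measures it indirectly
    value: float    # the measurement reported by the probe
    condition: str  # the experimental condition under which it was measured

# a microarray observation and a proteomics observation share the same shape
records = [
    Measurement(feature="transcript_X", probe="oligo_1234", value=7.2, condition="A"),
    Measurement(feature="protein_Y", probe="peptide_56", value=842.5, condition="A"),
]
```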

My hope is that storing data using this simple model should make it much easier to mine and analyse. For example, if the level of transcript X shows an interesting or significant change under condition A, you want to be able to ask: how about condition B? Or C? Or any other experiment involving that transcript (= feature)? Or experiments involving the product of that transcript? It should be possible to retrieve values across related features and assess them rapidly: they go up, or down, or are unchanged, given these conditions.
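
Continuing the sketch above (still purely illustrative), that kind of query becomes a one-liner once everything is stored in the same shape:

```python
def values_for_feature(records, feature):
    """All (condition, value) pairs reported for a given feature,
    regardless of which probe or platform produced them."""
    return [(r.condition, r.value) for r in records if r.feature == feature]

# compare transcript_X across every experiment that measured it
print(values_for_feature(records, "transcript_X"))
```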

My next challenge is to implement some of this and see how it works out. I’ll keep you posted.

Update: a lively and enjoyable FriendFeed thread ensues

9 thoughts on “Has our quest for completeness made things too complicated?”

  1. Joel Dudley

    This is pretty much how I set up my integrative genomics pipeline at Stanford. It’s much better to integrate on simple features across as many data types as possible than it is to create the “perfect” data integration system using complex ontologies, automated reasoning, or similar approaches, if you ask me (not that such approaches aren’t useful in certain cases).

  2. Greg Tyrelle

    I’m sure you’re aware that you’ve just described a model using *triples*. Which means you could start storing these kinds of simple relationships in a triple store like Virtuoso. I would like to see some online collaboration around these ideas. It would be nice to have a set of related, publicly accessible datasets for people to work with. For example, a reference data set from a microarray (GEO), proteomics or sequencing experiment, etc., all related to some kind of disease.
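
    For example, a single feature-probe-value observation flattens naturally into subject-predicate-object triples (a rough sketch, with invented URIs):

    ```python
    # one hypothetical observation flattened into (subject, predicate, object)
    # triples; all URIs here are invented for illustration
    obs = "http://example.org/obs/1"

    triples = [
        (obs, "http://example.org/terms/feature", "http://example.org/transcript/X"),
        (obs, "http://example.org/terms/probe", "http://example.org/oligo/1234"),
        (obs, "http://example.org/terms/value", "7.2"),
    ]

    for s, p, o in triples:
        print(s, p, o)
    ```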

    I’m also interested in seeing a simple web interface built so more relationships can be added in an ad-hoc fashion. Anyway, I’m interested to know how your ideas evolve as you experiment with data integration using this model. Data integration still remains a critical problem for doing this kind of biomarker research.

    1. nsaunders Post author

      Yes, a couple of observers have pointed out that this looks like RDF triples. I agree – but that was not really my intention. I’ve just highlighted what I think are the 3 key terms/fields/primary IDs/keys – whatever – that bind experimental data together.

      What I’m getting at is that I’d like to see people thinking more about the nature of biological data relationships and less about the technical implementations.

  3. Cloud

    I agree with your general point about the need to think more generally about data relationships.

    Your proposed data model strikes me as a specialized case of the general entity-attribute-value (EAV) data model. If you’re so inclined, you may want to check out the literature on that model for ideas about possible issues you might encounter in the implementation.
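
    A rough sketch of what a generic EAV-style table looks like, with invented table and column names:

    ```python
    import sqlite3

    # a generic entity-attribute-value table: one row per observation,
    # rather than one column per measurement type; names are illustrative
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE observation (
            entity    TEXT,  -- the feature being measured, e.g. 'transcript_X'
            attribute TEXT,  -- what was measured and under which condition
            value     REAL   -- the reported number
        )
    """)
    conn.execute("INSERT INTO observation VALUES ('transcript_X', 'intensity:condition_A', 7.2)")
    print(conn.execute("SELECT * FROM observation WHERE entity = 'transcript_X'").fetchall())
    ```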

    1. Eric Milgram

      I was going to suggest the EAV design pattern, which is what came to mind when I read the post, so I’m glad that I read through the comments first.

      Although data sets have grown ever larger, computing power and disk storage capacities have grown too. Even just several years ago, a generic EAV-type schema would not have been practical.

      When designing an informatics system, you always have to make tradeoffs between flexibility and performance, weighing what you are doing today against what you think you’ll be doing in the future.

      The level of computing power available to the average person today is just mind-boggling (to me anyway). For example, a few months back, I built a complete home server with a 3 GHz quad core CPU, 16 GB RAM, and 8 TB RAID for less than $2,000, which included everything (e.g. monitor, case, wireless mouse/KB, etc.). Putting my home system in the context of an R&D budget, you can build a smoking system.

      From my view, as informatics in most organizations has become its own monster, there’s a growing divide between the researchers who need informatics, the IT group that’s responsible for the hardware, and the informatics folks themselves.

  4. Chris Lasher

    Neil, I thought the post was interesting. I’d like to point out that the issue of inconsistent identification of biological molecules will still frustrate the questions worth asking, e.g., “[Can you give me] any other experiment involving that transcript… Or experiments involving the product of that transcript?” When one must map GenBank IDs to Affymetrix IDs mapped to UniProt IDs mapped to human orthologs identified by HUGO IDs, one still has a serious obstacle, orthogonal to data representation, to overcome.

    1. nsaunders Post author

      Absolutely – mapping IDs is always going to be an issue. I guess the best approach is to get the relevant mappings into a local database. A significant amount of fetching and parsing, but hopefully a one-off job.
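
      (A toy sketch of the kind of local mapping meant here; every identifier below is invented.)

      ```python
      # toy local mapping from platform-specific IDs to a common feature key;
      # all identifiers are invented for illustration
      id_map = {
          "AFFX_FAKE_0001": "feature_X",  # array probe set -> feature
          "NM_FAKE_0001": "feature_X",    # transcript accession -> feature
          "P_FAKE_0001": "feature_X",     # protein accession -> feature
      }

      def to_feature(platform_id):
          """Resolve a platform-specific ID to the common feature key, if known."""
          return id_map.get(platform_id)

      print(to_feature("AFFX_FAKE_0001"))
      ```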

  5. Maximilian

    well, in situ images don’t really fit your pattern… but they are not really data anyways…

    1. nsaunders Post author

      We had some discussion in the FriendFeed thread about what constitutes raw data and how that should be captured. I’d agree that images are not data in the sense of “important numbers that I want to manipulate”. But I think they can be made to fit this pattern. You have to ask “what features in the image are relevant to my study?” And then you have: feature, reporter (e.g. intensity), value. I really believe that different data types have more commonality than is often realised, once you ask the question “why were these data collected?”
