Warning: contains murky, somewhat unstructured thoughts on large-scale biological data analysis
Picture this. It’s based on a true story: names and details altered.
Alice, a biomedical researcher, performs an experiment to determine how gene expression in cells from a particular tissue is altered when the cells are exposed to an organic compound, substance Y. She collates a list of the most differentially expressed genes and notes, in passing, that the expression of Gene X is much lower in the presence of substance Y.
Bob, a bioinformatician in the same organisation but in a different city from Alice, is analysing a public dataset. That experiment looks at gene expression in the same tissue but under different conditions: normal tissue compared with a disease state, Z syndrome. He also notes that Gene X appears in his list – its expression is much higher in the diseased tissue.
Alice and Bob attend the annual meeting of their organisation, where they compare notes and realise the potential significance of substance Y: by suppressing the expression of Gene X, it might relieve the symptoms of Z syndrome. On hearing this, the head of the organisation, Charlie, marvels at the serendipitous nature of the discovery. Surely, he muses, given the amount of publicly available experimental data, there must be a way to automate this kind of discovery by somehow “cross-correlating” everything with everything else until patterns emerge. What we need, states Charlie, is:
Algorithms running day and night, crunching all of that data
What’s Charlie missing?
Put simply, what Charlie is missing is that right now, we can’t do this. Or at least, it’s so difficult and time-consuming as to be effectively impossible. Here are some of the reasons why:
Multiple identifiers, multiple names
If your experiment has identified a gene of interest, the next question is often: who else has seen that gene, and under what conditions? Ideally, you’d like to search using a single identifier as a query and retrieve “everything” related to that query.
Two problems here. First, different databases use different identifiers for the same object (or objects derived from that object, such as multiple transcripts). Second, the same object may have multiple names (synonyms) even in the same database.
Ideally, a data source knows about all these related terms and returns results for a query using any of them. In practice – it varies. Services such as BioMart and the UCSC genome/table browser are a step in the right direction, but most biological data providers lag well behind.
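Since most data sources won’t do this for you, the first piece of code you end up writing is a synonym resolver. Here’s a minimal sketch; the gene records and identifiers below are invented for illustration, and in practice the table would be built from a resource such as NCBI’s gene_info file:

```python
# Resolve any known name for a gene to one canonical identifier,
# so that downstream queries all use the same key.

GENE_RECORDS = [
    # (canonical_id, official_symbol, synonyms) -- illustrative values only
    ("ENSG000001", "GENEX", ["GX1", "geneX", "X-factor"]),
    ("ENSG000002", "GENEY", ["GY", "geneY"]),
]

def build_synonym_index(records):
    """Map every known name (case-insensitively) to its canonical ID."""
    index = {}
    for canonical_id, symbol, synonyms in records:
        for name in [canonical_id, symbol, *synonyms]:
            index[name.lower()] = canonical_id
    return index

def resolve(query, index):
    """Return the canonical ID for any identifier or synonym, or None."""
    return index.get(query.lower())

index = build_synonym_index(GENE_RECORDS)
print(resolve("geneX", index))     # ENSG000001
print(resolve("X-factor", index))  # ENSG000001
```

Trivial for two genes; the pain is that every analysis against every data source needs some version of this, against mapping tables that are themselves incomplete and inconsistent.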
Multiple databases for the same kind of data
Frequently, there are multiple databases for the same kind of data. Gene expression data, for example, can be found in both GEO and ArrayExpress. Should you use one, the other or both? If you choose both, be aware that the two resources have very little in common: you will have to write quite different code to retrieve, format and integrate the data from each.
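To make that concrete, here is a sketch of the normalisation layer you end up writing. The two record structures below are invented for illustration – neither matches the real GEO or ArrayExpress formats – but the shape of the problem is accurate: same measurement, different field names, different units:

```python
import math

# The same differential-expression result, as it might come back from
# two different resources (hypothetical field names and values).
geo_like = {"gene": "GENEX", "logFC": -2.1, "series": "GSE0001"}
ae_like = {"identifier": "GENEX", "fold_change": 0.23,
           "experiment": "E-MTAB-0001"}

def from_geo_like(rec):
    """Normalise a record that already reports a log2 fold change."""
    return {"gene": rec["gene"],
            "log2_fold_change": rec["logFC"],
            "source": rec["series"]}

def from_ae_like(rec):
    """Normalise a record that reports a linear fold change: convert to log2."""
    return {"gene": rec["identifier"],
            "log2_fold_change": math.log2(rec["fold_change"]),
            "source": rec["experiment"]}

# One normaliser per resource; only after this step can results be compared.
records = [from_geo_like(geo_like), from_ae_like(ae_like)]
```

Two resources means two normalisers; N resources means N of them, each of which breaks whenever the provider changes its format.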
Poor or non-existent APIs
Getting data out of biological databases and onto your machine(s) for computation is a big problem. Where APIs exist, they are often poorly designed, with unrealistic limits on how much can be retrieved per request. Or worse, they simply don’t work. Fetching large (multi-gigabyte or more) amounts of biological data “over the wire” via interchange formats is, at present, not a realistic option. Great for Twitter, not so useful for a chromosome.
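The practical consequence is that every retrieval script grows the same defensive scaffolding: pagination to work around per-request caps, plus retries for the times the service falls over. A sketch, with `fetch_page` standing in for a real HTTP call against a hypothetical endpoint:

```python
import time

def fetch_page(offset, limit):
    # Placeholder for an HTTP request (e.g. via urllib.request).
    # Simulates an API that caps results per request.
    data = list(range(250))  # pretend remote dataset of 250 records
    return data[offset:offset + limit]

def fetch_all(page_size=100, max_retries=3):
    """Page through the whole result set, retrying transient failures."""
    results, offset = [], 0
    while True:
        for attempt in range(max_retries):
            try:
                page = fetch_page(offset, page_size)
                break
            except OSError:
                time.sleep(2 ** attempt)  # exponential back-off, then retry
        else:
            raise RuntimeError("API kept failing at offset %d" % offset)
        if not page:  # empty page signals the end of the results
            return results
        results.extend(page)
        offset += len(page)

print(len(fetch_all()))  # 250
```

None of this is hard, but it is boilerplate that every researcher rewrites for every service, and it still does nothing about multi-gigabyte payloads.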
Which leads nicely into…
…Too much data to compute locally, no option to compute remotely
In the past, a common approach to obtaining public data was to download it from an FTP site. Files might be available in a standard format (e.g. CSV), which is relatively easy to parse and process; otherwise, the researcher had to invest time in writing parsers. It was not uncommon for institutions to mirror entire FTP sites.
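This is why standard formats matter so much: parsing a CSV file takes a few lines of standard-library code, while a bespoke format means writing and maintaining your own parser. A sketch, with invented column names and values:

```python
import csv
import io

# A tiny expression table in CSV, as it might arrive from an FTP download
# (columns and values are invented for illustration).
csv_data = """gene,condition,expression
GENEX,control,8.1
GENEX,substance_Y,2.3
"""

# DictReader gives one dict per row, keyed by the header line --
# no custom parsing code required.
rows = list(csv.DictReader(io.StringIO(csv_data)))
print(rows[0]["gene"], rows[0]["expression"])  # GENEX 8.1
```

Swap the CSV for a provider’s home-grown flat-file format and those three lines become a parser module of their own.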
Today, as archives accumulate gigabytes or terabytes of data, this is no longer practicable. Download times, even over the fastest internet connections, run to many hours or even days – by which time the data at the remote end may well have been updated. What we really need is to run the computation at the remote end, where the data are. This is the thinking behind cloud computing services such as Amazon EC2. There’s some progress, but someone still has to put the data and the tools in the right place.
There is, in theory, a solution to at least some of these problems. Some people call it the Linked Data Web. In this scenario, all we need to get started is a specific identifier for an item of biological data and the Web will find all of the connections for us. That won’t solve the problem of which computational procedures to run on the returned data, or how to deploy those procedures, but it’s a good start.
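The core idea is easy to sketch: represent data as (subject, predicate, object) triples, and then finding “all of the connections” for an identifier is just a graph walk. The triples and predicate names below are invented, echoing the Alice-and-Bob story; a real Linked Data Web would use RDF and SPARQL rather than Python tuples:

```python
# Toy linked-data store: each fact is a (subject, predicate, object) triple.
TRIPLES = [
    ("GeneX", "expressed_in", "TissueT"),
    ("GeneX", "downregulated_by", "SubstanceY"),
    ("GeneX", "upregulated_in", "ZSyndrome"),
    ("SubstanceY", "is_a", "OrganicCompound"),
]

def connections(entity, triples, depth=2):
    """Breadth-first walk over triples, starting from one identifier."""
    seen, frontier = {entity}, {entity}
    for _ in range(depth):
        nxt = set()
        for s, p, o in triples:
            if s in frontier and o not in seen:
                nxt.add(o)
            if o in frontier and s not in seen:
                nxt.add(s)
        seen |= nxt
        frontier = nxt
    return seen - {entity}

# Starting from GeneX, two hops reach everything in this toy graph --
# including OrganicCompound, via SubstanceY.
print(sorted(connections("GeneX", TRIPLES)))
```

In this toy world, Alice and Bob’s “serendipitous” link between substance Y and Z syndrome falls out of a two-hop query. The hard part, as the next paragraph argues, is that the triples don’t exist yet.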
The main problem with this solution: the Linked Data Web does not yet exist. We – meaning data providers – have to create it. And it’s just not clear to me how we go about reformatting and curating all of the public biological data that we already have, let alone the data we’re generating right now or will generate in the future. Particularly when most researchers are obsessed with generating data, yet have no interest in how to manage it.