Warning: contains murky, somewhat unstructured thoughts on large-scale biological data analysis
Picture this. It’s based on a true story, with names and details altered.
Alice, a biomedical researcher, performs an experiment to determine how gene expression in cells from a particular tissue is altered when the cells are exposed to an organic compound, substance Y. She collates a list of the most differentially expressed genes and notes, in passing, that the expression of Gene X is much lower in the presence of substance Y.
Bob, a bioinformatician in the same organisation but in a different city from Alice, is analysing a public dataset from an experiment that looks at gene expression in the same tissue but under different conditions: normal tissue compared with a disease state, Z syndrome. He also notes that Gene X appears in his list – its expression is much higher in the diseased tissue.
Alice and Bob attend the annual meeting of their organisation, where they compare notes and realise the potential significance of substance Y: by suppressing the expression of Gene X, it might relieve the symptoms of Z syndrome. On hearing this, the head of the organisation, Charlie, marvels at the serendipitous nature of the discovery. Surely, he muses, given the amount of publicly available experimental data, there must be a way to automate this kind of discovery by somehow “cross-correlating” everything with everything else until patterns emerge. What we need, states Charlie, is:
Algorithms running day and night, crunching all of that data
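Taken literally, here is a minimal sketch of what Charlie seems to imagine, assuming each experiment boils down to a table of per-gene log fold-changes; the gene names, values and threshold below are invented purely for illustration:

```python
# A toy version of Charlie's "cross-correlate everything" idea:
# given per-gene log2 fold-changes from two experiments, find genes
# that change strongly, and in opposite directions, in both.
# All names and numbers here are hypothetical.

alice = {  # tissue + substance Y vs. tissue alone
    "GeneX": -3.2, "GeneA": 0.4, "GeneB": -0.1, "GeneC": 1.8,
}
bob = {    # Z syndrome tissue vs. normal tissue
    "GeneX": 2.9, "GeneA": 0.2, "GeneB": 1.5, "GeneD": -0.7,
}

THRESHOLD = 1.0  # arbitrary cut-off for "much higher/lower" expression

# Genes measured in both experiments whose changes point in
# opposite directions and exceed the threshold in each.
hits = [
    gene
    for gene in alice.keys() & bob.keys()
    if alice[gene] * bob[gene] < 0
    and abs(alice[gene]) > THRESHOLD
    and abs(bob[gene]) > THRESHOLD
]
print(hits)  # ['GeneX'] -- the Alice/Bob observation, automated
```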
What’s Charlie missing?