Big data: shoot first, ask questions later

The terms “big science” and “big data” have recently become quite prominent on the Web. For commentary, I point you to the man with the tag.

There are those who believe that big data means fundamental change in how science is done. We’ll take all this data, make it machine-readable, put it in the cloud and – poof! – science will emerge. Almost as if it were self-aware. At the other extreme are those who see no fundamental difference in how we go about our business – there’ll just be “more” of it.

One analysis, of course, is that they’re both right and they’re both wrong.

There’s a word that I expected to see much more frequently than I did as the arguments flew back and forth. That word is questions. Science is, fundamentally, the business of asking questions. When we don’t know very much, we ask basic questions: why is the sky blue? As we learn more, questions get more specific. Knowing that cells divide we ask: how do they know when to start? And stop? And what happens when those signals go wrong? Pretty soon we’re asking extremely specific questions, such as “what are the mechanisms of E2-mediated down-regulation of the BTG2 gene?” Is it the great irony of our age that as the data get bigger, our questions get smaller? I digress…

Data, no matter how “big”, without questions are inert. They just sit there. Great science arises out of smart questions. The difference with big data is that (1) we can think up questions that might once have been thought impractical, such as “how does the expression of every gene in my organism alter under these conditions?”, and (2) we need smart ways to ask and answer the questions – meaning technology and computation.

Hence the title of this post, which I think I’d summarise like this:

We used to ask questions, then generate the data.
Now we generate the data, then think of the questions.

13 thoughts on “Big data: shoot first, ask questions later”

  1. Neil

    Wonderfully put, and therein lies the challenge. In a lot of my slides I put a big “?” in the middle of the slide, because in the end that’s the goal: asking and answering questions. I still think we need to ask what data we need to answer the questions. The difference is that the data probably exists somewhere.

  2. Neil, I’ve been thinking about the same thing recently and had a blog post brewing about top-down vs. bottom-up science – exactly your summary. The key thing, which you mention, is that data is not good or bad in and of itself, it’s what you do with it. When you can use it to ask fundamentally different types of questions, it gets really interesting. And that’s what people are doing now. Like Deepak said, being able to then find or generate additional data to answer those questions is still what’s going to end up adding to our knowledge of the world. But I think people who decry big data are missing the point.

  3. Your first main point seems to be that somehow ‘big data’ allows us to ask new questions, but I don’t see how this is especially novel: technological advances have always allowed new questions to be asked (telescope, microscope…). Big data sets and fast computers are neat technological advances, but I’ve yet to see a convincing argument that they have changed the philosophy of science. And I certainly don’t see how the questions we ask now are different in some ‘fundamental’ way.

    And I don’t think the idea that in the past questions used to precede data collection is accurate. Brahe built his astronomical data-set before Kepler started asking the questions that led him to formulate his laws. Folks were keeping weather records before ENSO was discovered – it was the fact that the records were kept that allowed people to notice ENSO ‘teleconnections’ in the first place.

    So, what’s changed exactly?

  4. “I don’t see how this is especially novel”
    Right. I didn’t say it was novel.

    “I certainly don’t see how the questions we ask now are different in some ‘fundamental’ way”

    Right again. My point was that it’s questions that are important, not that their nature has changed.

    “I don’t think the idea that in the past questions used to precede data collection is accurate”

    You quote a couple of specific examples to the contrary. I have no idea what the proportion of “questions before data” is compared with “questions after data” over history. I’d contend that most science still begins with a specific query, then collects data to answer it.

    “So, what’s changed exactly?”
    Everything and nothing, depending on who you read. And note that this discussion is with respect to life sciences, where many people are still getting their head around what big data means.

  5. It’s not just the quantity of data. In the life sciences, big data is a combination of volume, diversity, complexity, and novelty. New technologies = new types of data, with more throughput than ever before. But as good scientists you still need to be able to ask questions. So you have two challenges: the first is adjusting your thinking to be able to ask questions of complex data (for which you might not have the right tools), and the second is a data management problem. How do you manage your data, so you can ask the right questions?

  6. OK, maybe things are different in the life sciences. But in the atmospheric sciences (and in the bits of life science that we interact with, like ocean biogeochemical modelling) big data and big computers are old news. The collection, management and curation of oceanographic and atmospheric data is an industry in itself. So, these big datasets exist, and then we think of a question, and go and get, or look at, the part of the data we need to answer our question. I guess I’ve been sensitive about the ‘big data’ thing ever since I read Anderson’s article in Wired.

  7. Pingback: Paying attention to simulation : business|bytes|genes|molecules

  8. Pingback: Michael Nielsen » Biweekly links for 02/02/2009

  9. Liked by 19 people: Stephen Anthony, Bill Hooker, Pawel Szczesny, Mr. Gunn, Yann Abraham, Chris Lasher, Deepak, Fitzgerald Steele, Michael Nielsen, D0r0th34, Richard Akerman, Jan Aerts, Simon Cockell, Jonathan Eisen, Allyson Lister, Duncan, Cameron Neylon, Andrew Perry and Lars Juhl Jensen

    Andrew Perry said:
    Nice post Neil, helps distill some of the ideas nicely for a general audience. I’m a big fan of ‘implement first, see what happens, refine, repeat’. If large datasets are organised and made freely accessible, people will certainly use them to ask crazy new questions that no one ever thought of.

    Lars Juhl Jensen said:
    One of the best posts I’ve read in a long time!

    Cameron Neylon said:
    Neil – you really need to stop capturing the ideas that I am completely failing to get my head around into witty posts with insightful soundbites.

    Deepak said:
    What they all said

    D0r0th34 said:
    I’m watching Andrew Perry’s passive voice with interest. :) The eternal question: who bells cat?

    Christina Pikas said:
    Is it easier, though, to take and keep data you don’t need, and to do mathematical or computational manipulations just because you can, and not because they provide information about the world? And does this matter?

    Deepak said:
    The point is that you still need to ask questions, and different people will ask different questions from the same data. Plus you need to figure out which data you want to use or what you want to bring together. All that really doesn’t change. “How” might. “Why” usually doesn’t.

    Andrew Perry said:
    D0r0th34: The Microsoft Word grammar checker used to always warn me about that passive voice :) If I generated and controlled very large datasets, I’d definitely be working to “bell that cat”, but I’m sure it’s easier to say than do. Working toward sharing ‘small data’ is hard enough (in fact, I’d argue that currently the “effort to openly share vs. overall benefits” ratio is better for ‘big data’ compared with ‘small data’).

    Deepak said:
    Christina, the challenge is that while they might not provide information today, they might tomorrow, and perhaps more importantly, you don’t know what information they provide until you look.

    submitted using ff2wp: e14965a1-04c5-3ed0-623d-ed36ebbf39d2 2009-01-28T07:56:24Z

  10. Pingback: Coast to Coast Bio Podcast » Blog Archive » Episode 11: Arguing big data and bioinformatics skills

  11. Ahh, great read. The novel “Rainbows End” runs with this idea.

    Me, I want to model health in Spinal Cord Injury. So, we need massive amounts of information, like heart and breathing rates. Fine, put on sensors and stream into Google spreadsheets… but… this data is useless without user annotation. How to ask the user the right questions at the right time without bugging them too much?

