Monthly Archives: February 2011

Nature on reproducible research

I imagine that most people, when asked “do you think that independent confirmation of research findings is important?”, would answer “yes”. I also imagine that, when told that in most cases this is not possible, those people might be concerned or perhaps incredulous. However, this really is the case, which is why I spend much of my working life in a state of concern and incredulity.

Over the years, many articles have been written on how to improve this state of affairs by adopting best practices, collectively termed reproducible research. One of the latest is an editorial in Nature. I’ve pulled out a few quotes for discussion.
Read the rest…

Real bioinformaticians write code

A lot of questions at BioStar begin along these lines:

Where can I find…?
I am looking for a resource…?
Is there some database…?

I tweeted some concerns about this:

Many #biostar questions begin “I am looking for a resource..”. The answer is often that you need to code a solution using the data you have.

Chris tweeted back:

@neilfws Lit. or Google search is first step, asking around is the next logical step. (Re-)inventing wheels is last. Worth asking, IMHO.

We had a little chat and I realised that 140 characters or less was not getting my point across (not for the first time). What I was trying to say was something like this.
Read the rest…

Getting “stuff” into MongoDB

One of the aspects I like most about MongoDB is the “store first, ask questions later” approach. No need to worry about table design, column types or constant migrations as design changes. Provided that your data are in some kind of hash-like structure, you just drop them in.

Ruby is particularly useful for this task, since it has many gems which can parse common formats into a hash. Here are 3 quick examples with relevance to bioinformatics.
Read the rest…

Dumped on by data scientists

A story in The Chronicle of Higher Education reminded me that I’ve been meaning to write about “data science” for some time.

The headline to the story:

“Dumped On by Data: Scientists Say a Deluge Is Drowning Research”

Rather amusingly, this is abbreviated in the URL to “Dumped-On-by-Data-Scientists”; a nice example of how the same words, broken in the wrong place, can lead to a completely different meaning.

Anyway, to the point. The term “data scientist” – a good thing, or not?
I’m throwing this one out there because I spent much of 2010 (a) reading articles that used the term and (b) trying to decide whether I like it or not – and I still can’t decide.

Arguments for:

  • It’s an attention-grabber, designed to make us think about the tools and skills required to analyse “big data” in the same way that “NoSQL” is designed to make us think about alternative database solutions

Arguments against:

  • The “data” part is redundant, since all scientists deal with data
  • It belittles the job title of “scientist”; the term might be construed as dismissive of the education, training and skills required to do “boring old school science” as opposed to “new, flashy sexy data science”
  • Many (most?) “data scientists” do business intelligence, not science; crunching Twitter posts to help formulate a better product marketing strategy is not the same as addressing a genuine scientific problem

At the heart of the issue, I feel, is a different approach to data. In “data science” we start with everything, give it a shake and see if answers to our questions fall out. In “real science” we start with a specific question, generate data designed to answer that question and see what falls out. Perhaps they are just different philosophies and mindsets. Perhaps each can learn from the other.

I guess with one “for” and three “against” I’ve decided that I don’t like the term “data scientist”, but I can’t quite shake the feeling that it has some use. What do you think?

Conservative (with a small “c”) research

This is really interesting. I’m reading it at work so I can’t tell you if it’s behind the paywall, but I sincerely hope not; it deserves to be read widely:

Edwards, A.M. et al. (2011)
Too many roads not taken.
Nature 470: 163–165

Most protein research focuses on those known before the human genome was mapped. Work on the slew discovered since, urge Aled M. Edwards and his colleagues.

The article includes some nicely done bibliometric analysis. I’ve lifted a few quotes that illustrate some of the key points.

  • More than 75% of protein research still focuses on the 10% of proteins that were known before the genome was mapped
  • Around 65% of the 20,000 kinase papers published in 2009 focused on the 50 proteins that were the ‘hottest’ in the early 1990s
  • Similarly, 75% of the research activity on nuclear hormone receptors in 2009 focused on the 6 (of 48) receptors that were most studied in the mid 1990s
  • A common assumption is that previous research efforts have preferentially identified the most important proteins – the evidence doesn’t support this
  • Why the reluctance to work on the unknown? [...] scientists are wont to “fondle their problems”
  • Funding and peer-review systems are risk-averse
  • The availability of chemical probes for a given receptor dictates the level of research interest in it; the development of these tools is not driven by the importance of the protein

I love the phrase “fondle their problems.”

I’ve long felt that academic research has increasingly little to do with “advancing knowledge” and is more concerned with churning out “more of the same” to consolidate individual careers. However, that’s just me being opinionated and anecdotal. What do you think?

Algorithms running day and night

Warning: contains murky, somewhat unstructured thoughts on large-scale biological data analysis

Picture this. It’s based on a true story: names and details altered.

Alice, a biomedical researcher, performs an experiment to determine how gene expression in cells from a particular tissue is altered when the cells are exposed to an organic compound, substance Y. She collates a list of the most differentially expressed genes and notes, in passing, that the expression of Gene X is much lower in the presence of substance Y.

Bob, a bioinformatician in the same organisation but in a different city to Alice, is analysing a public dataset. This experiment looks at gene expression in the same tissue but under different conditions: normal compared with a disease state, Z Syndrome. He also notes that Gene X appears in his list – its expression is much higher in the diseased tissue.

Alice and Bob attend the annual meeting of their organisation, where they compare notes and realise the potential significance of substance Y in suppressing the expression of Gene X and so perhaps relieving the symptoms of Z Syndrome. On hearing this, the head of the organisation, Charlie, marvels at the serendipitous nature of the discovery. Surely, he muses, given the amount of publicly available experimental data, there must be a way to automate this kind of discovery by somehow “cross-correlating” everything with everything else until patterns emerge. What we need, states Charlie, is:

Algorithms running day and night, crunching all of that data
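
Reduced to its simplest form, the kind of cross-referencing Charlie has in mind might look something like the sketch below. It is only an illustration under stated assumptions: the gene names, the fold-change values, the threshold and the function name `opposite_direction_genes` are all invented here, and real expression data would need proper normalisation and statistics before a comparison like this meant anything.

```python
# Hypothetical log2 fold changes (condition vs control); all values are invented.
alice_substance_y = {"GeneX": -2.4, "GeneA": 1.8, "GeneB": -0.3}
bob_z_syndrome = {"GeneX": 3.1, "GeneA": 2.2, "GeneC": -1.9}


def opposite_direction_genes(first, second, threshold=1.0):
    """Return genes strongly changed in both experiments, in opposite directions.

    `threshold` is an arbitrary cut-off on the absolute log2 fold change.
    """
    # Only genes measured in both experiments can be compared.
    shared = first.keys() & second.keys()
    hits = []
    for gene in sorted(shared):
        a, b = first[gene], second[gene]
        strongly_changed = abs(a) >= threshold and abs(b) >= threshold
        # A negative product means the two changes have opposite signs.
        if strongly_changed and a * b < 0:
            hits.append(gene)
    return hits


candidates = opposite_direction_genes(alice_substance_y, bob_z_syndrome)
print(candidates)
```

With these made-up inputs the only candidate is GeneX: lower with substance Y, higher in Z Syndrome.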

What’s Charlie missing?
Read the rest…

What’s more important: the publication or the product?

The Nature stable of journals. A byword for quality, integrity, impact. Witness this recent offering from Nature Biotechnology:

Bale, S. et al. (2011)
MutaDATABASE: a centralized and standardized DNA variation database.
Nature Biotechnology 29, 117–118

Unfortunately, although it describes an open, public database, the article itself costs $32 to read without a subscription (update: it’s freely available as of one day after this post). Not to be deterred, I went to investigate MutaDATABASE itself.

The alarm bells began to ring right there on the index page (see screenshot, right).
Could that be right? I tried several browsers, in case of a rendering problem. Same result – no contents.

[Screenshot: “There seems to be something missing”]

Clicking on some of the links in the sidebar, I became more concerned. Here’s an example URL:

I recognise that form of URL – it comes from Joomla, a content management system. I’ve had servers compromised only twice in my career – both times, due to Joomla-based websites. Their security may have improved since, I guess – but this smacks of people looking to build a website quickly without investigating the alternatives.


[Screenshot: “It will be great. Promise.”]

Then, there are the spelling/grammatical errors, the “coming soons”, the “under constructions”, the news page not updated in almost 5 months. And as Tim Yates pointed out to me:


@neilfws The mutaDATABASE logo leads me to believe you are right about it being a joke.. is that someone dropping their sequences in a bin?

Who knows, MutaDATABASE may turn out to be terrific. Right now though, it’s rather hard to tell. The database and web server issues of Nucleic Acids Research require that the tools described be functional for review and publication. Apparently, Nature Biotechnology does not.