I’ve had a half-formed, but not very interesting, blog post in my head for some months now. It’s about a conversation I had with a PhD student around 10 years ago, after she went to a bioinformatics talk titled “Excel is not a database”, and how she laughed because I’d been telling her that “for years already”. That’s basically the whole post so, as I say, not that interesting, except as an illustration that we’ve been talking about this stuff for a long time (and little has changed).
HEp-2 or not HEp2?
Anyway, we have something better. I was exploring PubMed Commons, which is becoming a very good resource. The top-featured comment looks very interesting (see image, right).
Intrigued, I went to investigate the Database of Cross-contaminated or Misidentified Cell Lines, hovered over the download link and saw that it’s – wait for it – a PDF. I’ll say that again. The “database” is a PDF.
The sad thing is that this looks like very useful, interesting information which I’m sure would be used widely if it were presented in an appropriate (open) format and better publicised. Please, biological science, stop embarrassing yourself. If you don’t know how to do data properly, talk to someone who does.
I’m pleased to announce an open-access publication with my name on it:
Mitchell, S.M., Ross, J.P., Drew, H.R., Ho, T., Brown, G.S., Saunders, N.F.W., Duesing, K.R., Buckley, M.J., Dunne, R., Beetson, I., Rand, K.N., McEvoy, A., Thomas, M.L., Baker, R.T., Wattchow, D.A., Young, G.P., Lockett, T.J., Pedersen, S.K., LaPointe, L.C. and Molloy, P.L. (2014). A panel of genes methylated with high frequency in colorectal cancer. BMC Cancer 14:54.
So, I read the title:
Mining locus tags in PubMed Central to improve microbial gene annotation
and skimmed the abstract:
The scientific literature contains millions of microbial gene identifiers within the full text and tables, but these annotations rarely get incorporated into public sequence databases.
and thought, well OK, but wouldn’t it be better to incorporate annotations in the first place – when submitting to the public databases – rather than by this indirect method?
The point, of course, is to incorporate new findings from the literature into existing records, rather than to use the tool as a primary method of annotation. I do believe that public databases could do more to enforce data quality standards at deposition time, but that’s an entirely separate issue.
Big thanks to Michael Hoffman for a spirited Twitter discussion that put me straight.
On a rare, brief holiday (here and here, if you’re interested; both highly recommended), I make the mistake of checking my Twitter feed:
This points me to BoxPlotR. It draws box plots. Using Shiny Server. That’s the “innovation”, presumably.
With “quilt plots” and now this, I’m starting to think that I’ve been doing science wrong all these years. If I’d been told to submit the trivial computational work I do every single day to journals, I could have thousands of publications by now.
I’m still pretty relaxed post-holiday, so let’s just leave it there.
A “quilt plot”
Quilt plots. Sounds interesting. The link points to a short article in PLoS ONE, containing a table and a figure. Here is Figure 1.
If you looked at that and thought “Hey, that’s a heat map!”, you are correct. That is a heat map. Let’s be quite clear about that. It’s a heat map.
So, how do the authors justify publishing a method for drawing heat maps and then calling them “quilt plots”?
I was reading an interesting post at Genomes Unzipped, “Human genetics is microbial genomics”, which states:
Only 10% of cells on your “human” body are human anyway, the rest are microbial.
Have you read a sentence like that before? So have I. So has a reader who left a comment:
I was wondering if you have a source for “Only 10% of cells on your “human” body are human anyway, the rest are microbial”
It’s a good question. Everyone quotes this figure, almost no-one provides a reference. Let’s go in search of one.
You can guarantee that when scientists publish a study titled:
Determining the Presence of Periodontopathic Virulence Factors in Short-Term Postmortem Alzheimer’s Disease Brain Tissue
a newspaper will publish a story titled:
Poor dental health and gum disease may cause Alzheimer’s
Without access to the paper, it’s difficult to assess the evidence. I suggest you read Jonathan Eisen’s analysis of the abstract. Essentially, it makes two claims:
- that cultured astrocytes (a type of brain cell) can adsorb and internalize lipopolysaccharide (LPS) from Porphyromonas gingivalis, a bacterium found in the mouth
- that LPS was also detected in brain tissue from 4/10 Alzheimer’s disease (AD) cases, but not in tissue from 10 matched normal brains
Regardless of the biochemistry – which does not sound especially convincing to me – how about the statistics?
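The headline comparison is easy to check for yourself. The sketch below is not the paper’s analysis: it assumes the abstract’s numbers reduce to a 2x2 table (LPS detected in 4 of 10 AD brains and 0 of 10 normal brains) and applies Fisher’s exact test, a common choice for counts this small. The function name `fisher_exact_2x2` is my own.

```python
from math import comb


def fisher_exact_2x2(a, b, c, d):
    """Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Returns (one_sided_p, two_sided_p), where the one-sided value is the
    probability of seeing at least `a` in the top-left cell, given fixed
    row and column totals.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = comb(row1 + row2, col1)

    # Hypergeometric weights (as exact integers) for every table that
    # shares the observed margins.
    lo, hi = max(0, col1 - row2), min(row1, col1)
    weights = {k: comb(row1, k) * comb(row2, col1 - k) for k in range(lo, hi + 1)}
    observed = weights[a]

    one_sided = sum(w for k, w in weights.items() if k >= a) / total
    # Two-sided: sum over all tables no more probable than the observed one.
    two_sided = sum(w for w in weights.values() if w <= observed) / total
    return one_sided, two_sided


# Assumed table: rows are AD / normal, columns are LPS detected / not detected.
one_sided, two_sided = fisher_exact_2x2(4, 6, 0, 10)
print(f"one-sided p = {one_sided:.3f}, two-sided p = {two_sided:.3f}")
# prints: one-sided p = 0.043, two-sided p = 0.087
```

With these counts, whether the result clears the usual 0.05 threshold depends on whether you choose a one-sided or a two-sided test, which is not much to hang a headline on.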
Just how many (bad) -omics are there anyway? Let’s find out.
Update: code and data now at GitHub
Here’s a tip. When you write an article about your software, the title of which indicates that open-source is important:
A universal open-source Electronic Laboratory Notebook
but you then:
- provide almost no details in the abstract
- do not provide a link to a website or repository from which your “free” software can be obtained
- choose not to make the article open access
- and put the installation instructions in a supplementary data file which is also not open access
Don’t be surprised when no-one uses your software.
Or is the publication more important to you than the product?
File under “interesting articles that I don’t have time to write about at length.”
- Archaea and Fungi of the Human Gut Microbiome: Correlations with Diet and Bacterial Residents
Long ago, before metagenomics and NGS, I did a little work on detection of Archaea in human microbiomes. There’s a blog post in the pipeline about that but until then, enjoy this article in PLoS ONE.
- Mutational heterogeneity in cancer and the search for new cancer-associated genes
This article is getting a lot of attention on Twitter this week. Brief summary: cancer cells are really messed up in all sorts of ways, most of which are not causal with respect to the cancer. Anyone who has ever looked at microarray data knows that it’s not uncommon for 50% or more of genes to show differential expression in a cancer/normal comparison, so this is hardly a new concept. I think we need to move away from ever-more detailed characterizations of the ways in which cancer cells are “messed up.” We know that they are and that doesn’t provide much insight, in my opinion.
- The vast majority of statistical analysis is not performed by statisticians
Interesting post by Jeff Leek, summarized very well by its title. It points out that many more people are now interested in data analysis, that many of them are not professionally trained as statisticians (I’m in this category myself) and that we need to recognize and plan for that.
Bonus post doing the rounds of social media: Using Metadata to Find Paul Revere. Social network analysis, 18th-century style. Amusing, informative and topical.