Why, it seems like only 12 years since we read Mistaken Identifiers: Gene name errors can be introduced inadvertently when using Excel in bioinformatics.
And can it really be 4 years since we reviewed the topic of gene name corruption in Gene name errors and Excel: lessons not learned?
Well, here we are again in 2016 with Gene name errors are widespread in the scientific literature. This study examined 35 175 supplementary Excel data files from 3 597 published articles. Simple yet clever, isn’t it. I bet you wish you’d thought of doing that. I do. The conclusion: about 20% of articles have associated data files in which gene names have been corrupted by Excel.
What if there is no tomorrow? There wasn’t one today.
We tell you not to use Excel. You counter with a host of reasons why you have to use Excel. None of them are good reasons. I don’t know what else to say. Except to reiterate that probably 80% or more of the data analyst’s time is spent on data cleaning and a good proportion of the dirt arises from avoidable errors.
Just a short note to alert you to a publication with my name on it. Great work by lead author and former colleague Aidan; I just did “the Gephi stuff”. If you’re interested in bioinformatics applications of Apache Spark, take a look at:
VariantSpark: population scale clustering of genotype information
Happy to report it is open access.
A recent tweet:
PubMed articles containing “novel” in title or abstract 1845 – 2014
made me think (1) has it really been 5 years, (2) gee, my ggplot skills were dreadful back then and (3) did I really not know how to correct for the increase in total publications?
So here is the update, at Github and a document at RPubs.
“Novel” findings, as judged by the usage of that word in titles and abstracts really have undergone a startling increase since about 1975. Indeed, almost 7.2% of findings were “novel” in 2014, compared with 3.2% for the period 1845 – 2014. That said, if we plot using a log scale as suggested by Tal on the original post, the rate of usage appears to be slowing down. See image, right (click for larger version).
As before, none of this is novel.
I am forever returning to PubMed data, downloaded as XML, trying to extract information from it and becoming deeply confused in the process.
Take the seemingly-simple question “how many retracted articles are there in PubMed?”
PeerJ, like PLoS ONE, aims to publish work on the basis of “soundness” (scientific and methodological) as opposed to subjective notions of impact, interest or significance. I’d argue that effective, appropriate data visualisation is a good measure of methodology. I’d also argue that on that basis, Evolution of a research field – a micro (RNA) example fails the soundness test.
Four articles. Click on the images for larger versions.
Exhibit A: the infamous “(insert statistical method here)”. Exhibit B: “just make up an elemental analysis“. Exhibit C: a methods paper in which a significant proportion of the text was copied verbatim from a previous article. Finally, exhibit D, which shall be forever known as the “crappy Gabor” paper.
I think that altmetrics are a great initiative. So long as we’re clear that what’s being measured is attention, not quality.
I’ve had a half-formed, but not very interesting blog post in my head for some months now. It’s about a conversation I had with a PhD student, around 10 years ago, after she went to a bioinformatics talk titled “Excel is not a database” and how she laughed as I’d been telling her that “for years already”. That’s basically the post so as I say, not that interesting, except as an illustration that we’ve been talking about this stuff for a long time (and little has changed).
HEp-2 or not HEp2?
Anyway, we have something better. I was exploring PubMed Commons
, which is becoming a very good resource. The top-featured comment
looks very interesting (see image, right).
Intrigued, I went to investigate the Database of Cross-contaminated or Misidentified Cell Lines, hovered over the download link and saw that it’s – wait for it – a PDF. I’ll say that again. The “database” is a PDF.
The sad thing is that this looks like very useful, interesting information which I’m sure would be used widely if presented in an appropriate (open) format and better-publicised. Please, biological science, stop embarrassing yourself. If you don’t know how to do data properly, talk to someone who does.
I’m pleased to announce an open-access publication with my name on it:
Mitchell, S.M., Ross, J.P., Drew, H.R., Ho, T., Brown, G.S., Saunders, N.F.W., Duesing, K.R., Buckley, M.J., Dunne, R., Beetson, I., Rand, K.N., McEvoy, A., Thomas, M.L., Baker, R.T., Wattchow, D.A., Young, G.P., Lockett, T.J., Pedersen, S.K., LaPointe L.C. and Molloy, P.L. (2014). A panel of genes methylated with high frequency in colorectal cancer. BMC Cancer 14:54.
So, I read the title:
Mining locus tags in PubMed Central to improve microbial gene annotation
and skimmed the abstract:
The scientific literature contains millions of microbial gene identifiers within the full text and tables, but these annotations rarely get incorporated into public sequence databases.
and thought, well OK, but wouldn’t it be better to incorporate annotations in the first place – when submitting to the public databases – rather than by this indirect method?
The point, of course, is to incorporate new findings from the literature into existing records, rather than to use the tool as a primary method of annotation. I do believe that public databases could do more to enforce data quality standards at deposition time, but that’s an entirely separate issue.
Big thanks to Michael Hoffman for a spirited Twitter discussion that put me straight.
On a rare, brief holiday (here and here, if you’re interested; both highly-recommended), I make the mistake of checking my Twitter feed:
This points me to BoxPlotR. It draws box plots. Using Shiny Server. That’s the “innovation”, presumably.
With “quilt plots” and now this, I’m starting to think that I’ve been doing science wrong all these years. If I’d been told to submit the trivial computational work I do every single day to journals, I could have thousands of publications by now.
I’m still pretty relaxed post-holiday, so let’s just leave it there.