The article Cytotoxic T cells modulate inflammation and endogenous opioid analgesia in chronic arthritis contains a statement that I don’t recall seeing before:
Availability of data and materials
We do not wish to share our data at this moment.
I missed it the first time around but apparently, back in October, Nature published a somewhat controversial article: Evidence for a limit to human lifespan. It came to my attention via a recent tweet.
The source: a fact-check article from Dutch news organisation NRC titled “Nature article is wrong about 115 year limit on human lifespan”. NRC seem rather interested in this research article. They have published another more recent critique of the work, titled “Statistical problems, but not enough to warrant a rejection”, and a discussion of that critique, Peer review post-mortem: how a flawed aging study was published in Nature.
Unfortunately, the first NRC article does itself no favours by using non-comparable x-axis scales for its charts and by not explaining clearly how the different datasets (IDL and GRG) were used. Data nerds everywhere, then, are wondering whether to repeat the analysis themselves and perhaps fire off a letter to Nature.
I recently saw a tweet floating by which included a link to some recent statistics from PubMed Commons, the NCBI service for commenting on scientific articles in PubMed. Perhaps it was this post at their blog. So I thought now would be a good time to write some code to analyse PubMed Commons data.
For further details and some charts, read on.
Why, it seems like only 12 years since we read Mistaken Identifiers: Gene name errors can be introduced inadvertently when using Excel in bioinformatics.
And can it really be 4 years since we reviewed the topic of gene name corruption in Gene name errors and Excel: lessons not learned?
Well, here we are again in 2016 with Gene name errors are widespread in the scientific literature. This study examined 35 175 supplementary Excel data files from 3 597 published articles. Simple yet clever, isn’t it? I bet you wish you’d thought of doing that. I do. The conclusion: about 20% of the articles have associated data files in which gene names have been corrupted by Excel.
We tell you not to use Excel. You counter with a host of reasons why you have to use Excel. None of them are good reasons. I don’t know what else to say. Except to reiterate that probably 80% or more of the data analyst’s time is spent on data cleaning and a good proportion of the dirt arises from avoidable errors.
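For anyone who wants to check their own files, here is a minimal sketch of how the mangled names might be spotted once a column of gene symbols has been read into a list. The two patterns are my own assumptions for illustration (dates such as “2-Sep”, which is what Excel makes of SEPT2, and floating-point strings such as “2.31E+13”); they are not the exact rules used in the published study.

```python
import re

# Month abbreviations Excel uses when it turns a symbol such as SEPT2 into a date.
MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"

# Illustrative patterns (assumptions, not the published study's rules).
DATE_LIKE = re.compile(rf"^\d{{1,2}}-(?:{MONTHS})$", re.IGNORECASE)  # e.g. "2-Sep"
FLOAT_LIKE = re.compile(r"^\d+(?:\.\d+)?E\+\d+$", re.IGNORECASE)     # e.g. "2.31E+13"


def find_mangled(names):
    """Return the entries that look like dates or floating-point numbers."""
    return [
        name
        for name in names
        if DATE_LIKE.match(name.strip()) or FLOAT_LIKE.match(name.strip())
    ]


# A made-up column of gene symbols, three of them already converted.
column = ["TP53", "2-Sep", "1-Mar", "BRCA1", "2.31E+13", "MARCH1"]
print(find_mangled(column))
```

Note that this only detects the damage. Once SEPT2 has become a date, the original symbol is gone from the file.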
Just a short note to alert you to a publication with my name on it. Great work by lead author and former colleague Aidan; I just did “the Gephi stuff”. If you’re interested in bioinformatics applications of Apache Spark, take a look at:
Happy to report it is open access.
A recent tweet made me think: (1) has it really been 5 years, (2) gee, my ggplot skills were dreadful back then and (3) did I really not know how to correct for the increase in total publications?
“Novel” findings, as judged by the usage of that word in titles and abstracts, really have undergone a startling increase since about 1975. Indeed, almost 7.2% of findings were “novel” in 2014, compared with 3.2% for the period 1845 – 2014. That said, if we plot using a log scale as suggested by Tal on the original post, the rate of usage appears to be slowing down. See image, right.
As before, none of this is novel.
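Correcting for the growth in total publications just means dividing each year’s “novel” count by that year’s total. A minimal sketch, assuming the per-year counts are already in two dictionaries; the function name and the numbers are made up for illustration and are not PubMed figures.

```python
def percent_novel(novel_counts, total_counts):
    """Map each year to the percentage of records that use the word 'novel'."""
    return {
        year: 100.0 * novel_counts.get(year, 0) / total
        for year, total in sorted(total_counts.items())
        if total > 0  # skip years with no records to avoid dividing by zero
    }


# Made-up counts for illustration only.
novel = {1975: 50, 1995: 400, 2014: 1500}
total = {1975: 10000, 1995: 20000, 2014: 25000}

for year, pct in percent_novel(novel, total).items():
    print(year, f"{pct:.1f}%")
```

Plotting these percentages rather than the raw counts stops the overall rise in publications from being mistaken for a rise in novelty.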
I am forever returning to PubMed data, downloaded as XML, trying to extract information from it and becoming deeply confused in the process.
Take the seemingly-simple question “how many retracted articles are there in PubMed?”
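One way to ask that question without downloading any XML is the NCBI E-utilities ESearch service, which can return just a count. The sketch below builds the request URL and parses a response. The search term is my assumption about what “retracted” should mean here, and the sample response is hand-written with a placeholder number, not a real PubMed count.

```python
import urllib.parse
import xml.etree.ElementTree as ET

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def count_url(term):
    """Build an ESearch URL that asks PubMed only for the number of matches."""
    query = urllib.parse.urlencode(
        {"db": "pubmed", "term": term, "rettype": "count"}
    )
    return f"{ESEARCH}?{query}"


def parse_count(xml_text):
    """Pull the integer inside <Count> out of an ESearch XML response."""
    return int(ET.fromstring(xml_text).findtext("Count"))


# Assumed search term: the "Retracted Publication" publication type.
print(count_url('"Retracted Publication"[PT]'))

# Hand-written response with a placeholder number.
sample = "<eSearchResult><Count>123</Count></eSearchResult>"
print(parse_count(sample))
```

The answer depends on the term: “Retracted Publication” tags the articles themselves, whereas “Retraction of Publication” tags the retraction notices, and the two counts need not agree.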
PeerJ, like PLoS ONE, aims to publish work on the basis of “soundness” (scientific and methodological) as opposed to subjective notions of impact, interest or significance. I’d argue that effective, appropriate data visualisation is a good measure of methodology. I’d also argue that on that basis, Evolution of a research field – a micro (RNA) example fails the soundness test.
Four articles.
Exhibit A: the infamous “(insert statistical method here)”. Exhibit B: “just make up an elemental analysis”. Exhibit C: a methods paper in which a significant proportion of the text was copied verbatim from a previous article. Finally, exhibit D, which shall be forever known as the “crappy Gabor” paper.
I think that altmetrics are a great initiative. So long as we’re clear that what’s being measured is attention, not quality.