We do not wish to share

November 16, 2018 / nsaunders

The article Cytotoxic T cells modulate inflammation and endogenous opioid analgesia in chronic arthritis contains a statement that I don’t recall seeing before:

Availability of data and materials

We do not wish to share our data at this moment.


To do analysis stuff

August 2, 2017 / nsaunders

First there was “insert statistical method here”. Now we have R – making it easy “to do analysis stuff”.

bhattacharyya2017.pdf

Via Elisabeth; I’ll hand you over now for an entertaining summary.

To be fair, analysis stuff describes my working life quite well.

Evidence for a limit to effective peer review

December 18, 2016 / nsaunders / 3 Comments

I missed it first time around but apparently, back in October, Nature published a somewhat-controversial article: Evidence for a limit to human lifespan. It came to my attention in a recent tweet:

Just wow https://t.co/fupXIOAC43 pic.twitter.com/vsxT3VyTg6

— Nick Loman (@pathogenomenick) December 11, 2016

The source: a fact-check article from Dutch news organisation NRC titled “Nature article is wrong about 115 year limit on human lifespan”. NRC seem rather interested in this research article. They have published another more recent critique of the work, titled “Statistical problems, but not enough to warrant a rejection” and a discussion of that critique, Peer review post-mortem: how a flawed aging study was published in Nature.

Unfortunately, the first NRC article does itself no favours by using non-comparable x-axis scales for its charts and by not explaining very well how the different datasets (IDL and GRG) were used. Data nerds everywhere, then, are wondering whether to repeat the analysis themselves and perhaps fire off a letter to Nature.


An Analysis of Contributions to PubMed Commons

December 2, 2016 (updated March 15, 2018) / nsaunders

I recently saw a tweet floating by which included a link to some recent statistics from PubMed Commons, the NCBI service for commenting on scientific articles in PubMed. Perhaps it was this post at their blog. So I thought now would be a good time to write some code to analyse PubMed Commons data.

The tl;dr version: here’s the Github repository and the report.

For further details and some charts, read on.


Data corruption using Excel: 12+ years and counting

August 25, 2016 (updated October 6, 2016) / nsaunders / 2 Comments

Why, it seems like only 12 years since we read Mistaken Identifiers: Gene name errors can be introduced inadvertently when using Excel in bioinformatics.

And can it really be 4 years since we reviewed the topic of gene name corruption in Gene name errors and Excel: lessons not learned?

Well, here we are again in 2016 with Gene name errors are widespread in the scientific literature. This study examined 35 175 supplementary Excel data files from 3 597 published articles. Simple yet clever, isn’t it? I bet you wish you’d thought of doing that. I do. The conclusion: about 20% of the articles have associated data files in which gene names have been corrupted by Excel.
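For anyone who wants to check their own files, here is a minimal sketch of the general idea. It assumes the familiar failure mode in which Excel turns symbols such as SEPT2 or MARCH1 into dates like “2-Sep” or “1-Mar”. The function name, the pattern and the sample column are all made up for illustration; this is not the method used in the study.

```python
import re

# Month abbreviations as they appear when Excel displays a day-month date.
MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"

# Matches values such as "2-Sep" or "1-Mar", optionally with a year ("2-Sep-16").
# This pattern is an assumption about what corrupted symbols look like.
DATE_LIKE = re.compile(rf"^\d{{1,2}}-(?:{MONTHS})(?:-\d{{2,4}})?$", re.IGNORECASE)


def flag_date_like(values):
    """Return the values that look like dates rather than gene symbols."""
    return [v for v in values if DATE_LIKE.match(v.strip())]


# Hypothetical gene-name column, as it might be read from an exported file.
column = ["TP53", "2-Sep", "BRCA1", "1-Mar", "DEC1"]
print(flag_date_like(column))
```

Note that this only flags suspects. Once Excel has converted a symbol, the original name cannot always be recovered from the date alone.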

What if there is no tomorrow? There wasn’t one today.

We tell you not to use Excel. You counter with a host of reasons why you have to use Excel. None of them are good reasons. I don’t know what else to say. Except to reiterate that probably 80% or more of the data analyst’s time is spent on data cleaning and a good proportion of the dirt arises from avoidable errors.

Variants + Spark = VariantSpark

December 29, 2015 / nsaunders / 2 Comments

Just a short note to alert you to a publication with my name on it. Great work by lead author and former colleague Aidan; I just did “the Gephi stuff”. If you’re interested in bioinformatics applications of Apache Spark, take a look at:

VariantSpark: population scale clustering of genotype information

Happy to report it is open access.

Novelty: an update

October 21, 2015 (updated March 16, 2018) / nsaunders

A recent tweet:

@neilfws I enjoyed this: https://t.co/ynyHRbgpLN Have you published (or are you thinking about publishing) this analysis anywhere?

— Marcus Munafo (@MarcusMunafo) October 7, 2015

PubMed articles containing “novel” in title or abstract 1845 – 2014

made me think (1) has it really been 5 years, (2) gee, my ggplot skills were dreadful back then and (3) did I really not know how to correct for the increase in total publications?

So here is the update, at Github and the report.

“Novel” findings, as judged by the usage of that word in titles and abstracts, really have undergone a startling increase since about 1975. Indeed, almost 7.2% of findings were “novel” in 2014, compared with 3.2% for the period 1845 – 2014. That said, if we plot using a log scale as suggested by Tal on the original post, the rate of usage appears to be slowing down. See image.
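One common way to correct for the growth in total publications is to express the yearly count of matching articles as a percentage of all articles from that year. The sketch below assumes that approach; the function name and the counts are invented for illustration and are not taken from the report.

```python
def percent_with_term(term_counts, total_counts):
    """Percentage of each year's articles that contain the term.

    Both arguments map year -> article count. Years with no articles
    are skipped to avoid dividing by zero.
    """
    return {
        year: 100.0 * term_counts.get(year, 0) / total
        for year, total in sorted(total_counts.items())
        if total > 0
    }


# Hypothetical counts, for illustration only.
novel = {1975: 5, 2014: 30}
everything = {1975: 200, 2014: 400}

for year, pct in percent_with_term(novel, everything).items():
    print(year, round(pct, 1))
```

Plotting the percentage rather than the raw count separates a real change in usage from the simple fact that more is published every year.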

As before, none of this is novel.

Just how many retracted articles are there in PubMed anyway?

March 20, 2015 (updated March 22, 2015) / nsaunders / 3 Comments

I am forever returning to PubMed data, downloaded as XML, trying to extract information from it and becoming deeply confused in the process.

Take the seemingly-simple question “how many retracted articles are there in PubMed?”

Note to journals: “methodologically sound” applies to figures too

March 18, 2015 / nsaunders / 3 Comments

PeerJ, like PLoS ONE, aims to publish work on the basis of “soundness” (scientific and methodological) as opposed to subjective notions of impact, interest or significance. I’d argue that effective, appropriate data visualisation is a good measure of methodology. I’d also argue that on that basis, Evolution of a research field – a micro (RNA) example fails the soundness test.

Measuring quality is hard

November 13, 2014 / nsaunders / 5 Comments

Four articles.

Exhibit A: the infamous “(insert statistical method here)”. Exhibit B: “just make up an elemental analysis”. Exhibit C: a methods paper in which a significant proportion of the text was copied verbatim from a previous article. Finally, exhibit D, which shall be forever known as the “crappy Gabor” paper.

Exhibit A

Exhibit B

Exhibit C

Exhibit D

Notice anything?
I think that altmetrics are a great initiative. So long as we’re clear that what’s being measured is attention, not quality.

What You're Doing Is Rather Desperate

Notes from the life of a [data] scientist

publications