Problematic cell lines: now in a real database

Back in July, I was complaining about the latest abuse of the word “database” by biologists: the “PDF as database.”

This led to some very productive discussion using PubMed Commons and I’m happy to report that misidentified and contaminated cell lines are now included in the NCBI BioSample database.

As the news release notes, rather alarmingly:

This problem is so common it is thought that thousands of misleading and potentially erroneous papers have been published using cell lines that are incorrectly identified

So it would be useful if there were a direct link between the BioSample record for a cell line and PubMed records in which it was used…

Measuring quality is hard

Four articles.

Exhibit A: the infamous “(insert statistical method here)”. Exhibit B: “just make up an elemental analysis”. Exhibit C: a methods paper in which a significant proportion of the text was copied verbatim from a previous article. Finally, exhibit D, which shall be forever known as the “crappy Gabor” paper.

Exhibit A

Exhibit B

Exhibit C

Exhibit D

Notice anything?
I think that altmetrics are a great initiative. So long as we’re clear that what’s being measured is attention, not quality.

Create your own gene IDs! No wait. Don’t.

Here’s a new way to abuse biological information: take a list of gene IDs and use them to create a completely fictitious, but very convincing set of microarray probeset IDs.

This one begins with a question at BioStars, concerning the conversion of Affymetrix probeset IDs to gene names. Being a “convert ID X to ID Y” question, the obvious answer is “try BioMart” and indeed the microarray platform ([MoGene-1_0-st] Affymetrix Mouse Gene 1.0 ST) is available in the Ensembl database.

However, things get weird when we examine some example probeset IDs: 73649_at, 17921_at, 18174_at. One of the answers to the question notes that these do not map to mouse.

The data are from GEO series GSE56257. The microarray platform is GPL17777. Description: “This is identical to GPL6246 but a custom cdf environment was used to extract data. The cdf can be found at the link below.”

Uh-oh. Alarm bells.
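
A quick way to see why those IDs look suspicious is to strip the suffix and inspect what is left. This is only a sketch: it assumes that the part before "_at" is a gene ID of some kind (which database it comes from is a guess the code does not test), and the variable names are mine.

```r
# Example probeset IDs quoted in the BioStars question
probesets <- c("73649_at", "17921_at", "18174_at")

# Strip the "_at" suffix. ASSUMPTION: what remains is a gene ID
candidate_ids <- sub("_at$", "", probesets)

# Flag IDs that are purely numeric once the suffix is removed
numeric_only <- grepl("^[0-9]+$", candidate_ids)

# Side-by-side view of the original and stripped IDs
print(data.frame(probeset = probesets,
                 candidate_id = candidate_ids,
                 numeric_only = numeric_only))
```

If every ID reduces to a bare integer, that fits the idea that someone built "probeset IDs" from a list of gene IDs. The stripped values could then be tried as gene IDs in BioMart instead of as probesets.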

“Health Hack”: crossing the line between hackfest and unpaid labour

I’ve never attended a hackathon (hack day, hackfest or codefest). My impression of them is that there is, generally, a strong element of “working for the public good”: seeking to use code and data in new ways that maximise benefit and build communities.

Which is why I’m somewhat mystified by the projects on offer at the Sydney HealthHack. They read like tenders for consultants. Unpaid consultants.

The projects – a pedigree drawing tool, a workflow to process microscopy images, a statistical calculator and a mutation discovery pipeline – all describe problems that competent bioinformaticians could solve using existing tools in a relatively short time. For example, off the top of my head, ImageJ or CSIRO’s Workspace might be worth looking at for problem (2). The steps described in problem (4) – copy and paste between spreadsheets, manual inspection and manipulation of sequence data – should be depressingly familiar examples to many bioinformaticians. This project can be summarised simply as “you’re doing it wrong because you don’t know any better.”

The overall tone is “my research group requires this tool, but we’re unable to employ anyone to do it.” There is no sense of anything wider than the immediate needs of individual researchers. This does not seem, to me, what hackfest philosophy is all about.

This raises an issue that I think about a lot: how do we (the science community) best get the people with the expertise (in this case, bioinformaticians) to the people with the problems? In an ideal world the answer would be “everyone should employ at least one.” I wonder about the market (Australian or more generally) for paid consulting “biological data scientists”? We complain that we’re under-valued; well, perhaps it is we who are doing the valuation when we offer our skills for free.

Bioinformatics journals: time from submission to acceptance, revisited

Before we start: yes, we’ve been here before. There was the Biostars question “Calculating Time From Submission To Publication / Degree Of Burden In Submitting A Paper.” That gave rise to Pierre’s excellent blog post and code + data on Figshare.

So why are we here again? 1. It’s been a couple of years. 2. This is the R (+ Ruby) version. 3. It’s always worth highlighting how the poor state of publicly-available data prevents us from doing what we’d like to do. In this case the interesting question “which bioinformatics journal should I submit to for rapid publication?” becomes “here’s an incomplete analysis using questionable data regarding publication dates.”

Let’s get it out of the way then.
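
The arithmetic itself is just subtracting two dates. Here is a minimal R sketch using made-up journals and dates (entirely hypothetical; real received/accepted dates would have to come from the public records, which, as noted, are questionable).

```r
# HYPOTHETICAL dates for illustration only; not taken from any real journal
articles <- data.frame(
  journal  = c("Journal A", "Journal A", "Journal B"),
  received = as.Date(c("2014-01-10", "2014-02-01", "2014-01-15")),
  accepted = as.Date(c("2014-03-11", "2014-02-21", "2014-05-15"))
)

# Days from submission (received) to acceptance, as plain numbers
articles$days <- as.numeric(difftime(articles$accepted, articles$received,
                                     units = "days"))

# Median days to acceptance per journal
median_days <- tapply(articles$days, articles$journal, median)
print(median_days)
```

The median is used here rather than the mean because a few very slow papers would otherwise dominate; that is a design choice for the sketch, not a claim about the analysis that follows.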

PubMed Publication Date: what is it, exactly?

File this one under “has troubled me (and others) for some years now, let’s try to resolve it.”

Let’s use the excellent R/rentrez package to search PubMed for articles that were retracted in 2013.


library(rentrez)
es <- entrez_search("pubmed", "\"Retracted Publication\"[PTYP] 2013[PDAT]", usehistory = "y")
es$count
# [1] 117

117 articles. Now let’s fetch the records in XML format.

xml <- entrez_fetch("pubmed", WebEnv = es$WebEnv, query_key = es$QueryKey, 
                    rettype = "xml", retmax = es$count)

Next question: which XML element specifies the “Date of publication” (PDAT)?
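
One rough way to approach that question is to list every element whose name contains "Date" and count how often each appears. The sketch below uses base R only and runs on a tiny hand-written fragment, not a real record. The function name date_elements is my own, and the same call should work on the character string returned by the fetch above, though I have not shown that here.

```r
# List XML element names containing "Date" and count their occurrences.
# Matches opening tags only: closing tags begin with "</" and are skipped.
date_elements <- function(xml_text) {
  tags <- regmatches(xml_text,
                     gregexpr("<[A-Za-z]*Date[A-Za-z]*", xml_text))[[1]]
  table(sub("^<", "", tags))
}

# ILLUSTRATIVE fragment written by hand; element names follow PubMed XML,
# but the values are placeholders, not data from any article
sample_xml <- paste0(
  "<PubmedArticle><MedlineCitation>",
  "<DateRevised><Year>2014</Year></DateRevised>",
  "<Article><Journal><JournalIssue>",
  "<PubDate><Year>2013</Year></PubDate>",
  "</JournalIssue></Journal>",
  "<ArticleDate DateType=\"Electronic\"><Year>2013</Year></ArticleDate>",
  "</Article></MedlineCitation>",
  "<PubmedData><History>",
  "<PubMedPubDate PubStatus=\"pubmed\"><Year>2013</Year></PubMedPubDate>",
  "<PubMedPubDate PubStatus=\"medline\"><Year>2014</Year></PubMedPubDate>",
  "</History></PubmedData></PubmedArticle>"
)

counts <- date_elements(sample_xml)
print(counts)
```

A regular expression is a blunt instrument for XML; a proper parser would be the better choice for anything beyond a first look at which date elements exist.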

Ebola, Wikipedia and data janitors

Sometimes, several strands of thought come together in one place. For me right now, it’s the Wikipedia page “Ebola virus epidemic in West Africa”, which got me thinking about the perennial topic of “data wrangling”, how best to provide public data and why I can’t shake my irritation with the term “data science”. Not to mention Ebola, of course.

I imagine that a lot of people with an interest in biological data are following this story and thinking “how can I visualise the numbers for myself?” Maybe you’d like to reproduce the plots in the Timeline section of that Wikipedia entry. Surprise: the raw numbers are not that easy to obtain.

2014-09-26 note: when Wikipedia pages change, as this one has, code breaks, as this code has; updates are maintained at GitHub.

New ways to butcher biological data using Excel

I must have a minor reputation as a critic of Excel in bioinformatics, since strangers are now sending contributions to my work email address. Thanks, you know who you are!


When asked why I didn’t mask this email address, I replied “the authors didn’t.”

This week: Online Survival Analysis Software to Assess the Prognostic Value of Biomarkers Using Transcriptomic Data in Non-Small-Cell Lung Cancer. Scroll on down to supporting Table S1 and right there on the page, staring you in the face is a rather unusual-looking microarray probeset ID.

I wonder if we should start collecting notable examples in one place?

To be fair, this is more human error than an issue with Excel per se, but I’m going to argue that using Excel promotes sloppy data management by making minds lazy :)

Finally, NCBI Genomes recognises Archaea*

I’ve been complaining about this for years. They fixed it. The NCBI have reorganised their genomes FTP site and finally, Archaea are not lumped in with Bacteria.


Archaea are still included in the ASSEMBLY_BACTERIA directory; hopefully that’s next on the list.

[*] to be fair, they’ve always recognised Archaea – just not in a form that makes downloads convenient