An Analysis of Contributions to PubMed Commons

December 2, 2016 (updated March 15, 2018) / nsaunders

I recently saw a tweet floating by which included a link to some recent statistics from PubMed Commons, the NCBI service for commenting on scientific articles in PubMed. Perhaps it was this post at their blog. So I thought now would be a good time to write some code to analyse PubMed Commons data.

The tl;dr version: here’s the GitHub repository and the report.

For further details and some charts, read on.


Novelty: an update

October 21, 2015 (updated March 16, 2018) / nsaunders

A recent tweet:

@neilfws I enjoyed this: https://t.co/ynyHRbgpLN Have you published (or are you thinking about publishing) this analysis anywhere?

— Marcus Munafo (@MarcusMunafo) October 7, 2015

PubMed articles containing “novel” in title or abstract 1845 – 2014

made me think (1) has it really been 5 years, (2) gee, my ggplot skills were dreadful back then and (3) did I really not know how to correct for the increase in total publications?

So here is the update, at GitHub, and the report.

“Novel” findings, as judged by the usage of that word in titles and abstracts, really have undergone a startling increase since about 1975. Indeed, almost 7.2% of findings were “novel” in 2014, compared with 3.2% for the period 1845 – 2014. That said, if we plot using a log scale as suggested by Tal on the original post, the rate of usage appears to be slowing down. See image.

As before, none of this is novel.

Exploring the NCBI taxonomy database using Entrez Direct

April 2, 2015 / nsaunders / 2 Comments

I’ve been meaning to write about Entrez Direct, henceforth called edirect, for some time. This tweet provided me with an excuse:

OK tweeps, I'm looking for estimates of species richness in insects. Any orders, families, or genera with good estimates you know of? pls RT

— Robert Lanfear (@RobLanfear) April 1, 2015

This post is not strictly the answer to that question. Instead we’ll ask: which parent IDs of records for insects in the NCBI Taxonomy database have the most species IDs?

Just how many retracted articles are there in PubMed anyway?

March 20, 2015 (updated March 22, 2015) / nsaunders / 3 Comments

I am forever returning to PubMed data, downloaded as XML, trying to extract information from it and becoming deeply confused in the process.

Take the seemingly-simple question “how many retracted articles are there in PubMed?”

Problematic cell lines: now in a real database

December 8, 2014 / nsaunders / 5 Comments

Back in July, I was complaining about the latest abuse of the word “database” by biologists: the “PDF as database.”

This led to some very productive discussion using PubMed Commons and I’m happy to report that misidentified and contaminated cell lines are now included in the NCBI BioSample database.

As the news release notes, rather alarmingly:

This problem is so common it is thought that thousands of misleading and potentially erroneous papers have been published using cell lines that are incorrectly identified

So it would be useful if there were a direct link between the BioSample record for a cell line and PubMed records in which it was used…

PubMed Publication Date: what is it, exactly?

September 24, 2014 / nsaunders / 2 Comments

File this one under “has troubled me (and others) for some years now, let’s try to resolve it.”

Let’s use the excellent R/rentrez package to search PubMed for articles that were retracted in 2013.

library(rentrez)

es <- entrez_search("pubmed", "\"Retracted Publication\"[PTYP] 2013[PDAT]", usehistory = "y")
es$count
# [1] 117

117 articles. Now let’s fetch the records in XML format.

xml <- entrez_fetch("pubmed", WebEnv = es$WebEnv, query_key = es$QueryKey, 
                    rettype = "xml", retmax = es$count)
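
A note for anyone running this today: the snippets above use the rentrez argument names from the time of writing. Later rentrez releases, as far as I know, renamed usehistory to use_history and bundle WebEnv and QueryKey into a single web_history object. A minimal sketch under that assumption follows; retracted_term and fetch_retracted_xml are hypothetical helper names, not from the original post, and the count returned today will likely differ from 117.

```r
# Build the search term used in the post for a given publication year.
# (hypothetical helper, not part of rentrez)
retracted_term <- function(year) {
  sprintf("\"Retracted Publication\"[PTYP] %d[PDAT]", year)
}

# Search PubMed via the history server, then fetch all matching records as XML text.
# Assumes a rentrez release that accepts use_history and web_history.
fetch_retracted_xml <- function(year) {
  es <- rentrez::entrez_search(db = "pubmed", term = retracted_term(year),
                               use_history = TRUE)
  rentrez::entrez_fetch(db = "pubmed", web_history = es$web_history,
                        rettype = "xml", retmax = es$count)
}

# Requires network access and the rentrez package, so left commented out:
# xml <- fetch_retracted_xml(2013)
```

Keeping the term in its own function makes it easy to repeat the search for other years without retyping the quoting.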

Next question: which XML element specifies the “Date of publication” (PDAT)?

Finally, NCBI Genomes recognises Archaea*

August 27, 2014 / nsaunders

I’ve been complaining about this for years. They fixed it. The NCBI have reorganised their genomes FTP site and finally, Archaea are not lumped in with Bacteria.

GenBank: ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/archaea/
RefSeq:  ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/archaea/

Archaea are still included in the ASSEMBLY_BACTERIA directory; hopefully that’s next on the list.

[*] to be fair, they’ve always recognised Archaea – just not in a form that makes downloads convenient

When is db=all not db=all? When you use Entrez ELink.

April 30, 2014 / nsaunders

Just a brief technical note.

I figured that for a given compound in PubChem, it would be interesting to know whether that compound had been used in a high-throughput experiment, which you might find in GEO. Very easy using the E-utilities, as implemented in the R package rentrez:

library(rentrez)
links <- entrez_link(dbfrom = "pccompound", db = "gds", id = "62857")
length(links$pccompound_gds)
# [1] 741

Browsing the rentrez documentation, I note that db can take the value “all”. Sounds useful!

links <- entrez_link(dbfrom = "pccompound", db = "all", id = "62857")
length(links$pccompound_gds)
# [1] 0

That’s odd. In fact, this query does not even link pccompound to gds:

length(names(links))
# [1] 39
which(names(links) == "pccompound_gds")
# integer(0)

It’s not a rentrez issue, since the same result occurs using the E-utilities URL.
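
For reference, a sketch of how the equivalent E-utilities URLs might be assembled by hand. The base URL is the standard ELink endpoint as I understand it (an assumption, since the post does not quote the URL it used), and elink_url is a hypothetical helper:

```r
# Assumed ELink endpoint; the original post does not state the exact URL.
elink_base <- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"

# Assemble an ELink query URL from the same three parameters passed to entrez_link().
# (hypothetical helper for illustration)
elink_url <- function(dbfrom, db, id) {
  paste0(elink_base, "?dbfrom=", dbfrom, "&db=", db, "&id=", id)
}

url_gds <- elink_url("pccompound", "gds", "62857")  # explicit target database
url_all <- elink_url("pccompound", "all", "62857")  # db=all, the puzzling case

print(url_gds)
print(url_all)
```

Opening the two URLs in a browser returns the raw ELink XML, so the link sets can be compared without rentrez in the loop.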

The good people at rOpenSci have opened an issue, contacting NCBI for clarification. We’ll keep you posted.

Web scraping using Mechanize: PMID to PMCID/NIHMSID

September 17, 2013 / nsaunders / 4 Comments

Web services are great. Pass them a URL. Structured data comes back. Parse it, analyse it, visualise it. Done.

Web scraping – interacting programmatically with a web page – is not so great. It requires more code and when the web page changes, the code breaks. However, in the absence of a web service, scraping is better than nothing. It can even be rather satisfying. Early in my bioinformatics career, the realisation that code, rather than a human, could submit forms and read the results was quite a revelation.

In this post: how to interact with a web page at the NCBI using the Mechanize library.


How to: bulk retrieval of archaeal genome sequences from the NCBI FTP site

May 28, 2013 (updated July 7, 2014) / nsaunders / 1 Comment

While we’re on the topic of mistaking Archaea for Bacteria, here’s an issue with the NCBI FTP site that has long annoyed me and one workaround. Warning: I threw this together minutes ago and it’s not fully tested.

Update July 7 2014: NCBI have changed things, so the code in this post no longer works.


What You're Doing Is Rather Desperate

Notes from the life of a [data] scientist

ncbi
