I recently saw a tweet floating by which included a link to some recent statistics from PubMed Commons, the NCBI service for commenting on scientific articles in PubMed. Perhaps it was this post at their blog. So I thought now would be a good time to write some code to analyse PubMed Commons data.
The tl;dr version: here’s the GitHub repository and the RPubs report.
For further details and some charts, read on.
A recent tweet:
PubMed articles containing “novel” in title or abstract 1845 – 2014
made me think (1) has it really been 5 years, (2) gee, my ggplot skills were dreadful back then and (3) did I really not know how to correct for the increase in total publications?
So here is the update, at GitHub, with a document at RPubs.
“Novel” findings, as judged by the usage of that word in titles and abstracts, really have undergone a startling increase since about 1975. Indeed, almost 7.2% of findings were “novel” in 2014, compared with 3.2% for the whole period 1845 – 2014. That said, if we plot using a log scale, as suggested by Tal on the original post, the rate of increase in usage appears to be slowing down. See image.
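Correcting for the growth in total publications amounts to dividing each year's “novel” count by that year's total before plotting. Here is a minimal sketch in base R, assuming the counts already sit in a data frame; the column names and all the numbers are invented placeholders, not real PubMed data.

```r
# Hypothetical per-year counts; placeholders for illustration only,
# not values retrieved from PubMed.
counts <- data.frame(
  year  = c(1975, 1995, 2014),
  novel = c(100, 4000, 72000),
  total = c(250000, 450000, 1000000)
)

# Percentage of records per year with "novel" in title or abstract.
counts$pct <- 100 * counts$novel / counts$total

# A log10 y-axis makes a slowing growth rate easier to see than a linear one.
plot(counts$year, counts$pct, type = "b", log = "y",
     xlab = "Year", ylab = "% of records containing \"novel\" (log scale)")
```

On a log axis a constant growth rate appears as a straight line, so a curve that flattens suggests the rate is falling even while the raw percentage keeps rising.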
As before, none of this is novel.
I’ve been meaning to write about Entrez Direct, henceforth called edirect, for some time. This tweet provided me with an excuse:
This post is not strictly the answer to that question. Instead we’ll ask: which parent IDs of records for insects in the NCBI Taxonomy database have the most species IDs?
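Whatever tool retrieves the records, the question itself reduces to a group-and-count. A rough sketch in R, assuming the taxonomy records have already been loaded into a data frame; the column names `taxid`, `parent_id` and `rank` are hypothetical and the IDs are made up.

```r
# Hypothetical taxonomy table; IDs and column names are invented for illustration.
taxa <- data.frame(
  taxid     = c(101, 102, 103, 104, 105, 106),
  parent_id = c(10, 10, 10, 20, 20, 30),
  rank      = c("species", "species", "species", "species", "subspecies", "species"),
  stringsAsFactors = FALSE
)

# Keep species-level records only, then count them per parent ID.
species <- taxa[taxa$rank == "species", ]
per_parent <- sort(table(species$parent_id), decreasing = TRUE)

# Parents with the most species come first.
print(head(per_parent))
```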
I am forever returning to PubMed data, downloaded as XML, trying to extract information from it and becoming deeply confused in the process.
Take the seemingly-simple question “how many retracted articles are there in PubMed?”
Back in July, I was complaining about the latest abuse of the word “database” by biologists: the “PDF as database.”
This led to some very productive discussion using PubMed Commons and I’m happy to report that misidentified and contaminated cell lines are now included in the NCBI BioSample database.
As the news release notes, rather alarmingly:
This problem is so common it is thought that thousands of misleading and potentially erroneous papers have been published using cell lines that are incorrectly identified
So it would be useful if there were a direct link between the BioSample record for a cell line and PubMed records in which it was used…
File this one under “has troubled me (and others) for some years now, let’s try to resolve it.”
Let’s use the excellent R/rentrez package to search PubMed for articles that were retracted in 2013.
library(rentrez)
es <- entrez_search("pubmed", "\"Retracted Publication\"[PTYP] 2013[PDAT]", usehistory = "y")
es$count
# [1] 117
117 articles. Now let’s fetch the records in XML format.
xml <- entrez_fetch("pubmed", WebEnv = es$WebEnv, query_key = es$QueryKey,
                    rettype = "xml", retmax = es$count)
Next question: which XML element specifies the “Date of publication” (PDAT)?
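One way to start is simply to list every element in a record that carries a year. The sketch below uses the xml2 package (an assumption; it is not used elsewhere in this post) on a hand-written, heavily trimmed fragment in the shape of a PubMed record. The PMID and dates are invented, and the sketch makes no claim about which element PDAT actually maps to.

```r
library(xml2)

# Hand-written fragment shaped like a PubMed record; PMID and dates are invented.
fragment <- paste0(
  "<PubmedArticle><MedlineCitation>",
  "<PMID>00000000</PMID>",
  "<DateRevised><Year>2013</Year><Month>11</Month><Day>02</Day></DateRevised>",
  "<Article><Journal><JournalIssue>",
  "<PubDate><Year>2013</Year><Month>Mar</Month></PubDate>",
  "</JournalIssue></Journal>",
  "<ArticleDate DateType=\"Electronic\"><Year>2013</Year><Month>02</Month><Day>15</Day></ArticleDate>",
  "</Article></MedlineCitation>",
  "<PubmedData><History>",
  "<PubMedPubDate PubStatus=\"pubmed\"><Year>2013</Year><Month>2</Month><Day>16</Day></PubMedPubDate>",
  "</History></PubmedData></PubmedArticle>"
)

doc <- read_xml(fragment)

# Any element with a <Year> child is a candidate "date" element.
date_nodes <- xml_find_all(doc, "//*[Year]")

dates <- data.frame(
  element = xml_name(date_nodes),
  status  = xml_attr(date_nodes, "PubStatus"),  # NA where the attribute is absent
  year    = xml_text(xml_find_first(date_nodes, "./Year")),
  stringsAsFactors = FALSE
)
print(dates)
```

Even this trimmed record has four different date-bearing elements, which goes some way to explaining the confusion.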
I’ve been complaining about this for years. They fixed it. The NCBI have reorganised their genomes FTP site and finally, Archaea are not lumped in with Bacteria.
Archaea are still included in the ASSEMBLY_BACTERIA directory; hopefully that’s next on the list.
[*] to be fair, they’ve always recognised Archaea – just not in a form that makes downloads convenient
Just a brief technical note.
I figured that for a given compound in PubChem, it would be interesting to know whether that compound had been used in a high-throughput experiment, which you might find in GEO. Very easy using the E-utilities, as implemented in the R package rentrez:
links <- entrez_link(dbfrom = "pccompound", db = "gds", id = "62857")
# 741 links from this compound to gds records
Browsing the rentrez documentation, I note that db can take the value “all”. Sounds useful!
links <- entrez_link(dbfrom = "pccompound", db = "all", id = "62857")
# 0 links to gds records
That’s odd. In fact, this query does not even link pccompound to gds:
#  39
which(names(links) == "pccompound_gds")
It’s not a rentrez issue, since the same result occurs using the E-utilities URL.
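For anyone who wants to check this without rentrez, the equivalent ELink URLs can be put together by hand. A small sketch follows; the helper name `elink_url` is hypothetical, and the request itself is left out because it needs network access.

```r
# Build an E-utilities ELink URL from its three query parameters.
# elink_url is a hypothetical helper, not part of rentrez.
elink_url <- function(dbfrom, db, id) {
  base <- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"
  sprintf("%s?dbfrom=%s&db=%s&id=%s", base, dbfrom, db, id)
}

url_gds <- elink_url("pccompound", "gds", "62857")
url_all <- elink_url("pccompound", "all", "62857")

print(url_gds)
print(url_all)
```

Pasting each URL into a browser returns the raw XML, so the two responses can be compared directly.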
The good people at ropensci have opened an issue, contacting NCBI for clarification. We’ll keep you posted.
Web services are great. Pass them a URL. Structured data comes back. Parse it, analyse it, visualise it. Done.
Web scraping – interacting programmatically with a web page – is not so great. It requires more code and when the web page changes, the code breaks. However, in the absence of a web service, scraping is better than nothing. It can even be rather satisfying. Early in my bioinformatics career the realisation that code, rather than humans, can automate the process of submitting forms and reading the results was quite a revelation.
In this post: how to interact with a web page at the NCBI using the Mechanize library.
While we’re on the topic of mistaking Archaea for Bacteria, here’s an issue with the NCBI FTP site that has long annoyed me and one workaround. Warning: I threw this together minutes ago and it’s not fully tested.
Update July 7 2014: NCBI have changed things so code in this post no longer works