I am forever returning to PubMed data, downloaded as XML, trying to extract information from it and becoming deeply confused in the process.
Take the seemingly-simple question “how many retracted articles are there in PubMed?”
Well, one way is to search for records with the publication type “Retracted Publication”. As of right now, that returns a count of 3550.
    library(rentrez)
    retracted <- entrez_search("pubmed", "\"Retracted Publication\"[PTYP]")
    retracted$count
    # [1] "3550"
Another starting point is retraction notices – the publications which announce retractions. We search for those using the type “Retraction of Publication”.
    retractions <- entrez_search("pubmed", "\"Retraction of Publication\"[PTYP]")
    retractions$count
    # [1] "3769"
So there are more retraction notices than retracted articles. Furthermore, a single retraction notice can refer to more than one retracted article. If we download all retraction notices as PubMed XML (file retractionOf.xml), we see that the retracted articles referred to by a retraction notice are stored under the node named CommentsCorrectionsList:
    <CommentsCorrectionsList>
      <CommentsCorrections RefType="RetractionOf">
        <RefSource>Ochalski ME, Shuttleworth JJ, Chu T, Orwig KE. Fertil Steril. 2011 Feb;95(2):819-22</RefSource>
        <PMID Version="1">20889152</PMID>
      </CommentsCorrections>
    </CommentsCorrectionsList>
There are retraction notices without a CommentsCorrectionsList. Where it is present, there are CommentsCorrections without PMID but always (I think) with RefSource. So we can count up the retracted articles referred to by retraction notices like this:
    library(XML)
    doc.retOf <- xmlTreeParse("retractionOf.xml", useInternalNodes = TRUE)
    ns.retOf <- getNodeSet(doc.retOf, "//MedlineCitation")
    # RefSource of each retracted article, one list element per retraction notice
    sources.retOf <- lapply(ns.retOf, function(x)
      xpathSApply(x, ".//CommentsCorrections[@RefType='RetractionOf']/RefSource", xmlValue))
    # count RefSource per retraction notice - first 10
    head(sapply(sources.retOf, length), 10)
    # [1] 0 1 1 1 1 1 1 1 1 1
    # total RefSource
    sum(sapply(sources.retOf, length))
    # [1] 3898
It appears then that retraction notices refer to 3 898 articles, but only 3 550 records of type “Retracted Publication” are currently indexed in PubMed. Next question: of the PMIDs for retracted articles linked to from retraction notices, how many match the PMIDs found in the downloaded PubMed XML file for all records of type “Retracted Publication” (retracted.xml)?
    # "retracted publication"
    doc.retd <- xmlTreeParse("retracted.xml", useInternalNodes = TRUE)
    pmid.retd <- xpathSApply(doc.retd, "//MedlineCitation/PMID", xmlValue)
    # "retraction of publication"
    pmid.retOf <- lapply(ns.retOf, function(x)
      xpathSApply(x, ".//CommentsCorrections[@RefType='RetractionOf']/PMID", xmlValue))
    # count PMIDs linked to from retraction notice
    sum(sapply(pmid.retOf, length))
    # [1] 3524
    # and how many correspond with "retracted article"
    length(which(unlist(pmid.retOf) %in% pmid.retd))
    # [1] 3524
So there are, apparently, at least 26 (3550 – 3524; see comments) retracted articles that have a PMID which no retraction notice refers to.
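One way to see which records those are, rather than just how many, is a set difference. This is only a sketch: the PMIDs below are invented toy values, reusing the names pmid.retd and pmid.retOf purely for illustration. Note that setdiff() works on unique values, so a PMID cited by more than one notice is counted once, which may be part of why 26 is only a lower bound.

```r
# Toy PMIDs, invented for illustration only (not real PubMed records)
pmid.retd  <- c("1001", "1002", "1003", "1004")              # typed "Retracted Publication"
pmid.retOf <- list(c("1001", "1002"), character(0), "1003")  # PMIDs cited, per notice

# Retracted articles whose PMID no retraction notice refers to
orphans <- setdiff(pmid.retd, unlist(pmid.retOf))
orphans
# [1] "1004"

# PMIDs cited by a notice but absent from the retracted set
unmatched <- setdiff(unlist(pmid.retOf), pmid.retd)
length(unmatched)
# [1] 0
```

With the real files, the same two calls on the vectors built above would list the actual PMIDs to inspect by hand.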
It’s like the old “how long is a piece of string”, isn’t it? To summarise, as of this moment:
- PubMed contains 3 769 retraction notices
- Those notices reference 3 898 sources, of which 3 524 have PMIDs
- A further 26 retracted articles have a PMID not referenced by a retraction notice
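The differences between those counts are easy to mix up, so here they are as plain arithmetic. The variable names are mine, chosen for this sketch; the numbers are the ones reported above.

```r
# Counts reported above (variable names are illustrative only)
n.notices      <- 3769  # retraction notices
n.sources      <- 3898  # RefSource entries in those notices
n.sources.pmid <- 3524  # RefSource entries that also carry a PMID
n.retracted    <- 3550  # records typed "Retracted Publication"

n.retracted - n.sources.pmid  # retracted articles not linked from any notice
# [1] 26
n.sources - n.sources.pmid    # sources listed in a notice without a PMID
# [1] 374
n.sources - n.retracted       # sources in excess of indexed retracted articles
# [1] 348
```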
What do we make of the (3898 – 3550) = 348 articles referenced by a retraction notice, but not indexed by PubMed? Could they be in journals that were not indexed when the article was published, but indexing began prior to publication of the retraction notice?
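A first step towards answering that might be to list the RefSource strings that come without a PMID and look at which journals turn up. The sketch below assumes the XML package and runs on a tiny invented document shaped like the PubMed XML shown earlier; the authors, journals and PMIDs in it are made up.

```r
library(XML)

# Invented example: one notice citing two articles, only one with a PMID
xml.text <- '<PubmedArticleSet>
  <PubmedArticle><MedlineCitation>
    <PMID Version="1">111</PMID>
    <CommentsCorrectionsList>
      <CommentsCorrections RefType="RetractionOf">
        <RefSource>Author A. Journal One. 2001 Jan;1(1):1-2</RefSource>
        <PMID Version="1">222</PMID>
      </CommentsCorrections>
      <CommentsCorrections RefType="RetractionOf">
        <RefSource>Author B. Journal Two. 2002 Feb;2(2):3-4</RefSource>
      </CommentsCorrections>
    </CommentsCorrectionsList>
  </MedlineCitation></PubmedArticle>
</PubmedArticleSet>'

doc <- xmlParse(xml.text, asText = TRUE)

# RefSource of every RetractionOf entry that has no PMID child
no.pmid <- xpathSApply(doc,
  "//CommentsCorrections[@RefType='RetractionOf'][not(PMID)]/RefSource",
  xmlValue)
no.pmid
# [1] "Author B. Journal Two. 2002 Feb;2(2):3-4"
```

Pointing the same XPath at retractionOf.xml should return the sources without a PMID, which could then be tabulated by journal.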
You can see from all this that linking retraction notices with the associated retracted articles is not easy. And if you want to do interesting analyses such as time to retraction – well, don’t even get me started on PubMed dates…