Just how many retracted articles are there in PubMed anyway?

I am forever returning to PubMed data, downloaded as XML, trying to extract information from it and becoming deeply confused in the process.

Take the seemingly-simple question “how many retracted articles are there in PubMed?”

Well, one way to answer it is to search for records with the publication type “Retracted Publication”. As of right now, that returns a count of 3550.

library(rentrez)

retracted <- entrez_search("pubmed", "\"Retracted Publication\"[PTYP]")
retracted$count
# [1] "3550"

Another starting point is retraction notices – the publications which announce retractions. We search for those using the type “Retraction of Publication”.

retractions <- entrez_search("pubmed", "\"Retraction of Publication\"[PTYP]")
retractions$count
# [1] "3769"

So there are more retraction notices than retracted articles. Furthermore, a single retraction notice can refer to more than one retracted article. If we download all retraction notices as PubMed XML (file retractionOf.xml), we see that the retracted articles referred to by a retraction notice are stored under the node named CommentsCorrectionsList:

        <CommentsCorrectionsList>
            <CommentsCorrections RefType="RetractionOf">
                <RefSource>Ochalski ME, Shuttleworth JJ, Chu T, Orwig KE. Fertil Steril. 2011 Feb;95(2):819-22</RefSource>
                <PMID Version="1">20889152</PMID>
            </CommentsCorrections>
        </CommentsCorrectionsList>
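
How the XML file gets downloaded isn't shown here. One possible route, sketched under assumptions, is to rerun the search with rentrez's web history and fetch the records in batches. The helper name fetch_pubmed_xml, the batch size and the file-naming scheme are all hypothetical; each batch is written to its own file because each response is a complete XML document in its own right, so the batches are not simply pasted together into one file.

```r
# Hypothetical helper (not from the original post): download every PubMed
# record matching `term` as XML, one file per batch.
fetch_pubmed_xml <- function(term, prefix, batch = 500) {
  # use_history = TRUE stores the result set on the NCBI server
  res <- rentrez::entrez_search("pubmed", term, use_history = TRUE)
  n <- as.integer(res$count)
  if (n == 0) return(character(0))

  files <- character(0)
  # retstart is a zero-based offset into the result set
  for (s in seq(0, n - 1, by = batch)) {
    xml <- rentrez::entrez_fetch("pubmed", web_history = res$web_history,
                                 rettype = "xml", retstart = s, retmax = batch)
    f <- sprintf("%s_%05d.xml", prefix, s)
    writeLines(xml, f)
    files <- c(files, f)
  }
  files
}

# Example call (needs network access):
# fetch_pubmed_xml("\"Retraction of Publication\"[PTYP]", "retractionOf")
```

Each resulting file could then be parsed separately and the per-file counts added up.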

Some retraction notices have no CommentsCorrectionsList at all. Where the list is present, some CommentsCorrections entries lack a PMID, but they always (I think) have a RefSource. So we can count up the retracted articles referred to by retraction notices like this:

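library(XML)

# parse the retraction notices; one MedlineCitation node per notice
doc.retOf <- xmlTreeParse("retractionOf.xml", useInternalNodes = TRUE)
ns.retOf <- getNodeSet(doc.retOf, "//MedlineCitation")
# RefSource of each retracted article referred to, per notice
sources.retOf <- lapply(ns.retOf, function(x) xpathSApply(x, ".//CommentsCorrections[@RefType='RetractionOf']/RefSource", xmlValue))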

# count RefSource per retraction notice - first 10
head(sapply(sources.retOf, length), 10)
# [1] 0 1 1 1 1 1 1 1 1 1

# total RefSource
sum(sapply(sources.retOf, length))
# [1] 3898
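
To make the counting logic concrete, here is a minimal sketch on a toy document. The two records are invented for illustration and are not real PubMed data, and count.child is a hypothetical helper. One notice has no CommentsCorrectionsList; the other retracts two articles, only one of which has a PMID. The last query is one way to test the "always (I think) with RefSource" assumption.

```r
library(XML)

# Toy data, invented for illustration: notice 1001 has no
# CommentsCorrectionsList; notice 1002 retracts two articles,
# only one of which carries a PMID.
toy <- '<PubmedArticleSet>
  <PubmedArticle><MedlineCitation>
    <PMID Version="1">1001</PMID>
  </MedlineCitation></PubmedArticle>
  <PubmedArticle><MedlineCitation>
    <PMID Version="1">1002</PMID>
    <CommentsCorrectionsList>
      <CommentsCorrections RefType="RetractionOf">
        <RefSource>Author A. J Toy. 2001;1(1):1-2</RefSource>
        <PMID Version="1">9001</PMID>
      </CommentsCorrections>
      <CommentsCorrections RefType="RetractionOf">
        <RefSource>Author B. J Toy. 2002;2(1):3-4</RefSource>
      </CommentsCorrections>
    </CommentsCorrectionsList>
  </MedlineCitation></PubmedArticle>
</PubmedArticleSet>'

doc.toy <- xmlParse(toy, asText = TRUE)
ns.toy <- getNodeSet(doc.toy, "//MedlineCitation")

# Hypothetical helper: count children named `child` under the
# RetractionOf entries of a single notice
count.child <- function(node, child) {
  path <- paste0(".//CommentsCorrections[@RefType='RetractionOf']/", child)
  length(getNodeSet(node, path))
}

n.sources <- sapply(ns.toy, count.child, child = "RefSource")
n.pmids <- sapply(ns.toy, count.child, child = "PMID")

# RetractionOf entries with no RefSource at all
n.noSource <- length(getNodeSet(doc.toy,
  "//CommentsCorrections[@RefType='RetractionOf'][not(RefSource)]"))

n.sources
# [1] 0 2
n.pmids
# [1] 0 1
n.noSource
# [1] 0
```

Running the same not(RefSource) query against the real file would show whether every entry really does have a RefSource.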

It appears then that retraction notices refer to 3 898 articles, but only 3 550 records of type “Retracted Publication” are currently indexed in PubMed. Next question: of the PMIDs for retracted articles linked to from retraction notices, how many match the PMIDs in the downloaded PubMed XML file of all “Retracted Publication” records (retracted.xml)?

# "retracted publication"
doc.retd <- xmlTreeParse("retracted.xml", useInternalNodes = TRUE)
pmid.retd <- xpathSApply(doc.retd, "//MedlineCitation/PMID", xmlValue)
# "retraction of publication"
pmid.retOf <- lapply(ns.retOf, function(x) xpathSApply(x, ".//CommentsCorrections[@RefType='RetractionOf']/PMID", xmlValue))

# count PMIDs linked to from retraction notice
sum(sapply(pmid.retOf, length))
# [1] 3524

# and how many correspond with a "retracted publication" record
sum(unlist(pmid.retOf) %in% pmid.retd)
# [1] 3524

So there are, apparently, at least 26 (3550 – 3524) retracted articles that have a PMID which no retraction notice refers to (see comments).
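
One caveat with subtracting counts: if the same PMID is linked from more than one retraction notice, it is counted more than once on the notice side, and the subtraction then understates the number of unmatched articles. Whether that actually happens in the real data is something to check rather than assume. A minimal sketch with invented PMIDs shows the difference between subtracting counts and taking a set difference:

```r
# Toy PMIDs, invented for illustration
pmid.retd.toy <- c("11", "12", "13", "14")                # "retracted publication" records
pmid.retOf.toy <- list("11", "12", "12", character(0))    # PMIDs linked from four notices

linked <- unlist(pmid.retOf.toy)

# subtracting counts: 4 records minus 3 links
length(pmid.retd.toy) - length(linked)
# [1] 1

# set difference: records whose PMID appears in no notice
setdiff(pmid.retd.toy, linked)
# [1] "13" "14"

# links that repeat an earlier PMID
sum(duplicated(linked))
# [1] 1
```

Here the subtraction says 1 while the set difference finds 2, because PMID "12" is linked twice.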

In summary
It’s like the old “how long is a piece of string”, isn’t it? To summarise, as of this moment:

  • PubMed contains 3 769 retraction notices
  • Those notices reference 3 898 sources, of which 3 524 have PMIDs
  • At least 26 retracted articles have a PMID not referenced by any retraction notice (see comments)

What do we make of the (3898 – 3550) = 348 articles referenced by a retraction notice, but not indexed by PubMed? Could they be in journals that were not indexed when the article was published, but indexing began prior to publication of the retraction notice?

You can see from all this that linking retraction notices with the associated retracted articles is not easy. And if you want to do interesting analyses such as time to retraction – well, don’t even get me started on PubMed dates…

3 thoughts on “Just how many retracted articles are there in PubMed anyway?”

    • That would be interesting to know and is on the to-do list, when I have a spare moment. This is not my day job, unfortunately…

      EDIT: now that I think about it, the correct number is probably not (3550 – 3524) = 26. What we’d like to know is which of the “retracted publication” PMIDs are not contained in the “retraction of” list. It seems that there are 61 such PMIDs.

      setdiff(pmid.retd, unlist(pmid.retOf))
      [1] "25654141" "24849946" "24438873" "24294399" "24251390" "24237818" "24129711" "22207911" "22099836" "21291296" "21225599" "20697107" "20633622" "20588141" "20462659" "20302847" "20112019"
      [18] "19995902" "19817692" "19797702" "19484686" "19481479" "19455077" "18974775" "18852117" "18783367" "18463039" "18057596" "18025093" "17958618" "17975671" "17374986" "17327231" "17123960"
      [35] "16890687" "16862580" "16487061" "16468883" "16243185" "15961411" "15703780" "15284244" "15215246" "15170272" "15032629" "12968031" "12796562" "12517745" "12466290" "12424250" "12141316"
      [52] "12077412" "11696542" "11565755" "11418471" "11007230" "10993888" "10490643" "10449788" "9096355" "367632"
