PubMed Publication Date: what is it, exactly?

File this one under “has troubled me (and others) for some years now, let’s try to resolve it.”

Let’s use the excellent R/rentrez package to search PubMed for articles that were retracted in 2013.

library(rentrez)
library(XML)  # provides xmlTreeParse() and xpathSApply(), used below

es <- entrez_search("pubmed", "\"Retracted Publication\"[PTYP] 2013[PDAT]", usehistory = "y")
es$count
# [1] 117

117 articles. Now let’s fetch the records in XML format.

xml <- entrez_fetch("pubmed", WebEnv = es$WebEnv, query_key = es$QueryKey, 
                    rettype = "xml", retmax = es$count)

Next question: which XML element specifies the “Date of publication” (PDAT)?

To make a long story short: there are several nodes in PubMed XML that contain the word “Date”, but the one which looks most promising is named PubDate. Given that our search used the year (2013), you might think that years can be extracted using the XPath expression //PubDate/Year. You would be mostly, but not entirely right.

doc <- xmlTreeParse(xml, useInternalNodes = TRUE)
table(xpathSApply(doc, "//PubDate/Year", xmlValue))
# 2013 2014 
#  111    2 

Well, that’s confusing. Not only do we get just 113 years rather than the expected 117, but two of them have the value 2014. Time to delve deeper into the nodes under PubDate.

children <- xpathSApply(doc, "//PubDate", xmlChildren)
table(names(unlist(children)))

#         Day MedlineDate       Month        Year 
#          25           4          87         113 

table(xpathSApply(doc, "//PubDate/MedlineDate", xmlValue))

# 2013 Jan-Mar 2013 May-Jun 2013 Nov-Dec 2013 Oct-Dec 
#            1            1            1            1 

Interesting. So 113 records have //PubDate/Year, while the other 4 have a node named //PubDate/MedlineDate instead, which accounts for all 117.
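To get one year per record regardless of which child node is present, something like the following might do. It is only a sketch: the two-record sample XML is made up to mimic the structure seen above, pub_year is a hypothetical helper name, and it assumes that the first four-digit run in MedlineDate is the year.

```r
library(XML)

# Hypothetical two-record sample mimicking the PubMed XML structure
sample_xml <- '<PubmedArticleSet>
  <PubmedArticle><MedlineCitation><PMID>1</PMID><Article><Journal><JournalIssue>
    <PubDate><Year>2013</Year><Month>Apr</Month></PubDate>
  </JournalIssue></Journal></Article></MedlineCitation></PubmedArticle>
  <PubmedArticle><MedlineCitation><PMID>2</PMID><Article><Journal><JournalIssue>
    <PubDate><MedlineDate>2013 Jan-Mar</MedlineDate></PubDate>
  </JournalIssue></Journal></Article></MedlineCitation></PubmedArticle>
</PubmedArticleSet>'

sample_doc <- xmlTreeParse(sample_xml, asText = TRUE, useInternalNodes = TRUE)

# For each PubDate node: use Year if present, otherwise take the first
# four-digit run found in MedlineDate (assumed to be the year)
pub_year <- function(node) {
  kids <- xmlChildren(node)
  if ("Year" %in% names(kids)) {
    return(xmlValue(kids[["Year"]]))
  }
  if ("MedlineDate" %in% names(kids)) {
    md <- xmlValue(kids[["MedlineDate"]])
    m <- regmatches(md, regexpr("[0-9]{4}", md))
    if (length(m) == 1) return(m)
  }
  NA_character_
}

years <- xpathSApply(sample_doc, "//PubDate", pub_year)
table(years)
```

Applied to the real doc object, the same function should return one value per PubDate node, so Year and MedlineDate records end up in a single table.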

It’s also possible to retrieve records in docsum format, which is also XML but with a different structure. Here, PubDate is the value of the Name attribute of an Item node.

ds <- entrez_fetch("pubmed", WebEnv = es$WebEnv, query_key = es$QueryKey,
                   rettype = "docsum", retmax = es$count)
ds.doc <- xmlTreeParse(ds, useInternalNodes = TRUE)
table(xpathSApply(ds.doc, "//Item[@Name='PubDate']", xmlValue))

#         2013     2013 Apr   2013 Apr 1   2013 Apr 2     2013 Aug  2013 Aug 15  2013 Aug 29 
#           23            7            1            1            2            2            1 
#     2013 Dec   2013 Dec 1     2013 Feb  2013 Feb 26   2013 Feb 7     2013 Jan  2013 Jan 24 
#            3            1            6            1            1           10            2 
#   2013 Jan 3  2013 Jan 30   2013 Jan 7 2013 Jan-Mar     2013 Jul  2013 Jul 25     2013 Jun 
#            1            1            1            1            4            1            3 
#  2013 Jun 18   2013 Jun 5   2013 Jun 7     2013 Mar   2013 Mar 1  2013 Mar 12  2013 Mar 28 
#            1            1            1            5            1            1            1 
#   2013 Mar 9     2013 May   2013 May 1  2013 May 29   2013 May 6   2013 May 8   2013 May 9 
#            1            4            3            1            1            2            1 
#     2013 Nov 2013 Nov-Dec     2013 Oct 2013 Oct-Dec     2013 Sep  2013 Sep 30     2014 Feb 
#            8            1            2            1            5            1            1 
#     2014 Jan 
#            1 

A fair old mix of formats in there then, and still the issue of the 2014 years when we searched for PDAT = 2013. We can split on space to get years:

yr <- xpathSApply(ds.doc, "//Item[@Name='PubDate']", function(x) strsplit(xmlValue(x), " ")[[1]][1])
which(yr == "2014")
# [1] 16 26

And examine records 16 and 26:

xmlRoot(ds.doc)[[16]] # complete output not shown
# <DocSum>
#   <Id>24156249</Id>
#   <Item Name="PubDate" Type="Date">2014 Jan</Item>
#   <Item Name="EPubDate" Type="Date">2013 Oct 25</Item>

xmlRoot(ds.doc)[[26]] # complete output not shown
# <DocSum>
#   <Id>24001238</Id>
#   <Item Name="PubDate" Type="Date">2014 Feb</Item>
#   <Item Name="EPubDate" Type="Date">2013 Sep 4</Item>

Not every record has EPubDate. Is it simply the case that where it exists and is earlier than PubDate, then EPubDate == PDAT?

So we haven’t really resolved very much, have we?

  • we started with the Entrez search term PDAT (Date of publication)
  • both PubMed XML and DocSum contain something called PubDate
  • in the former case, most child node names = Year, but some = MedlineDate
  • we retrieve some records where PubDate year = 2014, even when searching for 2013[PDAT]

It appears that PDAT does not map consistently to any XML node in either XML or DocSum formats. It might be derived from (1) EPubDate, where that exists and is earlier than PubDate, or (2) PubDate, where EPubDate does not exist.
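That guess can at least be written down as code. What follows is a minimal sketch, not a statement of how PubMed actually computes PDAT. The helper names to_date and pdat_guess are hypothetical, and so is the third record in the sample; the first two are the records shown above. Dates with a missing month or day are assumed to fall on the first, and a range such as "2013 Jan-Mar" is represented by its first month.

```r
# Convert a PubMed date string such as "2013", "2014 Jan", "2013 Oct 25"
# or "2013 Jan-Mar" to a Date. A missing month or day defaults to the
# first; a month range uses its first month. Empty or unrecognised
# strings (seasons, for example) give NA.
to_date <- function(x) {
  if (is.na(x) || !nzchar(x)) return(as.Date(NA))
  parts <- strsplit(x, " ", fixed = TRUE)[[1]]
  year <- parts[1]
  mon <- if (length(parts) >= 2) match(substr(parts[2], 1, 3), month.abb) else 1L
  day <- if (length(parts) >= 3) suppressWarnings(as.integer(parts[3])) else 1L
  if (!grepl("^[0-9]{4}$", year) || is.na(mon) || is.na(day)) return(as.Date(NA))
  as.Date(sprintf("%s-%02d-%02d", year, mon, day), format = "%Y-%m-%d")
}

# The guess: use EPubDate where it exists and is earlier than PubDate,
# otherwise fall back to PubDate
pdat_guess <- function(pubdate, epubdate) {
  p <- to_date(pubdate)
  e <- to_date(epubdate)
  if (!is.na(e) && (is.na(p) || e < p)) e else p
}

recs <- data.frame(
  id       = c("24156249", "24001238", "hypothetical"),
  PubDate  = c("2014 Jan", "2014 Feb", "2013 Jan-Mar"),
  EPubDate = c("2013 Oct 25", "2013 Sep 4", ""),
  stringsAsFactors = FALSE
)

# Year implied by the guess for each record
recs$guess_year <- mapply(
  function(p, e) format(pdat_guess(p, e), "%Y"),
  recs$PubDate, recs$EPubDate, USE.NAMES = FALSE
)
recs
```

If the guess holds, every record returned by the 2013[PDAT] search should get a guess_year of 2013. To try it on the real data, the PubDate and EPubDate columns could be filled from ds.doc with the same //Item[@Name='...'] XPath pattern used earlier, assuming both expressions return one value per record.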

2 thoughts on “PubMed Publication Date: what is it, exactly?”

  1. You might want to check out my Python MEDLINE tool (no need to know much of Python to use it) to get easier access to MEDLINE fields – although PubDate is a text field and would have to be text-mined to be useful beyond grep-ing the year. I once tried to set up a few RegEx patterns to work with it, but it turned out to be a headache: seasons, months, numeric references in Roman numerals, you name it – it’s probably in there…

    https://pypi.python.org/pypi/medic/

  2. I think PDAT includes both dates from the XML, so you could add NOT 2014[PPDAT] to the query, but it is confusing. This is from PubMed help at http://www.ncbi.nlm.nih.gov/books/NBK3827/#pubmedhelp.Publication_Date_DP

    -If an article is published electronically and in print on different dates both dates are searchable and may be included on the citation prefaced with an Epub or Print label. The electronic date will not be searchable if it is later than the print date, except when range searching.
    -To search for electronic dates only use the search tag [EPDAT], for print dates only tag with [PPDAT]
