Analysis of retractions in PubMed

As so often happens these days, a brief post at FriendFeed got me thinking about data analysis. Entitled “So how many retractions are there every year, anyway?”, the post links to this article at Retraction Watch. It discusses ways to estimate the number of retractions and in particular, a recent article in the Journal of Medical Ethics (subscription only, sorry) which addresses the issue.

As Christina pointed out in a comment at Retraction Watch, there are thousands of scientific journals of which PubMed indexes only a fraction. However, PubMed is relatively easy to analyse using a little Ruby and R. So, here we go…

Code and raw data used for this post are available at GitHub.

1. Searching for retractions
In the Journal of Medical Ethics article, the authors state: “Every research paper noted as retracted in the PubMed database from 2000 to 2010 was evaluated. PubMed was searched on 22 January 2010 with the limits of ‘items with abstracts, retracted publication, English.’ A total of 788 retracted papers were identified…”

Not a bad approach. There’s another way: at the PubMed website, find a retraction and examine the record in XML format. You’ll see this:

  <PublicationType>Retraction of Publication</PublicationType>

The equivalent in Medline format is:

PT  - Retraction of Publication

This means that retraction notices carry a particular value in the Publication Type field, or PTYP for short. If you search at the PubMed website using the term “Retraction of Publication[Publication Type]”, you will retrieve (at the time of writing) 1621 records.
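The same search can be run programmatically. Here is a minimal Ruby sketch, assuming the NCBI E-utilities ESearch endpoint with `rettype=count`; the helper names `esearch_count_url` and `parse_count` are mine, not from the code at GitHub, and the sample response is illustrative rather than real PubMed output.

```ruby
require "uri"

# Base URL of the NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Build an ESearch URL that asks PubMed only for the number of matching records.
# (hypothetical helper name)
def esearch_count_url(term)
  query = URI.encode_www_form(db: "pubmed", term: term, rettype: "count")
  "#{ESEARCH_URL}?#{query}"
end

# Pull the first <Count> value out of an ESearch XML response (nil if absent).
# (hypothetical helper name)
def parse_count(xml)
  match = xml.match(%r{<Count>(\d+)</Count>})
  match && match[1].to_i
end

puts esearch_count_url("Retraction of Publication[Publication Type]")

# Illustrative response body using the count quoted above, not a live result.
sample_response = "<eSearchResult><Count>1621</Count></eSearchResult>"
puts parse_count(sample_response)
```

Fetching the URL (for example with `Net::HTTP.get(URI(url))`) is left out so that the sketch runs without a network connection.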

2. Retrieving retraction counts by year
Armed with this information, we can modify the Ruby code that I’ve posted previously to retrieve total and retracted publications between 1900 and 2010. This generates a tab-delimited file with 3 columns: year, total publications and retracted publications.
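The shape of that retrieval can be sketched as follows. This is not the previously posted code, just a minimal illustration: it assumes the PubMed field tags `[dp]` (date of publication) and `[ptyp]` (publication type), and `year_terms`, `yearly_rows` and the fake counter are hypothetical names with made-up numbers.

```ruby
# Search terms for one year: all records, and retraction notices only.
# [dp] and [ptyp] are PubMed search field tags.
def year_terms(year)
  total = "#{year}[dp]"
  retracted = "Retraction of Publication[ptyp] AND #{year}[dp]"
  [total, retracted]
end

# Build tab-delimited lines: year, total publications, retracted publications.
# `counter` is any callable that maps a search term to a record count.
def yearly_rows(years, counter)
  years.map do |year|
    total_term, retracted_term = year_terms(year)
    [year, counter.call(total_term), counter.call(retracted_term)].join("\t")
  end
end

# Stand-in counter with invented numbers so the sketch runs offline.
fake_counts = {
  "1977[dp]" => 1000,
  "Retraction of Publication[ptyp] AND 1977[dp]" => 2
}
puts yearly_rows([1977, 1978], ->(term) { fake_counts.fetch(term, 0) })
```

In real use the callable would query PubMed for each term; passing it in as an argument keeps the file-building logic separate from the network code.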

3. Retraction count analysis
Here’s the R code to analyse the retraction counts. There are no recorded retractions until 1977, so we’ll start from that year.

First, a simple plot of retractions for each year.

So, retractions are increasing rapidly. No surprise there, since the total number of publications per year is also increasing rapidly. We need some kind of normalization.


PubMed retractions 1977 - 2010

Chris got there first with this graphic, showing retractions each year per 100 000 publications. Here’s my version.

Indeed, it seems that with each year, retractions constitute a greater proportion of publications for that year.


PubMed retractions 1977 - 2010 (per 100K by year)
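The normalization itself is a one-liner. A minimal Ruby sketch (the post's analysis was done in R), where `per_100k` is a hypothetical helper and the sample rows are invented for illustration:

```ruby
# Retractions per 100,000 publications for one year.
# Returns 0.0 when there are no publications, to avoid dividing by zero.
def per_100k(retracted, total)
  return 0.0 if total.zero?
  retracted.to_f / total * 100_000
end

# Rows are [year, total, retracted]; the numbers are made up.
sample_rows = [[2000, 500_000, 50], [2001, 520_000, 78]]
sample_rows.each do |year, total, retracted|
  puts format("%d\t%.1f", year, per_100k(retracted, total))
end
```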

Another way to examine the trend is to use the cumulative sum of both total publications and retractions over time. In other words, for each year, instead of looking at the numbers for just that year, we look at the total records accumulated in PubMed to date. Here’s that plot.

This shows a smoother upwards trend, with a rapid increase from 2005 onwards.


PubMed retractions 1977 - 2010 (per 100K, cumulative)
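The cumulative version replaces each yearly count with a running total before computing the rate. A minimal Ruby sketch under that reading, with hypothetical helper names and invented numbers:

```ruby
# Running totals of a numeric series, e.g. [100, 150, 200] -> [100, 250, 450].
def cumulative(values)
  sum = 0
  values.map { |v| sum += v }
end

# Cumulative retractions per 100,000 cumulative publications, year by year.
def cumulative_rate_per_100k(totals, retracted)
  cumulative(retracted).zip(cumulative(totals)).map do |r, t|
    r.to_f / t * 100_000
  end
end

yearly_totals = [100, 150, 200]   # made-up yearly publication counts
yearly_retracted = [1, 0, 4]      # made-up yearly retraction counts
p cumulative(yearly_totals)
p cumulative_rate_per_100k(yearly_totals, yearly_retracted)
```

Because each point averages over every earlier year, single-year spikes are damped, which is one reason the cumulative curve looks smoother.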

Finally, we can compare the growth rate of total and retracted publications. One way to do this is to choose 1977 as the baseline and for each year, calculate the percentage increase in both publication types relative to 1977. Here’s the result.

This is somewhat alarming. Whilst there are about 4x as many total publications in PubMed now as there were in 1977, the total number of retractions has risen almost 550x.


Percent increase relative to 1977, cumulative
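For reference, the percentage increase relative to a baseline is (value - baseline) / baseline × 100, so a 4x rise corresponds to a 300% increase. A minimal Ruby sketch, assuming the first element of the series is the baseline year; `percent_increase` is a hypothetical helper and the input is made up:

```ruby
# Percentage increase of each value relative to the first (baseline) value.
def percent_increase(series)
  baseline = series.first.to_f
  raise ArgumentError, "baseline must be non-zero" if baseline.zero?
  series.map { |v| (v - baseline) / baseline * 100 }
end

# A series that doubles and then quadruples relative to its baseline.
p percent_increase([2, 4, 8])
```

Applying the same function to the total and retraction series puts both on a common scale, which is what makes the two growth rates comparable.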

4. Analysis of Medline data
Using the search term described earlier in the post to retrieve retractions, we can download a file in Medline format. Medline records contain various fields of interest, including the ROF (retraction of) line, describing the publication that was retracted.

Or – as it turns out in some cases – publications. One retraction record may include the retraction of several publications, as we can see with a simple grep:

grep -c "^PMID" retractions.medline && grep -c "^ROF" retractions.medline

We won’t worry about that too much, since the majority of retraction records reference one publication.
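To see which records retract more than one publication, rather than just comparing two totals, the ROF lines can be counted per PMID. A minimal Ruby sketch, assuming the usual Medline layout in which each field starts with a tag padded to four characters followed by "- "; `rof_counts` is a hypothetical helper and the sample records are invented.

```ruby
# Count ROF (retraction of) lines per record, keyed by PMID.
def rof_counts(medline_text)
  counts = {}
  current = nil
  medline_text.each_line do |line|
    if line.start_with?("PMID- ")
      current = line.split("-", 2).last.strip
      counts[current] = 0
    elsif line.start_with?("ROF - ") && current
      counts[current] += 1
    end
  end
  counts
end

# Two invented records: the second retracts two publications.
sample_medline = <<~MEDLINE
  PMID- 1
  PT  - Retraction of Publication
  ROF - Hypothetical A, et al. J Example. 2001;1:1-2

  PMID- 2
  PT  - Retraction of Publication
  ROF - Hypothetical B, et al. J Example. 2002;2:3-4
  ROF - Hypothetical C, et al. J Example. 2003;3:5-6
MEDLINE

counts = rof_counts(sample_medline)
# Keep only the records that retract more than one publication.
p counts.select { |_pmid, n| n > 1 }
```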

Here is some R code that performs two simple, similar analyses of the Medline file. First, the top 10 journals for retractions:

                            so Freq
667   Proc Natl Acad Sci U S A   54
707                    Science   52
590                     Nature   42
388                J Biol Chem   32
450                  J Immunol   28
157                       Cell   20
92  Biochem Biophys Res Commun   16
116                      Blood   16
413              J Clin Invest   15
566              Mol Cell Biol   15

A brief glance at that list suggests that a higher impact factor goes with more retractions. To make more sense of that, we would want to know the total number of publications for each of those journals.

Second, the top 10 countries. Note – this is country of publication:

              pl Freq
45 united states  856
12       england  373
28   netherlands   83
15       germany   47
23         japan   42
6          china   25
2      australia   19
24 korea (south)   19
10       denmark   17
42   switzerland   14

Not especially surprising: these are the countries with the most researchers and scientific output. Again, we’d want more data before drawing any conclusions.
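Both tallies follow the same pattern: pull one field out of every record and count the values. Here is a minimal Ruby sketch of that pattern (the post's version was written in R). It assumes the Medline tags TA (journal title abbreviation) and PL (place of publication); the table above is labelled "so", so the original analysis may have derived the journal name from the SO field instead. `field_tally` is a hypothetical helper and the sample data is invented.

```ruby
# Collect the values of one Medline field tag (e.g. "TA" or "PL") and
# count how often each occurs, most frequent first (ties sorted by name).
def field_tally(medline_text, tag)
  prefix = tag.ljust(4) + "- "
  values = medline_text.each_line
                       .select { |line| line.start_with?(prefix) }
                       .map { |line| line.delete_prefix(prefix).strip }
  values.tally.sort_by { |value, n| [-n, value] }
end

# Three invented records.
sample_records = <<~MEDLINE
  PMID- 1
  TA  - Science
  PL  - United States

  PMID- 2
  TA  - Nature
  PL  - England

  PMID- 3
  TA  - Science
  PL  - United States
MEDLINE

p field_tally(sample_records, "TA").first(10)
p field_tally(sample_records, "PL").first(10)
```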

Final thoughts

  • Analysis of all kinds of data from PubMed is relatively straightforward. As to the factors underlying the recent rise in retractions: the JME article focuses on fraud. Your thoughts are welcome.
  • It strikes me that it would be relatively easy to build a web application (Rails, Heroku), which constantly monitors retraction data at PubMed and generates a variety of statistics and charts.
  • The post at Retraction Watch lists a variety of estimates for numbers of retractions: 328 from 1995-2004, 529 from 1988-2008 and, most amusingly, 95 in 2008 for the entire Thomson Reuters Science Citation Index. Given that there are 237 records in PubMed alone for 2008, you have to wonder what the Times Higher Education Supplement paid for the latter study. And people wonder why we don’t trust impact factors.

11 thoughts on “Analysis of retractions in PubMed”

  1. Just another nice example of how much we can improve how we do science if we properly record things. Here it is PubMed doing the good work. Would it be hard to normalize the Top 10 retractions by country by output for those countries? That would reflect better which countries are more stressed by publish-or-perish.

  3. Thanks for sharing this analysis; as always, a very concise use of existing tools and data to answer a relevant question. I like the idea of a web app that automatically updates based on mining PubMed data. I would be interested to see retractions by field or topic. What would it indicate, though: the level of rubbish published in a field, or the complexity of that field…

    • That’s always the issue with data – what does it actually mean? I guess retractions occur for many reasons, the commonest being “someone messed up”. Can’t see a way to extract the reasons, other than reading the full retraction and classifying manually.

  4. Hmm, I wonder if there is a chart with the fraction of TT positions. It’s probably declining in sync with the increase in retractions. Correlation or causality?

  5. Hello Neil,
    Thanks for this analysis, very insightful and exciting stuff! I was surprised to see the Netherlands making an appearance in the top 3 of countries. In the raw data I saw that many publications listed as coming from the Netherlands, actually come from other countries (USA, China, India). e.g. . Do you have any idea what is causing this discrepancy?


  6. My colleague and I have published on this topic and found that the majority of retractions were not because of fraud. A brief publication appeared in JAMA and a full study in the Journal of the Medical Library Association. I am now retired and have discontinued this research.
