Searching for duplicate resource names in PMC article titles

I enjoyed this article by Keith Bradnam, and the associated tweets, on the problem of duplicated names for bioinformatics software.

I figured that to some degree at least, we should be able to search for such instances, since the titles of published articles that describe software often follow a particular pattern. There may even be a grammatical term for it, but I’ll call it the announcement colon:

eDuS: Segmental Duplication Simulator
Reveel: large-scale population genotyping using low-coverage sequencing data
RNF: a general framework to evaluate NGS read mappers
Hammock: A Hidden Markov model-based peptide clustering algorithm to identify protein-interaction consensus motifs in large datasets

You get the idea. “XXX COLON a [METHOD] to [DO SOMETHING] using [SOME DATA].”
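
To make the pattern concrete, here’s a rough Ruby sketch of splitting a title at its first colon. The helper name split_announcement is my own invention for illustration, and the rule (anything before the first colon counts as the name) is deliberately naive.

```ruby
# Split a title at its first colon into [name, description].
# Returns nil when there is no colon, or nothing before it.
def split_announcement(title)
  flat = title.gsub(/\s+/, " ").strip        # collapse line breaks and runs of spaces
  m = flat.match(/\A([^:]+):\s*(.*)\z/)      # name, colon, rest of the title
  m && [m[1].strip, m[2]]
end

titles = [
  "eDuS: Segmental Duplication Simulator",
  "RNF: a general framework to evaluate NGS read mappers",
  "A title with no colon at all"
]

titles.each do |t|
  name, rest = split_announcement(t)
  puts name ? "#{name}\t#{rest}" : "(no announcement colon)\t#{t}"
end
```

Plenty of ordinary titles will pass this test too, which is why the counting step later on matters.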

Let’s go in search of announcement colons, using titles from the PubMed Central dataset. You can find this mini-project on GitHub.

1. Download PMC data
I use wget. The compressed archives are still quite large (~ 3-5 GB), so this may take some time.

wget ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/articles.A-B.tar.gz
wget ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/articles.C-H.tar.gz
wget ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/articles.I-N.tar.gz
wget ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/articles.O-Z.tar.gz

find ./ -name "*.tar.gz" -exec tar zxvf {} \;

2. Parse the titles
Now, of course there will be many article titles that contain a colon and have nothing to do with software names. We’ll worry about that later when we start counting things.

Quick and dirty Ruby code to 1. open and parse a PMC XML file; 2. extract PMC ID and title; 3. print out those titles starting with “anything followed by a colon”. It’s not the best way to generate tab-delimited output, but it works. Note that titles in PMC XML can contain line breaks, which we need to remove (by replacing with a space). The output file has 3 columns: PMC uid, the part of the title preceding the colon (we’ll call that the “pretitle”), and the full title.

#!/usr/bin/ruby

require "nokogiri"

f   = File.open(ARGV[0])
doc = Nokogiri::XML(f)
f.close

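# ".//" keeps each query relative to ameta; a leading "//" would search the whole document
ameta  = doc.xpath("//article/front/article-meta")
pmc    = ameta.xpath(".//article-id[@pub-id-type='pmc']").text.chomp
# replace line breaks before matching, so the pretitle never contains one
title  = ameta.xpath(".//title-group/article-title").text.chomp.gsub("\n", " ")

# \A anchors at the start of the string; ^ would also match after any line break
if title =~ /\A(.*?):/
  r = [pmc, $1, title]
  puts r.join("\t")
end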
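
If you want tab-delimited output that’s a little more robust, one option is to clean every field before joining. This is only a sketch: tsv_row is a hypothetical helper, not part of the script above, and the ID in the example is made up.

```ruby
# Hypothetical helper: build one tab-delimited line from arbitrary strings.
# Any tab, line break or run of whitespace inside a field becomes a single
# space, so a field can never introduce an extra column or row.
def tsv_row(fields)
  fields.map { |f| f.to_s.gsub(/\s+/, " ").strip }.join("\t")
end

# Made-up record with a stray line break and tab in the title
puts tsv_row(["PMC0000000", "Name", "Name: a title\nbroken over\ttwo lines"])
```

The trade-off is that runs of spaces in a title are collapsed, which seems harmless for counting names.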

We can make that much quicker using GNU parallel. Assuming that the XML files were extracted into directory pmc under the current working directory:

find ./pmc -name "*.nxml" | parallel ./pmc2title.rb {} > pmctitles.tsv

3. Count the duplicate terms
Now we have something that R can read easily. As ever, some cleaning is necessary.

  1. the pretitle is converted to lower case, for counting
  2. the PMC dataset contains duplicate records, which can be removed using the UID
  3. after counting the pretitles, we select only those that occur 2 or more times and order by frequency

ti <- read.delim("pmctitles.tsv", header=FALSE, stringsAsFactors=FALSE)
colnames(ti)    <- c("uid", "pretitle", "title")
ti$pretitle.low <- tolower(ti$pretitle)

ti.uniq <- ti[!duplicated(ti[, "uid"]), ]
ti.cnt  <- as.data.frame(table(ti.uniq$pretitle.low), stringsAsFactors = FALSE)
ti.cnt  <- subset(ti.cnt, Freq > 1)
ti.cnt  <- ti.cnt[order(ti.cnt$Freq, decreasing = TRUE), ]
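
If you’d rather stay in Ruby, the same three cleaning steps can be sketched on a few toy rows. The uids and names below are invented, and duplicated_pretitles is just a name I picked for the illustration.

```ruby
# Toy rows of [uid, pretitle]; all values are invented
rows = [
  ["PMC1", "Comet"],
  ["PMC1", "Comet"],    # the same record appearing twice
  ["PMC2", "COMET"],
  ["PMC3", "Muscle"],
  ["PMC4", "muscle"],
  ["PMC5", "muscle"],
  ["PMC6", "Unique"]
]

def duplicated_pretitles(rows)
  rows.uniq { |uid, _| uid }                                      # one row per uid
      .map { |_, pretitle| pretitle.downcase }                    # case-insensitive names
      .each_with_object(Hash.new(0)) { |name, h| h[name] += 1 }   # name => count
      .select { |_, n| n > 1 }                                    # seen 2 or more times
      .sort_by { |name, n| [-n, name] }                           # most frequent first
end

duplicated_pretitles(rows).each { |name, n| puts "#{name}\t#{n}" }
```

Removing duplicate uids before counting matters: otherwise a record that appears twice in the dataset would look like a duplicated name.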

There are quite a few duplicated pretitles – too many to inspect quickly.

nrow(ti.cnt)
[1] 3318

So let’s assume, as is often the case, that software articles usually have one word before the colon and that word is the software name. Of course, there will be many instances where the word is not a software name. Let’s also assume that duplicate software names are unlikely to occur very many times; certainly no more than 10 and perhaps fewer than 5.

# grepl() returns a logical mask, which still works if no pretitle contains a space
ti.one  <- ti.cnt[!grepl(" ", ti.cnt$Var1), ]

nrow(ti.one)
[1] 740

ti.one10 <- subset(ti.one, Freq < 11)

# most duplicates occur 2-3 times
table(ti.one10$Freq)

  2   3   4   5   6   7   8   9  10 
476 120  43  24  19   6   6   3   1
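
The two assumptions (one word, not too frequent) amount to a pair of filters. Here’s a hedged Ruby version on invented counts; candidate_names and MAX_FREQ are my own names, with the cut-off of 10 taken from the R code above.

```ruby
# Lower-cased pretitle => number of distinct articles (invented values)
counts = {
  "snap" => 3, "muscle" => 2, "comet" => 2,
  "study protocol" => 40, "editorial" => 25
}

MAX_FREQ = 10  # same cut-off as Freq < 11 in the R code

def candidate_names(counts, max_freq = MAX_FREQ)
  counts.reject { |name, _| name.include?(" ") }     # single word only
        .select { |_, n| n.between?(2, max_freq) }   # duplicated, but not too common
end

candidates = candidate_names(counts)

# How many names occur 2 times, 3 times, and so on
histogram = candidates.values.each_with_object(Hash.new(0)) { |n, h| h[n] += 1 }
histogram.sort.each { |freq, n_names| puts "#{freq}\t#{n_names}" }
```

Words like “editorial” drop out on frequency alone, which is the point of the upper limit.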

All that remains is to match the pretitles in ti.one10 with those in ti.uniq, write out the results and stare at them.

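# keep the rows of ti.uniq whose pretitle is one of the candidates in ti.one10
# (assumed form of the matching step described above)
ti.in <- ti.uniq[ti.uniq$pretitle.low %in% ti.one10$Var1, ]
ti.in <- ti.in[order(ti.in$pretitle.low), ]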
write.table(ti.in, file = "pmctitles_matched.tsv", sep = "\t", quote = FALSE, 
row.names = FALSE, col.names = FALSE)
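
The matching step is really just a set-membership lookup followed by a sort. A small Ruby sketch, with invented rows and a hypothetical matched_rows helper, might look like this:

```ruby
require "set"

# Rows of [uid, pretitle, title]; invented toy data
rows = [
  ["PMC1", "SNAP",   "SNAP: first toy title"],
  ["PMC2", "Unique", "Unique: only appears once"],
  ["PMC3", "Snap",   "Snap: second toy title"],
  ["PMC4", "Comet",  "Comet: a toy title"]
]

# Lower-cased candidate names, as produced by the counting step
candidates = Set["snap", "comet"]

def matched_rows(rows, candidates)
  rows.select  { |_, pretitle, _| candidates.include?(pretitle.downcase) }
      .sort_by { |uid, pretitle, _| [pretitle.downcase, uid] }   # group names together
end

matched_rows(rows, candidates).each { |row| puts row.join("\t") }
```

Sorting by the lower-cased name puts the competing titles next to each other, which makes the staring part easier.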

4. Did we find anything?
Sure did. There comes a point where manual curation is unavoidable so – here is the file of candidate duplicate names for software or computational resources. Note: there may be cases where the name is something else, such as a clinical trial or protocol.

Some were identified previously by Keith: comet, muscle, snap, medusa. Many from his list are missing, meaning either that the duplicated names are not in the PMC data or the procedure to extract them failed.

Plenty of new entries. Tempting to use “SNiPer” for your SNPs, but think twice. Likewise, VIPR (3 entries). Who’d have thought that there’d be two unrelated COMBREX? And even venerable workflow framework Taverna has a competitor.

As Keith said, the take-home message is simply: do your research before you name things.

5 thoughts on “Searching for duplicate resource names in PMC article titles”

  1. Andy

    I’m sure you thought of this and it won’t explain that much of the variance, but isn’t it also possible that this is due to two papers being published around the same time? If they were within 6-12 months of one another, it might have been impossible or too time-consuming to change the name and not worth it.

    1. nsaunders Post author

      Could be! Easy enough to answer, we could extract dates from the data too. My anecdotal impression is that there’s usually a few years between instances, but would need to back that up with data.

  2. avil

    Interesting post! I think you may miss quite a few duplicated names, although I don’t know exactly why. For instance, PRISM: pair-read informed split-read mapping for base-pair level detection of insertion, deletion and structural variants (http://www.ncbi.nlm.nih.gov/pubmed/22851530) seems to follow the pattern and I don’t see it in your list?

    1. nsaunders Post author

      Yes, some things are not in PMC, other things don’t get extracted. This is just a first pass.

      Your example is in PubMed but apparently, not PMC.

  3. Nora

    I really wish I knew what you all were discussing. I always loved science but chose to go in another direction. Every time I read blogs like this I become so jealous. Not sure what a computational biologist does but it sounds SUPER interesting. By the way, I found your blog by googling the screen name I use for everything, “nsaunders”.
