Abstract word clouds using R

A recent question over at BioStar asked whether abstracts returned from a PubMed search could easily be visualised as “word clouds”, using Wordle.

This got me thinking about ways to solve the problem using R. Here’s my first attempt, which demonstrates some functions from the RCurl and XML packages.

update: corrected a couple of copy/paste errors in the code

First, install a couple of packages: snippets, which provides the cloud() function for plotting a word cloud and tm, a text-mining library:
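
A minimal sketch of the installation, assuming tm (along with RCurl and XML, which the code below also uses) comes from CRAN and that snippets is served from the RForge repository. The repository URL is an assumption and may have changed:

```r
# tm, RCurl and XML: assumed to be installable from CRAN
install.packages(c("tm", "RCurl", "XML"))

# snippets: assumed to live on RForge rather than CRAN
install.packages("snippets", repos = "http://rforge.net/")
```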


Next, the code to search PubMed, fetch abstracts and generate a list of words:


library(RCurl)
library(XML)
library(snippets)
library(tm)

# esearch
url <- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?"
q   <- "db=pubmed&term=saunders+nf[au]&usehistory=y"
esearch <- xmlTreeParse(getURL(paste(url, q, sep="")), useInternalNodes = TRUE)
webenv  <- xmlValue(getNodeSet(esearch, "//WebEnv")[[1]])
key     <- xmlValue(getNodeSet(esearch, "//QueryKey")[[1]])

# efetch
url <- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?"
q   <- "db=pubmed&retmode=xml&rettype=abstract"
efetch <- xmlTreeParse(getURL(paste(url, q, "&WebEnv=", webenv, "&query_key=", key, sep="")), useInternalNodes = TRUE)
abstracts <- getNodeSet(efetch, "//AbstractText")

# words
abstracts <- sapply(abstracts, function(x) { xmlValue(x) } )
words <- tolower(unlist(lapply(abstracts, function(x) strsplit(x, " "))))


Word cloud for abstracts

Let’s run through that. First, load the libraries (lines 1-4). Next, define an EUtils ESearch URL (lines 7-8). Use getURL() (RCurl) to fetch the search result in XML format and xmlTreeParse() (XML) to parse the result into an internal XML document (line 9). Extract the content of the WebEnv and QueryKey tags, to use when we fetch the abstracts (lines 10-11).

To retrieve the abstracts: define an EUtils EFetch URL, fetch the XML and parse as before (lines 14-16). Then getNodeSet() returns a node set, abstracts, which contains the AbstractText tags and their contents (line 17). We can use sapply() to pull the text out of each node (line 20). Finally, we split each abstract into words at spaces (“ ”), combine all of the words into one vector and convert them to lower-case, using the “one-liner” on line 21. Conversion to lower-case ensures that the same word is not counted separately under different capitalisations (e.g. “The” and “the”).
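
The splitting step can be tried on its own without touching PubMed. The sketch below uses two made-up abstracts as stand-ins for the xmlValue() output. It also shows one possible variation, splitting on runs of whitespace, which is not what the post's code does:

```r
# Two made-up abstracts standing in for the text pulled out by xmlValue()
abstracts <- c("The protein binds DNA.", "We show the Protein  is stable")

# strsplit() gives one character vector per abstract; unlist() flattens them
words <- tolower(unlist(strsplit(abstracts, " ")))
print(words)

# Variation: splitting on runs of whitespace avoids the empty string
# produced by the double space in the second abstract
words_ws <- tolower(unlist(strsplit(abstracts, "[[:space:]]+")))
print(words_ws)
```

Note that "dna." keeps its trailing period in both versions, which is why punctuation needs separate handling.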

That’s a good start, but there is still some work to do. For one thing, many of the words are not strictly words, because they include punctuation symbols. We can discard every entry that contains one of these symbols using grepl():

# drop entries containing parentheses, comma, [semi-]colon, period, quotation marks
# (unlike words[-grep(...)], this still works when nothing matches)
words <- words[!grepl("[(),;:.'\"]", words)]
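
The step above discards every entry that contains a punctuation symbol, which also throws away the word itself (the last word of each sentence, for example). An alternative sketch, not the approach used in the post, strips the symbols with gsub() and keeps the words. The example vector is made up:

```r
# Made-up entries of the kind produced by splitting on spaces
words <- c("protein", "(dna)", "binding,", "site.", "\"novel\"", "x-ray")

# Delete the listed punctuation characters instead of dropping the whole entry
cleaned <- gsub("[(),;:.'\"]", "", words)

# Discard anything that became empty after stripping
cleaned <- cleaned[nzchar(cleaned)]
print(cleaned)
```

Which behaviour is preferable depends on whether the punctuated entries are mostly noise or mostly ordinary words.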

We’re probably not interested in “words” composed solely of numerals:

# grepl() keeps the vector intact when no entry is purely numeric
words <- words[!grepl("^[0-9]+$", words)]

We’re definitely not interested in commonly-used words such as: “a, and, the, we, that, which, was, those…” and so on. These are referred to as stopwords – and this is where the tm package is useful. It provides a list of stopwords, to which we can compare our word list and remove matches:

words <- words[!words %in% stopwords()]
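
To see how the %in% filter behaves without loading tm, here is a small sketch with a hand-written stopword vector. The name my_stopwords and its contents are hypothetical; the list that tm's stopwords() returns is much longer:

```r
words <- c("the", "protein", "and", "we", "binding", "the", "site")

# Hypothetical short list standing in for tm's stopwords()
my_stopwords <- c("a", "and", "the", "we", "that", "which", "was", "those")

# %in% is TRUE for each word found in the list; ! keeps the others
kept <- words[!words %in% my_stopwords]
print(kept)
```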

OK – we are just about ready to plot the word cloud. Count the words using table(), remove those that occur only once and plot:

wt <- table(words)
wt <- wt[wt > 1]
cloud(wt, col = col.br(wt, fit=TRUE))
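
Before plotting, it can help to look at the counts directly. A minimal sketch of the counting step on a made-up word vector; the sort() call is an addition for inspection and is not part of the post's code:

```r
words <- c("protein", "binding", "protein", "site", "binding", "protein", "domain")

# table() counts occurrences of each distinct word
wt <- table(words)

# Keep only words seen more than once
wt <- wt[wt > 1]

# Most frequent first, for easier reading
wt <- sort(wt, decreasing = TRUE)
print(wt)
```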

Result: see the word cloud graphic above.

It’s a start, if not quite so attractive as a Wordle. The tm package looks worthy of further investigation; it contains many more functions than the simple use of stopwords() illustrated here.

18 thoughts on “Abstract word clouds using R”

  1. Hello,

    I don’t get this function to work. I think I loaded all the required packages, but it already crashes on line 9:
    > esearch <- xmlTreeParse(getURL(paste(url, q, sep=""), useInternal = T))
    Warning message:
    In mapCurlOptNames(names(.els), asNames = TRUE) :
    Unrecognized CURL options: useinternal
    (I added the extra ")" at the end of the line, as R didn't accept continuing the argument).

    I am still a beginner with R but this seemed a very nice application. What am I doing wrong?
    Thanks, Paul

  2. Funny, I had the same idea of using the snippets cloud() function. Unfortunately, as you can see, the cloud isn’t so nice and some words overlap each other. I dug a little into the code and found two reasons for this misbehaviour:

    – 1/ An error in the cloud() source code which affects the line spacing
    – 2/ With some lowercase characters (p/l/q/…) the R function strheight() doesn’t return a true enclosing height (try plotting text(x, y, "polo") plus a rect() sized using strwidth()/strheight()).

    So I have fixed these two issues in the following function (yspace/xspace are additional fixed spaces added to each line/word; the other arguments are the same as in the original cloud() function):

    fixCloud <- function (w, col, yspace = 0.02, xspace = 0.01, minh = 0, ...)
    if (missing(col))
    col <- "#000000"
    omar <- par("mar")
    par(mar = c(0, 0, 0, 0))
    plot(0:1, 0:1, type = "n", axes = FALSE)
    x = 0
    y = 1
    xch = minh
    cm = 3/max(w) + 0.25
    . 0.98) {
    x <<- 0
    y <<- y - (yspace + xch)
    xch < xch)
    xch <<- cth
    text(x, y, w, cex = cex, adj = c(0, 1), col = col[i])
    ## lines(c(x,x+ctw),c(y,y))
    x <<- x + ctw + xspace

  3. Love your blog and want to include lines of R code like you do in my WordPress blog. What’s the best way? Thanks for the great tutorials!

  4. I always thought it would be cool to make a Word cloud sizing the words according to the probability of their appearance in a corpus of written English. This would allow you to meaningfully make a word cloud from just an abstract or two.

  5. This is fantastically helpful. I have one question. I’d like to tag each abstract with the year of publication.
    Does anyone have a suggestion on how to revise the fetch to pull back this date (year is all I need)?


    • Hi Rob – the line that you would need to change is line 17. The efetch variable contains all of the information but in the code shown, getNodeSet() extracts only the AbstractText tags and their content.

      The publication date tag is “PubDate” and it contains 3 tags, Year, Month and Day. So you’d need to retrieve those tags, get the contents using xmlValue() and stitch them together to form a date.
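
A minimal sketch of that retrieval, run against a made-up XML fragment rather than a live efetch result. The fragment only imitates the PubDate structure described above, and the variable names are illustrative:

```r
library(XML)

# Made-up fragment imitating the PubDate part of an efetch result
xml_text <- '<PubmedArticleSet>
  <PubmedArticle><PubDate><Year>2008</Year><Month>Jun</Month><Day>12</Day></PubDate></PubmedArticle>
  <PubmedArticle><PubDate><Year>2009</Year><Month>Jan</Month></PubDate></PubmedArticle>
</PubmedArticleSet>'
doc <- xmlParse(xml_text, asText = TRUE)

# One value per matching Year node, in document order
years <- xpathSApply(doc, "//PubDate/Year", xmlValue)
print(years)
```

With real data, the same XPath would be applied to the efetch object. Some records may lack a Year tag inside PubDate, so the number of years returned will not necessarily match the number of abstracts.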

  6. Neil,

    Thanks. I’ll work at this. Have you ever tried to get the efetch object into the tm package?

    (There are two parts to that question: one is getting the data into tm, and the other is coaxing tm to capture and use the publication date (year, to be precise) in its analysis.)

    Any thoughts would be most welcome.

    Many thanks,


  7. I’d welcome some additional guidance on this.

    I am trying to cut out multiple single-year extracts from a journal. So, I want to loop across the years of interest, capturing textual data in a way that is indexed by year. Let’s focus on titles. I want to end up with 1980 titles, 1981 titles, 1982 titles, each in a separate data structure that I can then pass to tm and its tools.

    Sounds like a perfect job for a for loop, but this is where I get tangled up.

    Here’s what I want to do (pseudo code):

    for (YEAR in 1980:1990) {
      url <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?"
      q <- "db=pubmed&term=0417360[so]&usehistory=y&datetype=pdat&mindate=YEAR&maxdate=YEAR"
      esearch <- xmlTreeParse(getURL(paste(url, q, sep="")), useInternal = T)
      webenv <- xmlValue(getNodeSet(esearch, "//WebEnv")[[1]])
      key <- xmlValue(getNodeSet(esearch, "//QueryKey")[[1]])

      url <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?"
      q <- "db=pubmed&retmode=xml&rettype=abstract"
      efetch <- xmlTreeParse(getURL(paste(url, q, "&WebEnv=", webenv, "&query_key=", key, sep="")), useInternal = T)
      titles <- getNodeSet(efetch, "//ArticleTitle")
      Titles[YEAR] <- sapply(titles, function(x) { xmlValue(x) } )
    }

    I need to save and then post-process Titles data using tm() tools.

    I am failing on two points:

    1) How do I get YEAR into the PubMed search (inside the double-quote marks)? The quote marks mean that YEAR is not replaced by its value within the for loop, so I need some way to concatenate the pieces of the expression, or somehow escape the quote marks, but I've not been successful. My R skills are not that strong.

    2) I'm not sure how to index Titles so that the recovered titles are stored by year and can then be passed, year by year, to tm routines.

    I'd welcome any suggestions someone might have.

  8. Answering my own first question. The syntax:

    q <- paste("db=pubmed&term=0417360[so]&usehistory=y&datetype=pdat&mindate=",YEAR,"&maxdate=", YEAR, sep="")

    seems to work nicely

    Other problems arise, of course, but I think this part is OK
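
For the second question (keeping titles separate by year), one possible sketch is a named list with one element per year. The function fetch_titles() below is a hypothetical stand-in for the esearch/efetch/getNodeSet steps and simply returns a fake title, so the example runs without network access:

```r
years <- 1980:1982

# Build one ESearch query string per year (sprintf is vectorised)
queries <- sprintf(
  "db=pubmed&term=0417360[so]&usehistory=y&datetype=pdat&mindate=%d&maxdate=%d",
  years, years
)

# Hypothetical stand-in for the real fetch; returns a fake title per query
fetch_titles <- function(query) {
  paste("title for", sub(".*maxdate=", "", query))
}

# A named list keeps each year's titles in its own element
Titles <- lapply(queries, fetch_titles)
names(Titles) <- years
print(Titles[["1981"]])
```

Each element, such as Titles[["1981"]], is then an ordinary character vector that can be handed to later processing one year at a time.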
