Using parameters in Rmarkdown

Nothing new or original here, just something that I learned about quite recently that may be useful for others.

One of my more “popular” code repositories, judging by Twitter, is – well, Twitter. It mostly contains Rmarkdown reports which summarise meetings and conferences by analysing usage of their associated Twitter hashtags.

The reports follow a common template where the major difference is simply the hashtag. So one way to create these reports is to take the previous one, find and replace the old hashtag with the new one, and save the result as a new file.

That works…but what if we could define the hashtag once, then reuse it programmatically anywhere in the document? Enter Rmarkdown parameters.

Here’s an example .Rmd file. It’s fairly straightforward: just add a params: section to the YAML header at the top and list your variables as key-value pairs:

---
params:
  hashtag: "#amca19"
  max_n: 18000
  timezone: "US/Eastern"
title: "Twitter Coverage of `r params$hashtag`"
author: "Neil Saunders"
date: "`r Sys.time()`"
output:
  github_document
---

Then, wherever you want to use the value of the hashtag variable, simply write params$hashtag, as in the title shown here or in later code chunks.

```{r search-twitter}
library(rtweet)

# search recent tweets for the conference hashtag defined in the YAML header
tweets <- search_tweets(params$hashtag, n = params$max_n)
saveRDS(tweets, "tweets.rds")
```
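A nice bonus is that parameters can also be overridden at render time, without touching the YAML at all, by passing a params list to rmarkdown::render(). Here’s a minimal sketch, assuming the template above is saved as twitter.Rmd (the file name and the replacement hashtag are just made-up examples):

```r
library(rmarkdown)

# reuse the same template for a different conference by overriding
# the defaults declared in the YAML header
# ("twitter.Rmd" and "#useR2019" are made-up examples)
render("twitter.Rmd",
       params = list(hashtag = "#useR2019", max_n = 20000))
```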

That's it! There may still be some customisation and editing specific to each report, but parameters go a long way to minimising that work.

PMRetract: now with rake tasks

Bioinformaticians (and anyone else who programs) love effective automation of mundane tasks. So it may amuse you to learn that I used to update PMRetract, my PubMed retraction notice monitoring application, by manually running the following steps in order:

  1. Run query at PubMed website with term “Retraction of Publication[Publication Type]”
  2. Send results to XML file
  3. Run script to update database with retraction and total publication counts for years 1977 – present
  4. Run script to update database with retraction notices
  5. Run script to update database with retraction timeline
  6. Commit changes to git
  7. Push changes to Github
  8. Dump local database to file
  9. Restore remote database from file
  10. Restart Heroku application

I’ve been meaning to wrap all of that up in a Rakefile for some time. Finally, I have. Along the way, I learned something about using efetch from BioRuby and re-read one of my all-time favourite tutorials, on how to write rake tasks. So now, when I receive an update via RSS, updating should be as simple as:

rake pmretract
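For anyone curious what that looks like, here’s a rough sketch of a Rakefile which chains the steps above as dependent tasks. To be clear, this is not the actual PMRetract Rakefile: the script and file names are invented, and the database dump/restore commands are omitted since they depend on the database in use.

```ruby
# Rakefile - illustrative sketch only; script and file names are invented

desc "Fetch retraction notices from PubMed as XML"
task :fetch do
  sh "ruby fetch_retractions.rb > retractions.xml"
end

desc "Update the database: counts, notices and timeline"
task :update => :fetch do
  sh "ruby update_counts.rb"
  sh "ruby update_notices.rb"
  sh "ruby update_timeline.rb"
end

desc "Commit and push changes, refresh the remote database, restart the app"
task :deploy => :update do
  sh "git commit -am 'Update retraction data'"
  sh "git push origin master"
  # dump the local database and restore it remotely here
  sh "heroku restart"
end

desc "Run the whole PMRetract update"
task :pmretract => :deploy
```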

In other news: it’s been quiet here, hasn’t it? I recently returned from 4 weeks overseas, packed up my office and moved to a new building. Hope to get back to semi-regular posts before too long.

Automatic content for the people

Anyone who has ever built a website knows that maintaining it is a lot of work. There’s making sure it hasn’t gone offline because the httpd daemon died, constant monitoring for script kiddies and their SQL injections, and the need to keep feeding it fresh content, lest your audience become bored and desert you.

I’ve always thought it would be cool to build a site that could more or less look after itself. There are myriad content management systems to choose from, most of which are somewhat hackable in whatever language they happen to be coded in. One of the more mature in this respect is Drupal, the engine behind Eureka! Science News: a fully-automated science news portal that uses a bunch of customised Drupal modules to aggregate, cluster, categorise and rank articles.

First impressions are excellent. Coders will enjoy this post at Drupal explaining how it all works.

Can every workflow be automated?

Some random thoughts for a Friday afternoon.

Many excellent posts by Deepak on the topic of workflows have got me thinking. I very much like the notion that all analysis in computational biology should be automated and repeatable, so far as is practicable. However, I’ve not yet experienced a “workflow epiphany”. There are some impressive and interesting projects around, notably Taverna and myExperiment, but I see these as prototypes and testbeds for how the future might look, rather than polished solutions usable by the “average researcher”.

I also can never quite escape the feeling that this type of workflow doesn’t describe how many researchers go about their business, at least in academia. Wrong directions, dead ends, trial and error, bad decisions. To me a workflow is rather like a scientific paper: an artificial summary of your work that you put together at the end, describing an imaginary path from starting point to destination that you couldn’t know you were going to follow when you set out. Useful for others who want to follow the same path, less so for the person blazing the trail. Is this in fact the primary purpose of a workflow? To allow others to follow the same path, rather than to plan your own?

I wonder in particular about operations where manual intervention and decision making are required. In structural biology, for instance, I often see my coworkers doing something like this:

  • Open experimental data (e.g. electron density) in a GUI-based application
  • “Fiddle” with it until it “looks right”
  • Save output

How do you automate that middle step? It may be that the operation can be described using parameters which can be saved and replayed later, but a lot of science seems to rely on a human decision as to whether something is “sensible”.

I don’t know if we can capture everything that we do in a form that a machine can run. Perhaps workflows highlight the difference between research and analysis: a creative thought process versus a set of algorithms.