Extracting Sydney transport data from Twitter

The @sydstats Twitter account uses this code base, together with data from the Transport for NSW Open Data API, to publish insights into delays on the Sydney Trains network.

Each tweet takes one of two forms and is consistently formatted, making it easy to parse and extract information. Here are a couple of examples with the interesting parts highlighted in bold:

Between 16:00 and 18:30 today, **26%** of trips experienced delays. #sydneytrains

The worst delay was **16 minutes**, on the **18:16 City to Berowra via Gordon** service. #sydneytrains
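
As an illustration of why the consistent format helps, here's a minimal sketch of the kind of regular expressions that could pull those values out of the text; this isn't necessarily how the @sydstats code does it:

```r
library(stringr)

# the two tweet forms, using the examples above
tweets <- c(
  "Between 16:00 and 18:30 today, 26% of trips experienced delays. #sydneytrains",
  "The worst delay was 16 minutes, on the 18:16 City to Berowra via Gordon service. #sydneytrains"
)

# percentage of delayed trips from the first form
str_match(tweets[1], "(\\d+)% of trips")[, 2]
# [1] "26"

# delay in minutes and the affected service from the second form
str_match(tweets[2], "delay was (\\d+) minutes, on the (.+) service")[, 2:3]
# [1] "16"                                "18:16 City to Berowra via Gordon"
```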


I’ve created a Github repository with code and a report showing some ways in which this data can be explored.

The take-home message: expect delays somewhere most days, particularly on Monday mornings, when students return to school after the holidays, and when travelling in the far south-west or north-west of the network.

Using parameters in Rmarkdown

Nothing new or original here, just something that I learned about quite recently that may be useful for others.

One of my more “popular” code repositories, judging by Twitter, is – well, Twitter. It mostly contains Rmarkdown reports which summarise meetings and conferences by analysing usage of their associated Twitter hashtags.

The reports follow a common template where the major difference is simply the hashtag. So one way to create a new report is to take the previous one, find and replace the old hashtag with the new one, and save the result as a new file.

That works…but what if we could define the hashtag once, then reuse it programmatically anywhere in the document? Enter Rmarkdown parameters.

Here’s an example .Rmd file. It’s fairly straightforward: just add a params: section to the YAML header at the top and define variables as key-value pairs:

---
params:
  hashtag: "#amca19"
  max_n: 18000
  timezone: "US/Eastern"
title: "Twitter Coverage of `r params$hashtag`"
author: "Neil Saunders"
date: "`r Sys.time()`"
output:
  github_document
---

Then, wherever you want to include the value for the variable named hashtag, simply use params$hashtag, as in the title shown here or in later code chunks.

```{r search-twitter}
tweets <- search_tweets(params$hashtag, params$max_n)
saveRDS(tweets, "tweets.rds")
```
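
As an aside: the defaults in the YAML header can also be overridden at render time using the params argument to rmarkdown::render(), which makes it easy to script a batch of reports. A minimal sketch (the file name and hashtag here are made up):

```r
# knit the same template for a different conference without touching the YAML
rmarkdown::render("twitter_report.Rmd",
                  output_file = "user2019.md",
                  params = list(hashtag = "#useR2019",
                                max_n   = 20000))
```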

That's it! There may still be some customisation and editing specific to each report, but parameters go a long way to minimising that work.

Using OSX? Compiling an R package from source? Issues with ‘-fopenmp’? Try this.

Update 2019-07-16: this no longer works for me. I recommend you brew uninstall llvm, comment out the .R/Makevars lines and conda install llvm.

You can file this one under “I may have the very specific solution if you’re having exactly the same problem.”

So: if you’re running some R code and you see a warning like this:

Warning message:
In checkMatrixPackageVersion() : Package version inconsistency detected.
TMB was built with Matrix version 1.2.14
Current Matrix version is 1.2.15
Please re-install 'TMB' from source using 
install.packages('TMB', type = 'source') or ask CRAN for a binary 
version of 'TMB' matching CRAN's 'Matrix' package

Continue reading

Let’s (briefly) revisit the Nobel API

It’s always nice when 12-month-old code runs without a hitch. Not sure why this did not become a Github repo the first time around, but now it is: my RMarkdown code to generate a report using data from the Nobel Prize API.

Now you too can generate a “gee, it’s all old white men” chart as seen in The Economist (Greying of the Nobel laureates), BBC News (Why are Nobel Prize winners getting older?) and, no doubt, many other outlets every year, including my own post from 2015, now updated. As for myself, perhaps I should be offering my services to news outlets instead of publishing on blogs and obscure web platforms :)
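
The report in the repo does the real work; purely as a taste of the data, here’s a minimal sketch of pulling the laureates from version 1 of the API (endpoint and field names as publicly documented by nobelprize.org, not copied from my report):

```r
library(jsonlite)

# fetch all laureates from v1 of the Nobel Prize API
laureates <- fromJSON("http://api.nobelprize.org/v1/laureate.json")$laureates

# gender balance: one ingredient of the "old white men" chart
table(laureates$gender)

# year of birth; "0000-00-00" marks organisations, so drop the zeros
birth_year <- as.numeric(substr(laureates$born, 1, 4))
summary(birth_year[birth_year > 0])
```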

Searching for duplicate resource names in PMC article titles

I enjoyed this article by Keith Bradnam, and the associated tweets, on the problem of duplicated names for bioinformatics software.

I figured that to some degree at least, we should be able to search for such instances, since the titles of published articles that describe software often follow a particular pattern. There may even be a grammatical term for it, but I’ll call it the announcement colon:

eDuS: Segmental Duplication Simulator
Reveel: large-scale population genotyping using low-coverage sequencing data
RNF: a general framework to evaluate NGS read mappers
Hammock: A Hidden Markov model-based peptide clustering algorithm to identify protein-interaction consensus motifs in large datasets

You get the idea. “XXX COLON a [METHOD] to [DO SOMETHING] using [SOME DATA].”
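
Before turning to the real dataset, a crude first pass at detecting the pattern might be nothing more than a regular expression; something like this sketch (example titles only, and not necessarily the approach used in the mini-project):

```r
# a couple of "announcement colon" titles and one control
titles <- c(
  "eDuS: Segmental Duplication Simulator",
  "RNF: a general framework to evaluate NGS read mappers",
  "Effects of diet on the gut microbiome of laboratory mice"
)

# short name, then a colon, then a description
announce <- grepl("^[A-Za-z0-9()-]+:\\s+\\S", titles)

# candidate software names are whatever precedes the colon
sub(":.*", "", titles[announce])
# [1] "eDuS" "RNF"
```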

Let’s go in search of announcement colons, using titles from the PubMed Central dataset. You can find this mini-project at Github.
Continue reading

Virus hosts from NCBI taxonomy: now at Github

After my previous post on extracting virus hosts from NCBI Taxonomy web pages, Pierre suggested collecting the results in a Github repository. An excellent idea, and here’s my first attempt.

Here’s a count of hosts. By the way, NCBI: it’s spelled “environment”.

cut -f4 virus_host.tsv | sort | uniq -c

   1301 
    283 algae
    114 archaea
   4509 bacteria
      8 diatom
     51 enviroment
    267 fungi
      1 fungi| plants| invertebrates
      4 human
    761 invertebrates
    181 invertebrates| plants
      7 invertebrates| vertebrates
   3979 plants
    102 protozoa
   6834 vertebrates
 115052 vertebrates| human
     43 vertebrates| human  stool
    225 vertebrates| invertebrates
    656 vertebrates| invertebrates| human
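
The same count could be produced in R directly from the TSV file; a small sketch, assuming the file has no header row and the host is in the fourth column:

```r
# count hosts in column 4 of virus_host.tsv, as the shell one-liner does
virus_host <- read.delim("virus_host.tsv", header = FALSE,
                         quote = "", stringsAsFactors = FALSE)
sort(table(virus_host$V4), decreasing = TRUE)
```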

Virus hosts from NCBI Taxonomy web pages

A Biostars question asks whether the information about virus hosts on web pages like this one can be retrieved using Entrez Utilities.

Pretty sure that the answer is no, unfortunately. Sometimes there’s no option but to scrape the web page, in the knowledge that this approach may break at any time. Here’s some very rough and ready Ruby code without tests or user input checks. It takes the taxonomy UID and returns the host, if there is one. No guarantees now or in the future!

#!/usr/bin/ruby

require 'nokogiri'
require 'open-uri'

# fetch the Taxonomy Browser page for a UID and print the Host field, if present
def get_host(uid)
	url   = "http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&lvl=3&lin=f&keep=1&srchmode=1&unlock&id=" + uid.to_s
	doc   = Nokogiri::HTML.parse(open(url).read)
	# lineage table cells contain lines separated by <br>; collect them all
	data  = doc.xpath("//td").collect { |x| x.inner_html.split("<br>") }.flatten
	# the host appears in a line like "<em>Host: </em>vertebrates"
	data.each do |e|
		puts $1 if e =~ /Host:\s+<\/em>(.*?)$/
	end
end

get_host(ARGV[0])

Save as taxhost.rb and supply the UID as the first argument. Note: I chose 12345 off the top of my head, imagining that it was unlikely to be a virus and would make a good negative test. Turns out to be a phage!

$ ruby taxhost.rb 12249
plants
$ ruby taxhost.rb 12721
vertebrates
$ ruby taxhost.rb 11709
vertebrates| human
$ ruby taxhost.rb 12345
bacteria

Some basics of biomaRt

One of the commonest bioinformatics questions, at Biostars and elsewhere, takes the form: “I have a list of identifiers (X); I want to relate them to a second set of identifiers (Y)”. HGNC gene symbols to Ensembl Gene IDs, for example.

When this occurs I have been known to tweet “the answer is BioMart” (there are often other solutions too) and I’ve written a couple of blog posts about the R package biomaRt in the past. However, I’ve realised that we need to take a step back and ask some basic questions that new users might have. How do I find what marts and datasets are available? How do I know what attributes and filters to use? How do I specify different genome build versions?
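
The full post works through those questions; as a taste, here’s a minimal sketch of the kind of session involved, using the HGNC-to-Ensembl example above (the dataset, attribute and filter names are the standard Ensembl ones, not copied from the post):

```r
library(biomaRt)

# which marts are available, and which datasets does the Ensembl mart offer?
listMarts()
mart <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
head(listDatasets(useMart("ensembl")))

# which attributes can be returned, and which filters can we query on?
head(listAttributes(mart))
head(listFilters(mart))

# map HGNC gene symbols to Ensembl gene IDs
getBM(attributes = c("hgnc_symbol", "ensembl_gene_id"),
      filters    = "hgnc_symbol",
      values     = c("TP53", "BRCA2"),
      mart       = mart)
```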
Continue reading

Exploring the NCBI taxonomy database using Entrez Direct

I’ve been meaning to write about Entrez Direct, henceforth called edirect, for some time. A question asked in a tweet provided me with an excuse.

This post is not strictly the answer to that question. Instead we’ll ask: which parent IDs of records for insects in the NCBI Taxonomy database have the most species IDs?
Continue reading