For better or worse I spend some time each day at Stack Overflow [r], reading and answering questions. If you do the same, you probably notice certain features that recur frequently in questions. It’s as though everyone is copying from one source – perhaps the one at the top of the search results. And it seems that the highest-ranked source is not always the best.
Nowhere is this more apparent to me than in the way many users create data frames. So here is my introductory guide, “how not to create data frames”, aimed at beginners writing their first questions.
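To give a flavour of the kind of contrast the guide draws, here is a minimal sketch – the values, and the particular anti-pattern shown, are illustrative rather than taken from any one question:

```r
# A commonly-copied pattern: pasting values into a matrix first, which
# coerces everything to character (or to factor, before R 4.0)
df1 <- as.data.frame(matrix(c("a", "1", "b", "2"), ncol = 2, byrow = TRUE))
str(df1)  # both columns are character

# The plain alternative: one vector per column, with types preserved
df2 <- data.frame(id = c("a", "b"), value = c(1, 2))
str(df2)  # value stays numeric
```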
Chains of amino acids strung together make up proteins and since each amino acid has a 1-letter abbreviation, we can find words (English and otherwise) in protein sequences. I imagine this pursuit began as soon as proteins were first sequenced, but the first reference to protein word-finding as a sport is, to my knowledge, “Price’s Protein Puzzle”, a letter to Trends in Biochemical Sciences in September 1987.
It occurred to me that TIBS could organise a competition to find the longest word […] contained within any known protein sequence.
The journal took up the challenge and published the winning entries in February 1988. The 7-letter winner was RERATED, with two 6-letter runners-up: LEADER and LIVELY. The sub-genre “biological words in protein sequences” was introduced almost one year later with the discovery of ALLELE, then no more was heard until 1993 with Gonnet and Benner’s Nature correspondence “A Word in Your Protein”.
Noting that “none of the extensive literature devoted to this problem has taken a truly systematic approach” (it’s in Nature, so one must declare superiority), this work is notable for two reasons. First, it discovered two 9-letter words: HIDALGISM and ENSILISTS. Second, it describes the technique – a Patricia tree data structure – and notes that the search took 23 minutes.
Comments on this letter noted one protein sequence that ends with END, and the discovery of the 10-letter but non-English words ANNIDAVATE, WALLAWALLA and TARIEFKLAS.
I last visited this topic at my blog in 2008 and at someone else’s blog in 2015. So why am I here again? Because the Aho-Corasick algorithm in R, that’s why!
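Before the main event, here is a small base-R sketch of what Aho-Corasick does: build a trie over the dictionary, add failure links breadth-first, then scan the sequence in a single pass. The word list is a few of the winners mentioned above and the sequence is made up; for real work you would load a full dictionary and reach for a CRAN implementation rather than this toy one.

```r
# Build an Aho-Corasick automaton: a trie where each node also has a
# failure link (longest proper suffix that is itself a trie path)
build_ac <- function(words) {
  # each node: children (named int vector), fail (int), out (matched words)
  nodes <- list(list(children = integer(0), fail = 1L, out = character(0)))
  for (w in words) {
    cur <- 1L
    for (ch in strsplit(w, "")[[1]]) {
      nxt <- nodes[[cur]]$children[ch]
      if (is.na(nxt)) {
        nodes[[length(nodes) + 1L]] <- list(children = integer(0),
                                            fail = 1L, out = character(0))
        nxt <- length(nodes)
        nodes[[cur]]$children[ch] <- nxt
      }
      cur <- as.integer(nxt)
    }
    nodes[[cur]]$out <- c(nodes[[cur]]$out, w)
  }
  # breadth-first pass to set failure links and merge outputs
  queue <- as.integer(nodes[[1]]$children)
  while (length(queue) > 0) {
    v <- queue[1]; queue <- queue[-1]
    for (ch in names(nodes[[v]]$children)) {
      u <- as.integer(nodes[[v]]$children[ch])
      f <- nodes[[v]]$fail
      while (f != 1L && is.na(nodes[[f]]$children[ch])) f <- nodes[[f]]$fail
      fu <- nodes[[f]]$children[ch]
      nodes[[u]]$fail <- if (!is.na(fu) && as.integer(fu) != u) as.integer(fu) else 1L
      nodes[[u]]$out  <- c(nodes[[u]]$out, nodes[[nodes[[u]]$fail]]$out)
      queue <- c(queue, u)
    }
  }
  nodes
}

# Scan the text once, following failure links on mismatches
ac_search <- function(nodes, text) {
  hits <- character(0)
  cur <- 1L
  for (ch in strsplit(text, "")[[1]]) {
    while (cur != 1L && is.na(nodes[[cur]]$children[ch])) cur <- nodes[[cur]]$fail
    nxt <- nodes[[cur]]$children[ch]
    if (!is.na(nxt)) cur <- as.integer(nxt)
    hits <- c(hits, nodes[[cur]]$out)
  }
  unique(hits)
}

trie <- build_ac(c("LEADER", "LIVELY", "ALLELE", "END"))
hits <- ac_search(trie, "MKLIVELYQALLELEPROTEINEND")
hits  # finds LIVELY, ALLELE and END
```

The point of the failure links is that the text is read only once, regardless of dictionary size – which is why a whole-dictionary search over protein databases is feasible at all.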
The recent ABC News article Australia’s pollution mapped by postcode reveals nation’s dirty truth is interesting. It contains a searchable table, which is useful if you want to look up your own suburb. However, I was left wanting more: specifically, the raw data and some nice maps.
So here’s how I got them, using R.
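One generic way to start is to pull a table out of page HTML with rvest. The snippet below uses an inline stand-in for the real page – the article’s actual URL and table structure aren’t reproduced here, and the values are invented:

```r
library(rvest)

# inline HTML standing in for the article page (illustrative values)
page <- minimal_html('
  <table>
    <tr><th>postcode</th><th>emissions</th></tr>
    <tr><td>2000</td><td>123</td></tr>
    <tr><td>2010</td><td>456</td></tr>
  </table>')

# select the table node and convert it to a data frame
tab <- page %>%
  html_element("table") %>%
  html_table()
tab
```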
Update 2019-07-16: this no longer works for me. I recommend you `brew uninstall llvm`, comment out the `.R/Makevars` lines and `conda install llvm`.
You can file this one under “I may have the very specific solution if you’re having exactly the same problem.”
So: if you’re running some R code and you see a warning like this:
In checkMatrixPackageVersion() : Package version inconsistency detected.
TMB was built with Matrix version 1.2.14
Current Matrix version is 1.2.15
Please re-install 'TMB' from source using
install.packages('TMB', type = 'source') or ask CRAN for a binary
version of 'TMB' matching CRAN's 'Matrix' package
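The message itself contains the fix. As a sketch, assuming both packages are installed: confirm which Matrix version you currently have, then rebuild TMB from source against it.

```r
# Check the Matrix version currently installed
packageVersion("Matrix")

# If it differs from the version TMB was built with, rebuild TMB:
# install.packages("TMB", type = "source")
```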
The article Cytotoxic T cells modulate inflammation and endogenous opioid analgesia in chronic arthritis contains a statement that I don’t recall seeing before:
Availability of data and materials
We do not wish to share our data at this moment.
Sydney’s congestion at ‘tipping point’
Dual-axes at tipping-point
blares the headline, illustrated with an interactive chart: bars for city population densities, points for commute times and, of course, dual axes.
Yuck. OK, I guess it does show that Sydney is one of three cities that are low density, but have comparable average commute times to higher-density cities. But if you’re plotting commute time versus population density…doesn’t a different kind of chart come to mind first? y versus x. C’mon.
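For the record, the “y versus x” version is a couple of lines of base R. The numbers below are invented for illustration – they are not the values behind the article’s chart:

```r
# Hypothetical city data: density and commute time as plain columns
cities <- data.frame(
  city    = c("Sydney", "Melbourne", "Brisbane", "Hong Kong"),
  density = c(410, 450, 150, 6700),  # people per sq km (illustrative)
  commute = c(71, 65, 60, 73)        # average commute, minutes (illustrative)
)

# commute time (y) versus population density (x), one point per city
plot(commute ~ density, data = cities,
     xlab = "Population density (people per sq km)",
     ylab = "Average commute time (minutes)")
text(cities$density, cities$commute, labels = cities$city, pos = 3)
```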
I love it when researchers take the time to share their knowledge of the computational tools that they use. So first, let me point you at Environmental Computing, a site run by environmental scientists at the University of New South Wales, which has a good selection of R programming tutorials.
One of these is Making maps of your study sites. It was written with the specific purpose of generating simple, clean figures for publications and presentations, which it achieves very nicely.
I’ll be honest: the sole motivator for this post is that I thought it would be fun to generate the map using Leaflet for R as an alternative. You might use Leaflet if you want:
- An interactive map that you can drag, zoom, click for popup information
- A “fancier” static map with geographical features of interest
- Concise, clean code that uses pipes and doesn’t require you to process shapefiles
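As a taste, here is a minimal sketch of the interactive version, assuming the leaflet package. The coordinates are placeholders, not the tutorial’s actual study sites:

```r
library(leaflet)

# illustrative study sites, one row per marker
sites <- data.frame(
  site = c("Site 1", "Site 2"),
  lat  = c(-33.90, -34.05),
  lng  = c(151.20, 151.10)
)

m <- leaflet(sites) %>%
  addTiles() %>%                         # OpenStreetMap base layer
  addMarkers(~lng, ~lat, popup = ~site)  # click a marker for its label
m
```

Printed at the console (or knitted into a report), `m` renders as a draggable, zoomable map.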
The code that generated the report (which I’ve used heavily and written about before) is at Github too. A few changes were required compared with previous reports, due to changes in the rtweet package and a weird issue with kable tables breaking markdown headers.
I love that the most popular media attachment is a screenshot of a Github repo.
“Some R functions have an awful lot of arguments”, you think to yourself. “I wonder which has the most?”
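One way to answer, sketched over the base package only: count the formal arguments of each function and sort. The same idea extends to every installed namespace.

```r
# Fetch every object exported by base, keep the functions,
# and count each one's formal arguments
objs   <- mget(ls("package:base"), envir = as.environment("package:base"))
funs   <- Filter(is.function, objs)
n_args <- vapply(funs, function(f) length(formals(f)), integer(1))

# The top of the leaderboard
head(sort(n_args, decreasing = TRUE), 5)
```

Note that `formals()` returns `NULL` for primitives such as `sum`, so they count as zero arguments here.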
A brief message for anyone who uses my PubMed retractions report. It’s no longer available at RPubs; instead, you will find it here at Github. Github pages hosting is great, once you figure out that
docs/ corresponds to your web root :)
Now I really must update the code and try to make it more interesting than a bunch of bar charts.