Chains of amino acids strung together make up proteins and since each amino acid has a 1-letter abbreviation, we can find words (English and otherwise) in protein sequences. I imagine this pursuit began as soon as proteins were first sequenced, but the first reference to protein word-finding as a sport is, to my knowledge, “Price’s Protein Puzzle”, a letter to Trends in Biochemical Sciences in September 1987 .
It occurred to me that TIBS could organise a competition to find the longest word […] contained within any known protein sequence.
The journal took up the challenge and published the winning entries in February 1988 . The 7-letter winner was RERATED, with two 6-letter runners-up: LEADER and LIVELY. The sub-genre “biological words in protein sequences” was introduced almost one year later  with the discovery of ALLELE, then no more was heard until 1993 with Gonnet and Benner’s Nature correspondence “A Word in Your Protein” .
Noting that “none of the extensive literature devoted to this problem has taken a truly systematic approach” (it’s in Nature so one must declare superiority), this work is notable for two reasons. First, it discovered two 9-letter words: HIDALGISM and ENSILISTS. Second, it mentions the technique: a Patricia tree data structure, and that the search took 23 minutes.
Comments on this letter noted one protein sequence that ends with END  and the discovery of 10-letter, but non-English words ANNIDAVATE, WALLAWALLA and TARIEFKLAS .
I last visited this topic at my blog in 2008 and at someone else’s blog in 2015. So why am I here again? Because the Aho-Corasick algorithm in R, that’s why!
Why, it seems like only 12 years since we read Mistaken Identifiers: Gene name errors can be introduced inadvertently when using Excel in bioinformatics.
And can it really be 4 years since we reviewed the topic of gene name corruption in Gene name errors and Excel: lessons not learned?
Well, here we are again in 2016 with Gene name errors are widespread in the scientific literature. This study examined 35 175 supplementary Excel data files from 3 597 published articles. Simple yet clever, isn’t it. I bet you wish you’d thought of doing that. I do. The conclusion: about 20% of the articles have associated data files in which gene names have been corrupted by Excel.
What if there is no tomorrow? There wasn’t one today.
We tell you not to use Excel. You counter with a host of reasons why you have to use Excel. None of them are good reasons. I don’t know what else to say. Except to reiterate that probably 80% or more of the data analyst’s time is spent on data cleaning and a good proportion of the dirt arises from avoidable errors.
My machines upgraded from R version 3.1.3 to version 3.2.0 last week, which means that existing code suddenly cannot find packages and so fails. Some notes to myself, possibly useful to others, for what to do when this happens. Relevant to Ubuntu-based systems (I use Linux Mint).
1. Update packages
cp ~/R/x86_64-pc-linux-gnu-library/3.1 ~/R/x86_64-pc-linux-gnu-library/3.2
1.1. rJava issues
My rJava installation failed because code was trying to compile against jni.h which was not present on my system. Solution:
sudo apt-get install openjdk-7-jdk
sudo R CMD javareconf
and then in R:
2. Update Bioconductor
Bioconductor is also upgraded so requires more than a package update. Probably need a new R session for this one.
My Bioconductor Chemminer update failed because package gridExtra was absent:
3. General issues
When R is installed on Linux Mint, some packages are installed by default in
/usr/lib/R/library. When performing updates as a non-root user, you’ll see messages telling you that this location is not writable and asking if you want to use your own library location. If you reply “yes”, you’ll have packages in both system and user locations. It’s probably better to say “no” and let the Ubuntu package management system handle the package upgrades…although when I tried that, the entire upgrade process halted…
And now we are all done so (careful!):
rm -rf ~/R/x86_64-pc-linux-gnu-library/3.1
“Take a look at the TP53 mutation database“, my colleague suggested. “OK then, I will”, I replied.
I present what follows as “a typical day in the life of a bioinformatician”.
I’m currently rather sleep-deprived and prone to doing stupid things. Like this, for example:
rsync -av ~/Dropbox /path/to/backup/directory/
where the directory
/path/to/backup/directory already contains a much-older Dropbox directory. So when I set up a new machine, install Dropbox and copy the Dropbox directory back to its default location – hey! What happened to all my space? What are all these old files? Oh wait…I forgot to delete:
rsync -av --delete ~/Dropbox /path/to/backup/directory/
Now, files can be restored of course, but not when there are thousands of them and I don’t even know what’s old and new. What I want to do is restore the directories under ~/Dropbox to the state that they were in yesterday, before I stuffed up.
Luckily Chris Clark wrote dropbox-restore. It does exactly what it says on the tin. For example:
python restore.py /Camera\ Uploads 2014-07-22
Over the years, I’ve written a lot of small “utility scripts”. You know the kind of thing. Little code snippets that facilitate research, rather than generate research results. For example: just what are the fields that you can use to qualify Entrez database searches?
Typically, they end up languishing in long-forgotten Dropbox directories. Sometimes, the output gets shared as a public link. No longer! As of today, “little code snippets that do (hopefully) useful things” have a new home at Github.
Also as of today: there’s not much there right now, just the aforementioned Entrez database code and output. I’m not out to change the world here, just to do a little better.
Next week I’ll be in Melbourne for one of my favourite meetings, the annual Computational and Simulation Sciences and eResearch Conference.
The main reason for my visit is the Bioinformatics FOAM workshop. Day 1 (March 27) is not advertised since it is an internal CSIRO day, but I’ll be presenting a talk titled “SQL, noSQL or no database at all? Are databases still a core skill?“. Day 2 (March 28) is open to all and I’ll be talking about “Learning from complete strangers: social networking for bioinformaticians“.
I imagine these and other talks will appear on Slideshare soon, at both my account and that of the Australian Bioinformatics Network.
I’m also excited to see that Victoria Stodden is presenting a keynote at the main CSS meeting (PDF) on “Reproducibility in Computational Science: Opportunities and Challenges”.
Hope to see some of you there.
A “quilt plot”
Quilt plots. Sounds interesting. The link points to a short article in PLoS ONE
, containing a table and a figure. Here is Figure 1.
If you looked at that and thought “Hey, that’s a heat map!”, you are correct. That is a heat map. Let’s be quite clear about that. It’s a heat map.
So, how do the authors justify publishing a method for drawing heat maps and then calling them “quilt plots”?
Read the rest…
May as well begin 2014 where we left off: complaining about the attitude of scientific publishers regarding reproducible computational research.
Read the rest…
Laboratory work, of the “wet” kind, not working out for you? Or perhaps you just need new challenges. Think you have some aptitude with data analysis, computers, mathematics, statistics? Maybe a switch to computational biology is what you need.
That’s the topic of the Nature Careers feature “Computing: Out of the hood“. With thoughts and advice from (on Twitter) @caseybergman, @sarahmhird, @kcranstn, @PavelTomancak, @ctitusbrown and myself.
I enjoyed talking with Roberta and she did a good job of capturing our thoughts for the article. One of these days, I might even write here about my own journey in more detail.