…I have to mention Carl Zimmer’s post on the quest to find English words in human protein sequences.
This game has been around as long as sequence databases have existed. I have a vague memory of a letter from the early 1990s (not in Trends in Biochemical Sciences, as I first thought, but in Nature) in which the authors reported the results of comparing SwissProt with the Oxford English Dictionary. As I recall, the longest word that they found was ENSILISTS – meaning people who practice the art of making silage.
Anyway – here’s a quick and easy way to tackle the problem using EMBOSS and some Linux command line trickery.
- Get yourself some human protein sequences:
wget ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/protein/protein.fa.gz
gunzip protein.fa.gz
- Get some English words. If you have the aspell package installed and some English dictionaries, try:
aspell dump master en | grep -v "'s" | sort > words.txt
This will dump out about 102 000 words, one per line, sorted alphabetically with apostrophised words removed.
- It would be nice to get those words into fasta format. Cue a bit of Perl:
use strict;
use warnings;
# Write each word as a fasta record with a sequentially-numbered header.
my $file = shift or die "Usage: perl fa.pl words.txt\n";
my $count = 1;
open my $in, '<', $file or die "Cannot open $file: $!\n";
while (my $word = <$in>) {
    chomp $word;
    print ">$count\n$word\n";
    $count++;
}
close $in;
Save that as “fa.pl” and run “perl fa.pl words.txt > words.fa”. Each word now has a sequentially-numbered fasta header.
- Run the EMBOSS program water, which performs Smith-Waterman local alignment:
water -asequence protein.fa -bsequence words.fa \
    -gapopen 10.0 -gapextend 0.5 -outfile words.water
- Hmm, ~ 43 MB of text output. We don’t want to read all that. Let’s pull out the pertinent lines using grep; e.g. for matches of 6 letters or more:
grep -A 1 -B 1 -E '\|{6,}' words.water | less
Which says: look for lines that contain “|” (used in water alignments to mean an exact match) 6 or more times in a row and show me the line above and the line below. Giving output like this:
--
NP_004230.1 1339 SESSELLQQE 1348
                 :||||||..|
90597          1 tessellate 10
--
NP_004230.1 1558 MENTAL 1563
                 ||||||
82405          4 mental 9
--
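One caveat with the grep step: it reports any run of six or more identities, so SESSELLQQE against tessellate shows up even though the whole word is not in the protein. If you only want whole words, plain substring search is enough. Here is a minimal Python sketch of that idea. It is not part of the EMBOSS pipeline. The function names (read_fasta, usable_words, find_words) and the toy sequences are made up for illustration, and I am assuming the proteins use only the 20 standard one-letter amino acid codes, so any word containing B, J, O, U, X or Z is dropped up front.

```python
import io

# The 20 standard amino acid one-letter codes (assumption: no ambiguity codes).
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file or file-like object."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def usable_words(words, min_len=6):
    """Upper-case the words and keep those spelled only with amino acid letters."""
    kept = set()
    for word in words:
        word = word.strip().upper()
        if len(word) >= min_len and set(word) <= AMINO_ACIDS:
            kept.add(word)
    return kept


def find_words(records, words):
    """Return (word, header, 1-based start) for each whole-word hit, longest first."""
    # Group words by length so each protein is scanned once per word length.
    by_length = {}
    for word in words:
        by_length.setdefault(len(word), set()).add(word)
    hits = []
    for header, seq in records:
        for n, wordset in by_length.items():
            for i in range(len(seq) - n + 1):
                piece = seq[i:i + n]
                if piece in wordset:
                    hits.append((piece, header, i + 1))
    hits.sort(key=lambda h: (-len(h[0]), h[0], h[1], h[2]))
    return hits


if __name__ == "__main__":
    # Toy data only; for the real thing, open protein.fa and words.txt instead.
    toy_fasta = io.StringIO(">toy1\nAAMENTALKK\n>toy2\nSESSELLQQE\n")
    toy_words = ["mental", "tessellate", "stellar", "boxer"]
    for word, header, start in find_words(read_fasta(toy_fasta), usable_words(toy_words)):
        print(word, header, start)
```

On the toy data this finds MENTAL in toy1 and nothing for tessellate, since only six of its ten letters line up. For the real files, pass open("protein.fa") to read_fasta and open("words.txt") to usable_words. Sliding a window over each protein and checking set membership keeps the work proportional to total sequence length times the number of distinct word lengths, rather than testing every word against every protein.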
MENTAL is in fact the best that I can do. It seems appropriate, somehow. What about you? How about non-English words?
Many a biologist has searched for his or her name in protein sequence databases!
What do you mean by “the best I can do”? Six is not the longest. Inspired by True Flies, I used blastp at NCBI to search the nr protein database with STEVE as the query. There are at least 100 proteins with perfect matches.
I then thought of a 7 letter word composed entirely of common amino acids. STELLAR (yes, the first one I thought of) has four perfect matches, including these.
GENE ID: 4446407 Arth_1104 | protein coding [Arthrobacter sp. FB24]
Sbjct 131 STELLAR 137
gb|EDP31906.1| Twik (KCNK-like) family of potassium channels, alpha subunit
Sbjct 162 STELLAR 168
Wait, I just found “PRDVCTVFCRAIGVENTERENTERPRISES” in my genome. Whatever could it mean?
It wasn’t TiBS, it was Nature. Here’s the reference:
Nature 361, 121 (14 January 1993) | doi:10.1038/361121b0
A word in your protein
Gaston H. Gonnet & Steven A. Benner
There were two words tied for 1st: “ensilists” and “hidalgism”.
Thanks Henrik!
Steve – the original post by Carl Zimmer refers to human proteins. I’m sure that there are many longer words in the nr database.