…I have to mention Carl Zimmer’s post on the quest to find English words in human protein sequences.
This game has been around as long as sequence databases have existed. I have a vague memory of a letter from the early 1990s (possibly in Trends in Biochemical Sciences or Nature) in which the authors reported the results of comparing SwissProt with the Oxford English Dictionary. As I recall, the longest word that they found was ENSILISTS – meaning people who practice the art of making silage.
Anyway – here’s a quick and easy way to tackle the problem using EMBOSS and some Linux command line trickery.
- Get yourself some human protein sequences:
    wget ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/protein/protein.fa.gz
    gunzip protein.fa.gz
- Get some English words. If you have the aspell package installed and some English dictionaries, try:

    aspell dump master en | grep -v "'s" | sort > words.txt

  This will dump out about 102 000 words, one per line, sorted alphabetically, with possessive ('s) forms removed.
- It would be nice to get those words into fasta format. Cue a bit of Perl:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Convert a word list (one word per line) to fasta with numbered headers.
    my $file  = shift or die "Usage: perl fa.pl words.txt\n";
    my $count = 1;
    open my $in, '<', $file or die "Cannot open $file: $!\n";
    while (my $word = <$in>) {
        chomp $word;
        print ">$count\n$word\n";
        $count++;
    }
    close $in;

  Save that as “fa.pl” and run “perl fa.pl words.txt > words.fa”. Each word now has a sequentially-numbered fasta header.
- Run the EMBOSS program water, which performs Smith-Waterman local alignment:

    water -asequence protein.fa -bsequence words.fa \
          -gapopen 10.0 -gapextend 0.5 -outfile words.water

- Hmm, ~ 43 MB of text output. We don’t want to read all that. Let’s pull out the pertinent lines using grep; e.g. for 6-letter matches:

    grep -A 1 -B 1 -E '\|{6,}' words.water | less

  Which says: look for lines that contain “|” (used in water alignments to mean an exact match) 6 or more times in a row, and show me the line above and the line below. Giving output like this:

    --
    NP_004230.1     1339 SESSELLQQE   1348
                         :||||||..|
    90597              1 tessellate     10
    --
    NP_004230.1     1558 MENTAL   1563
                         ||||||
    82405              4 mental      9
    --
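The grep filter shows whole alignments, but you still have to read off by eye which residues sit under the run of “|” characters. Here is a rough sketch of one way to print just those residues. The function name matched_runs is made up for this example. It assumes water's output keeps the match line column-aligned with the protein line directly above it, as in the alignments shown in this post, and it uses only standard awk.

```shell
#!/bin/sh
# matched_runs FILE [MIN]  (hypothetical helper, not part of EMBOSS)
# For every stretch of MIN or more consecutive '|' characters in FILE,
# print the residues found in the same columns on the preceding line.
# Assumes the match line is column-aligned with the sequence line above it.
matched_runs() {
  awk -v min="${2:-6}" '
    BEGIN {
      # Build a pattern for MIN or more pipes without relying on {n,} support.
      for (i = 1; i < min; i++) pat = pat "[|]"
      pat = pat "[|]+"
    }
    {
      # Walk along the current line, reporting every qualifying run.
      rest = $0; offset = 0
      while (match(rest, pat)) {
        print toupper(substr(prev, offset + RSTART, RLENGTH))
        offset += RSTART + RLENGTH - 1
        rest = substr(rest, RSTART + RLENGTH)
      }
      prev = $0   # remember this line in case the next one is a match line
    }
  ' "$1"
}
```

Something like “matched_runs words.water 6 | sort | uniq -c | sort -rn | less” would then tally the matched stretches. For the tessellate alignment, the stretch reported is ESSELL rather than the whole word, which is a useful reminder that a run of six matches is not always a six-letter word.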
MENTAL is in fact the best that I can do. It seems appropriate, somehow. What about you? How about non-English words?


