But just before I go…

…I have to mention Carl Zimmer’s post on the quest to find English words in human protein sequences.

This game has been around as long as sequence databases have existed. I have a vague memory of a letter from the early 1990s (~~possibly~~ in ~~Trends in Biochemical Sciences~~ Nature) in which the authors reported the results of comparing SwissProt with the Oxford English Dictionary. As I recall, the longest word that they found was ENSILISTS – meaning people who practice the art of making silage.

Anyway – here’s a quick and easy way to tackle the problem using EMBOSS and some Linux command line trickery.

Get yourself some human protein sequences:

```
wget ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/protein/protein.fa.gz
gunzip protein.fa.gz
```

Get some English words. If you have the aspell package installed and some English dictionaries, try:

```
aspell dump master en | grep -v "'s" | sort > words.txt
```

This will dump out about 102 000 words, one per line, sorted alphabetically with apostrophised words removed.
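An optional extra step, not part of the original recipe: words containing B, J, O, U, X or Z can never be spelled exactly using only the 20 standard amino-acid letters, so they could be dropped before aligning. Here is a minimal sketch, assuming a grep that supports `-x` and `-E`; the function name `filter_protein_words` and the output file name are made up for illustration.

```shell
# Keep only words spelled entirely with the 20 standard amino-acid letters.
# -i ignores case, -x requires the whole line to match, -E enables extended regex.
filter_protein_words() {
    grep -i -x -E '[ACDEFGHIKLMNPQRSTVWY]+'
}

# Small demonstration on three sample words; "box" is dropped (b, o, x).
printf 'mental\nbox\nstellar\n' | filter_protein_words

# Hypothetical use on the real list (words.protein.txt is an illustrative name):
# filter_protein_words < words.txt > words.protein.txt
```

Whether this is worth doing depends on how water treats the ambiguity codes B, Z and X, which I have not checked here.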

It would be nice to get those words into fasta format. Cue a bit of Perl:
```
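use strict;
use warnings;

# Convert a word list (one word per line) to fasta,
# using a running number as the header of each record.
my $file  = shift or die "Usage: perl fa.pl words.txt\n";
my $count = 1;
open my $in, '<', $file or die "Cannot open $file: $!\n";
while (my $word = <$in>) {
    chomp $word;
    print ">$count\n$word\n";
    $count++;
}
close $in;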
```
Save that as “fa.pl” and run “perl fa.pl words.txt > words.fa”. Each word now has a sequentially-numbered fasta header.
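If you would rather skip the script, the same conversion can probably be done with an awk one-liner. This is a sketch of an alternative, not what I actually ran; the function name `words_to_fasta` is just for illustration, and it relies on awk's built-in line counter `NR` for the header number.

```shell
# Turn one-word-per-line input into fasta records numbered 1, 2, 3, ...
words_to_fasta() {
    awk '{ print ">" NR; print $0 }'
}

# Small demonstration on two sample words.
printf 'mental\nstellar\n' | words_to_fasta

# Hypothetical use on the real list:
# words_to_fasta < words.txt > words.fa
```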

Run the EMBOSS program water, which performs Smith-Waterman local alignment:

```
water -asequence protein.fa -bsequence words.fa \
-gapopen 10.0 -gapextend 0.5 -outfile words.water
```

Hmm, ~ 43 MB of text output. We don’t want to read all that. Let’s pull out the pertinent lines using grep; e.g. for 6-letter matches:
```
grep -A 1 -B 1 -E '\|{6,}' words.water |less
```
Which says: look for lines that contain “|” (used in water alignments to mean an exact match) 6 or more times in a row, and show me the line above and the line below. This gives output like this:
```
--
NP_004230.1     1339 SESSELLQQE   1348
                     :||||||..|
90597              1 tessellate     10
--
NP_004230.1     1558 MENTAL   1563
                     ||||||
82405              4 mental      9
--
```
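Rather than guessing the run length and re-running grep with 7, 8, 9 and so on, you could ask for the longest run of “|” directly. A rough sketch, assuming that the only lines with long runs of “|” in the water output are the match lines; the function name `longest_match_run` is invented for this example.

```shell
# Print the length of the longest run of consecutive "|" characters on stdin.
# grep -o emits each run on its own line; awk keeps the maximum length seen.
longest_match_run() {
    grep -o -E '\|+' | awk '{ if (length($0) > max) max = length($0) } END { print max + 0 }'
}

# Small demonstration on two sample match lines.
printf '  :||||||..|\n  ||||\n' | longest_match_run

# Hypothetical use on the real output:
# longest_match_run < words.water
```

The number that comes back can then be plugged into the `{6,}` part of the grep above to pull out the alignments themselves.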

MENTAL is in fact the best that I can do. It seems appropriate, somehow. What about you? How about non-English words?

5 thoughts on “But just before I go…”

Many a biologist has searched for his or her name in protein sequence databases!

What do you mean by “the best I can do”? Six is not the longest. Inspired by True Flies, I used blastp at NCBI to search the nr protein database with STEVE as the query. There are at least 100 proteins with perfect matches.

I then thought of a 7 letter word composed entirely of common amino acids. STELLAR (yes, the first one I thought of) has four perfect matches, including these.

GENE ID: 4446407 Arth_1104 | protein coding [Arthrobacter sp. FB24]
Sbjct 131 STELLAR 137

gb|EDP31906.1| Twik (KCNK-like) family of potassium channels, alpha subunit
Sbjct 162 STELLAR 168

Wait, I just found “PRDVCTVFCRAIGVENTERENTERPRISES” in my genome. Whatever could it mean?

It wasn’t TiBS, it was Nature. Here’s the reference:

Nature 361, 121 (14 January 1993) | doi:10.1038/361121b0
A word in your protein
Gaston H. Gonnet & Steven A. Benner

There were two words tied for 1st: “ensilists” and “hidalgism”.

Thanks Henrik!

Steve – the original post by Carl Zimmer refers to human proteins. I’m sure that there are many longer words in the nr database.


What You're Doing Is Rather Desperate

Notes from the life of a [data] scientist
