But just before I go…

…I have to mention Carl Zimmer’s post on the quest to find English words in human protein sequences.

This game has been around as long as sequence databases have existed. I have a vague memory of a letter from the early 1990s (I first guessed Trends in Biochemical Sciences, but it was Nature; see the comments) in which the authors reported the results of comparing SwissProt with the Oxford English Dictionary. As I recall, the longest word that they found was ENSILISTS – meaning people who practice the art of making silage.

Anyway – here’s a quick and easy way to tackle the problem using EMBOSS and some Linux command line trickery.

  1. Get yourself some human protein sequences:
    wget ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/protein/protein.fa.gz
    gunzip protein.fa.gz
    
  2. Get some English words. If you have the aspell package installed and some English dictionaries, try:
    aspell dump master en | grep -v "'s" | sort > words.txt

    This will dump out about 102 000 words, one per line, sorted alphabetically, with words containing 's removed.

  3. It would be nice to get those words into fasta format. Cue a bit of Perl:
    use strict;
    use warnings;

    # Convert a one-word-per-line file to fasta, numbering the headers from 1.
    my $file = shift or die "Usage: perl fa.pl words.txt\n";
    my $count = 1;
    open my $in, '<', $file or die "Cannot open $file: $!";
    while (my $word = <$in>) {
        chomp $word;
        print ">$count\n$word\n";
        $count++;
    }
    close $in;

    Save that as “fa.pl” and run “perl fa.pl words.txt > words.fa”. Each word now has a sequentially-numbered fasta header.

  4. Run the EMBOSS program water, which performs Smith-Waterman local alignment:
    water -asequence protein.fa -bsequence words.fa \
    -gapopen 10.0 -gapextend 0.5 -outfile words.water

  5. Hmm, ~ 43 MB of text output. We don’t want to read all that. Let’s pull out the pertinent lines using grep; e.g. for matches of 6 or more letters:
    grep -A 1 -B 1 -E '\|{6,}' words.water |less
    

    Which says: look for lines that contain “|” (used in water alignments to mean an exact match) 6 or more times in a row, and show me the line above and the line below. That gives output like this:

    --
    NP_004230.1     1339 SESSELLQQE   1348
                         :||||||..|
    90597              1 tessellate     10
    --
    NP_004230.1     1558 MENTAL   1563
                         ||||||
    82405              4 mental      9
    --
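
The grep above reports any run of six or more identical positions, so a hit can be a fragment of a longer word (the tessellate alignment is one example). As a complementary check, here is a minimal sketch that looks for whole words verbatim in the sequences. It assumes GNU grep, awk and mktemp are available, that the sequences are upper-case, and that words.txt and protein.fa are the files from steps 1 and 2. The function name find_exact_words and its optional minimum-length argument are made up for illustration and are not part of EMBOSS. It may also be worth checking the EMBOSS documentation on whether water reads more than one sequence from -asequence, since both hits shown above come from the same protein.

```shell
#!/bin/sh
# find_exact_words WORDS FASTA [MINLEN]
# Hypothetical helper: lists dictionary words that occur verbatim in any
# sequence of FASTA, longest first, as "length<TAB>WORD".
find_exact_words() {
    words=$1
    fasta=$2
    minlen=${3:-6}

    patterns=$(mktemp)

    # Keep words of at least MINLEN letters that use only the 20 standard
    # one-letter amino-acid codes (so no B, J, O, U, X or Z), upper-cased.
    tr 'a-z' 'A-Z' < "$words" |
        grep -E "^[ACDEFGHIKLMNPQRSTVWY]{$minlen,}\$" |
        sort -u > "$patterns"

    # Join each fasta record onto one line so that words spanning a line
    # break are still found, then print every fixed-string match.
    # Limitation: grep -o reports non-overlapping matches only, so a word
    # that overlaps an earlier match in the same sequence can be missed.
    awk '/^>/ { if (seq != "") print seq; seq = ""; next }
         { seq = seq $0 }
         END { if (seq != "") print seq }' "$fasta" |
        grep -o -F -f "$patterns" |
        sort -u |
        awk '{ print length($0) "\t" $0 }' |
        sort -k1,1nr -k2,2

    rm -f "$patterns"
}
```

Something like `find_exact_words words.txt protein.fa 6 | head` would then list the longest exact matches first. Filtering the word list by alphabet before searching also shrinks it considerably, which would help the water run in step 4 just as much.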
    

MENTAL is in fact the best that I can do. It seems appropriate, somehow. What about you? How about non-English words?

5 thoughts on “But just before I go…”

  1. What do you mean by “the best I can do”? Six is not the longest. Inspired by True Flies, I used blastp at NCBI to search the nr protein database with STEVE as the query. There are at least 100 proteins with perfect matches.

    I then thought of a 7-letter word composed entirely of common amino acids. STELLAR (yes, the first one I thought of) has four perfect matches, including these.

    GENE ID: 4446407 Arth_1104 | protein coding [Arthrobacter sp. FB24]
    Sbjct 131 STELLAR 137

    gb|EDP31906.1| Twik (KCNK-like) family of potassium channels, alpha subunit
    Sbjct 162 STELLAR 168

  2. It wasn’t TiBS, it was Nature. Here’s the reference:

    Nature 361, 121 (14 January 1993) | doi:10.1038/361121b0
    A word in your protein
    Gaston H. Gonnet & Steven A. Benner

    There were two words tied for 1st: “ensilists” and “hidalgism”.

  3. Thanks Henrik!

    Steve – the original post by Carl Zimmer refers to human proteins. I’m sure that there are many longer words in the nr database.
