How to: map protein sequence onto chromosomal coordinates using BioPerl

My coding challenge this week: given a protein sequence and its exons, how do you map single amino acid residues to a location on a DNA sequence? It’s trickier than you might think. Read on for my latest BioPerl how-to.

Basic concepts
Picture a hypothetical protein sequence, 7 amino acid residues long. Those residues are encoded by 7 x 3 = 21 bases. Therefore, to describe the location of an amino acid on a DNA sequence, we require 3 positions.

In prokaryotes, the codons for our protein sequence are contiguous. If I tell you that the gene for our 7 amino acid protein begins at position 1000 on the chromosome, you can tell me that amino acid #1 maps to (1000, 1001, 1002), amino acid #2 maps to (1003, 1004, 1005) and so on, up to amino acid #7 at (1018, 1019, 1020). In general, the codon for amino acid #N starts at gene start + (3N-3) and ends at gene start + (3N-1).

This might be the case in a eukaryote too, but we may also need to deal with introns and exons. Let’s say that our protein is encoded by 2 exons; exon #1 from 1000-1012, exon #2 from 2000-2007. The chromosomal coordinates of our amino acids now look like this:

#1 1000-1002
#2 1003-1005
#3 1006-1008
#4 1009-1011
#5 1012, 2000-2001
#6 2002-2004
#7 2005-2007

Here we see that amino acid #5 crosses a splice site and so the corresponding codon contains bases from 2 exons. In other words, you can’t take the start position of the gene and find the coordinates of an amino acid using the simple equations given above.

BioPerl Location objects
Fortunately, BioPerl provides us with methods to deal with coordinates like those of amino acid #1 (1000-1002) and amino acid #5 (1012, 2000-2001). The first type is modelled using a simple Bio::LocationI object. For the second we use the more complex Bio::Location::SplitLocationI object and the associated sub_Location() method, which returns an array of sub-features from a main feature (such as exons in a gene).

For more details please consult the excellent BioPerl Feature Annotation HOWTO or the BioPerl documentation for Bio::Range and Bio::Location.

Getting exons from a gene
Given a sequence file in GenBank format, here’s a subroutine to parse it for coding features (CDS), the corresponding genes, the exons of those genes and the translated protein sequence. You could call it using “my %features = get_features(‘mySeqFile’, ‘CDS’)”.

1.  sub get_features {
2.      my %features;
3.      my ($seqfile, $ptag) = @_;
4.      my $seqio  = Bio::SeqIO->new('-file' => $seqfile, '-format' => 'genbank');
5.      my $seqobj = $seqio->next_seq;
6.      my $src    = $seqobj->id;
7.      my $length = $seqobj->length;
8.      my @cds    = grep { $_->primary_tag eq $ptag $seqobj->get_SeqFeatures;
9.      for my $feat(@cds) {
10.         for my $gene($feat->get_tag_values('gene')) {
11.             @{$features{$src}{$ptag}{$gene}{'product'}}     = $feat->get_tag_values('product');
12.             @{$features{$src}{$ptag}{$gene}{'protein_id'}}  = $feat->get_tag_values('protein_id');
13.             @{$features{$src}{$ptag}{$gene}{'translation'}} = $feat->get_tag_values('translation');
14.               $features{$src}{$ptag}{$gene}{'strand'}       = $feat->strand;
15.               $features{$src}{$ptag}{$gene}{'start'}        = $feat->start;
16.               $features{$src}{$ptag}{$gene}{'end'}          = $feat->end;
17. ## location
18. 	if($feat->location->isa('Bio::Location::SplitLocationI')) {
19. 	    my @locn = $feat->location->sub_Location;
20. 	    for(my $i = 0; $i <= $#locn; $i++) {
21. 		    @{$ret{$src}{$ptag}{$gene}{'exon'}[$i]} = ($locn[$i]->start,$locn[$i]->end);
22.                                            }
23.                                                               }
24. 	else {
25. 	        @{$ret{$src}{$ptag}{$gene}{'exon'}[0]} = ($feat->start,$feat->end);
26.          }
27.                                                     }
28. ## length of source feature
29.     $features{$src}{'length'} = $length;
30.                        }
31. return %features;
32. }

Looks a little fearsome – but only because I like to store things in complex data structures! First, we create a sequence object from the GenBank file (line 5) and get a couple of basic properties from the sequence: an identifier (line 6) and length (line 7). Next, we use some Perl shorthand (line 8 ) to build an array of sequence feature objects (@cds) where the primary tag of the feature is ‘CDS’. Then we loop through the array of features (line 10) pulling out some values for those features: the gene name, protein product name, protein ID, translated protein sequence and for the gene its start, end and strand (lines 10-16). All of these go into a complex hash, %features with a hierarchy of hash keys.

Yes I know – in this example, there are no checks to see if the tags exist. Please include them in your own code.

Lines 17-27 are where we find the exons. We check whether the location of the feature is of type Bio::Location::SplitLocationI (line 18). If so we create an array of exons using the sub_Location() method (line 19), loop through it and add an array of arrays to %features, where the first array index is the exon and the second array contains (exon start, exon end) (line 21). Otherwise we have a simple Bio::Location object (no introns) and we just get the start and end of the feature (line 25).

Finally we store the length of the whole DNA sequence in %features (line 29) – just in case – and return the hash.

Mapping codons to exons
Now it’s time to “think algorithmically”. Given a protein sequence, the start/end of its gene and the starts/ends of the exons, how do we map each amino acid to the DNA sequence? Something like this:

  1. Make an array where elements 0->N represent amino acid residues 1->N+1.
  2. If the strand is +, start at the first position of the first exon. This is the position for the first base of the first codon.
  3. Increment the position by one – this is the position for the next base of the codon. Repeat.
  4. If the codon is 3 bases long, start a new codon.
  5. If the position is greater than the exon length, move to the next exon and reset the start position
  6. Finish when we reach the end of the last exon.
  7. The array index for each codon equals the array index of the amino acid – so we can now map codon DNA coordinates to amino acid residues

If the strand is -, we simply reverse the process by starting at the end of the last exon and working our way to the start of the first exon.

Here’s how that looks in our Perl code. Just for fun, let’s map only codons that encode Ser, Thr or Tyr residues.

33.  sub map_codons {
34.    for my $src(keys %features) {
35.  	  for my $gene(keys %{$features{$src}{'CDS'}}) {
36.  	    my $strand = $features{$src}{'CDS'}{$gene}{'strand'};
37.  	    my @exons  = @{$features{$src}{'CDS'}{$gene}{'exon'}};
38.  	    my $protein = ${$features{$src}{'CDS'}{$gene}{'translation'}}[0]."*";
39.  	    my @residues = split("", $protein);
40.  	    my @codons;
41.  ## initialise the codon and its first base
42.  	    my $codon = 0;
43.  	    my $base  = 0;
44.  ## forward
45.  	    if($strand == 1) {
46.  		for(my $exon = 0; $exon <= $#exons; $exon++) {
47.		    for(my $start = $exons[0][0]; $start <= $exons[$#exons][1]; $start++) {
48.		      if($start > $exons[$exon][1]) {
49.  			    $exon++;
50.  			    $start = $exons[$exon][0];
51.                                                 }
52.  			$codons[$base] = $start;
53.  			$base++;
54.  			if($base == 3) {
55.  			    if($residues[$codon] =~/[STY]/) { # get S/T/Y codons only
56.  				@{$features{$src}{'CDS'}{$gene}{'codons'}{$&}{$codon+1}} = @codons;
57.                                                         }
58.  			    $base = 0;
59.  			    $codon++;
60.                                    }
61.                                                                                       }
62.                                                          }
63.                          }
64.  ## reverse
65.  ## start is still < end, just start with last exon and work backwards
66.  	    elsif($strand == -1) {
67.            # AN EXERCISE FOR THE READER :)
68.                              }
69.  	                                               }
70.                              }
71.  }

Off we go, looping through our hash keys (lines 34-39), pulling out the strand, the array of exons and the translated sequence. Hold it – why are we appending an asterisk (*) to the protein sequence (line 38) ? That’s because the last 3 bases from the coordinates of the gene are a stop codon, not the last amino acid residue. When we convert the protein sequence to an array (line 39), the array indices won’t match those of the codons unless we include a character to represent stop.

We set up an array to hold the codons and initialise the codon and base count (lines 40-43). Then we employ the algorithm that we devised earlier to loop through the exons (lines 45-63). Once we have a complete 3-base codon (line 54), a regular expression checks the corresponding amino acid residue in @residues and if it’s S, T or Y the positions of the codon bases are stored in %features under the keys ‘codons’, the residue and its position in the amino acid sequence (line 56). We reset (line 58) and start on a new codon (line 59). Lines 48-51 skip to the next exon once we go over the end of the current one.

Once again, note the horrible lack of checks that things are as they should be in this code.

So what can we do with %features? Well, we might loop though and print out the gene name, exon, codon, codon base, position on chromosome and amino acid residue. Here’s a small sample of output using the file NC_001224.gbk (the mitochondrial genome of S. cerevisiae), without the limit on S/T/Y residues:

COB     0       0       1       36540   M
COB     0       0       2       36541   M
COB     0       0       3       36542   M
COB     0       1       1       36543   A
COB     0       1       2       36544   A
COB     0       1       3       36545   A
....
COB     0       137     1       36951   Q
COB     0       137     2       36952   Q
COB     0       137     3       36953   Q
COB     0       138     1       36954   M
COB     1       138     2       37723   M
COB     1       138     3       37724   M
....
COB     5       385     1       43645   *
COB     5       385     2       43646   *
COB     5       385     3       43647   *
AI2     0       0       1       13818   M
AI2     0       0       2       13819   M
AI2     0       0       3       13820   M
AI2     0       1       1       13821   V
AI2     0       1       2       13822   V
AI2     0       1       3       13823   V

Bear in mind that Perl arrays start at zero whereas sequences start at one when you look at those zeroes (and write your code).

Here we see that the gene COB contains 6 exons (0-5). The codon for amino acid residue M139 (add one!) spans a splice site between exons 1 and 2 (add one!). Later on COB ends and a new gene AI2 begins. We can cross-reference these start-end positions with the original GenBank file to make sure that everything is working properly.

Representation of codons
Finally, the homework question. You don’t often see codons represented this way in, for instance GFF files. I wonder whether it’s sensible to represent a codon as 3 bases (especially if they span a splice site), or whether most people would represent the chromosomal coordinates of an amino acid as a single point: perhaps using the first base of the codon?

I hope this is useful to you; corrections are welcome, as ever.

6 thoughts on “How to: map protein sequence onto chromosomal coordinates using BioPerl

  1. Aaron

    Or, you could have used Bio::Coordinate::GeneMapper, and saved yourself a fair amount of time and code.

    -Aaron

  2. Vasilis J Promponas

    Thanks for a nice piece of code.
    It is a succssful illustration of a seemingly easy-to-tackle task, which would require 100′s lines of code (and probably hours of debugging!!) …

    Sincerely,

    Vasilis Promponas

    PS: “7 x 3 = 21 codons” should read “7 x 3 = 21 bases”

  3. nsaunders Post author

    Thanks Vasilis – typo fixed.

    Or, you could have used Bio::Coordinate::GeneMapper, and saved yourself a fair amount of time and code

    Indeed you could – provided that (1) you’re familiar with Bio::Coordinate (which I’m not) and (2) you’ve read and understood the (currently rather patchy) documentation for the module (which I hadn’t). Aaron makes a good point though: always check whether a library exists for your task, before writing a heap of code. Thanks for the pointer.

    However, this post isn’t really about how to save time using libraries; it’s about the mental process of converting biological objects to computational representations. As such, the first-principles approach does no harm once in a while – and can help you understand what’s going on inside those black-box modules.

  4. Pingback: BioBlogs 19: Bioengineering « O’Really? at Duncan.Hull.name

  5. Prof.S.Krupanidhi

    The present article “How to: map protein sequence onto chromosomal coordinates using BioPerl” deals with the the development of a bioperl programme for locating the code for the chosen aminoacids in both prokaryote and eukaryote genome. It is excellent exercise and innovative. As the genome sequences are available, there is lot to derive the secondary data from genome through bioinformtics analysis.

Comments are closed.