An introduction to Bio::Graphics

I don’t normally write long “how to” articles, but I’ve been having so much fun using BioPerl’s Bio::Graphics module recently that I thought I’d write up a tutorial, of sorts. It’s quite long and assumes a reasonable knowledge of Perl and BioPerl. There may be errors – let me know if you spot any.

A brief introduction to Bio::Graphics

Generating graphics can be a lot of hassle. In fact, many bioinformaticians that I know, myself included, tend to avoid graphics altogether. The reasoning goes that if you can rapidly generate some readable plain text output that you can understand, why go to the trouble of generating some pretty pictures just to make things easier for other people? Not to mention the fact that for those us who are not professionally trained programmers, graphics libraries are rather intimidating.
That said, there are occasions when to use the old cliche, a picture speaks a thousand words. A lot of bioinformatics analysis involves collating the output from several procedures into some kind of summary and a graphic can really help out. So here’s my introduction to BioPerl’s Bio::Graphics module, which I find is a pretty quick and painless way to generate useful and attractive figures.

You’ll need BioPerl to be installed, obviously. I run version 1.5.0 RC1 on Ubuntu Linux. For Bio::Graphics to work, you also need the libgd libraries and the perl module, which lets you access them from your perl scripts. On a Linux distribution like Ubuntu or Debian these are available as packages – look for ‘libgd2’ and ‘libgd-gd2-perl’. You also need the Text::Shellwords module. Finally, try:

perl -MGD -e 'print $GD::VERSION';

and if you get a number (I get 2.23), you’re ready to roll.

Sequence features
Generating images using Bio::Graphics revolves around the idea of sequence features. A feature is really anything that you can map onto a sequence. It could be a motif, a BLAST hit or even some quantitative data such as solvent-accessible area. The simplest feature has just two properties – ‘start’ and ‘end’, which are its coordinates relative to the sequence. Other properties that you can add might be ‘score’, ‘strand’, ‘name’ and so on.
The easiest way to build a feature using BioPerl is to use the Bio::SeqFeature::Generic module. This enables you to parse the output of a program, extract the variables that make up the feature and build it up, piece by piece. For example:

use Bio::SeqFeature::Generic;
//get start, end and score somehow
my $feature = Bio::SeqFeature::Generic->new(-start => $start,
					    -end   => $end,
					    -score => $score);

It’s that simple.

Graphics in BioPerl centre around three concepts: panels, tracks and glyphs. A panel is what it sounds like – the canvas onto which you draw your graphic. A panel can be divided into tracks, which are where we place our sequence features. A feature may occupy its own track, or several features can occupy one track. Glyphs are the symbols that you do the drawing with – lines, boxes, diamonds and so on.
So, if we wanted to add our sequence feature, we first set up the panel:

use Bio::Graphics;
my $panel = Bio::Graphics::Panel->new(-length => 1000,
				      -width  => 800,
				      -pad_left => 10,
				      -pad_right > 10);

And then we add our feature to a track, specifying how we’d like it to look:

		-glyph => 'generic',
		-fgcolor => 'black',
		-bgcolor => 'blue');

When we’re done, we print out to a PNG file:

open OUT, ">mypic.png";
print OUT $panel;
close OUT;

A Real Example
OK, here’s a real world example. Recently, I was dealing with a bunch of sequences that are substrates for protein tyrosine kinases. Three types of analysis are associated with these sequences:

  1. Local alignments to find short conserved motifs in the sequence
  2. The location of the phosphorylated tyrosines, from the phosphoELM database
  3. A BLAST search of sequences in the PDB, to look for structural alignments

I’d like to take all of these data, map them onto the sequence and display the result as a Bio::Graphics panel.

Let’s start with the local alignments. The output from my program looks like this:

Score: 13.4528  Columns: 13  Sequences: 2

MET_MOUSE    +   1284 DVNTFDITIYLLQ  1296 (13.5)
GAB1_HUMAN   +     86 DINTIDRIFYLVA    98 (14.9)

(some other irrelevant stuff)

Score: 13.4528  Columns: 13  Sequences: 2

MET_MOUSE    +   1284 DVNTFDITIYLLQ  1296 (13.5)
GAB1_HUMAN   +     86 DINTIDRIFYLVA    98 (14.9)

(some other irrelevant stuff)
(repeat another 8 times)

From left to right you’ve got the sequence UniProt ID, strand (always + in proteins so not relevant), start, motif sequence, end and in parentheses, a score. The file is space-delimited and underscores are found only in the protein names. Several alignments are generated, each starting with a “Score:” line, but we only want the first. So we can do this:

1 sub parse_align {
2    my $count = 0;
3    open IN, $align_data_file;
4    while() {
5	if(/^Score:/)  {
6	    $count++;
7                      }
9	if(/^(.*?)\s+(\+)\s+(\d+)\s+(.*?)\s+(\d+)\s+\((.*?)\)$/ && $count == 1) {
10	    $data{$1}{'align'}{'start'} = $3;
11	    $data{$1}{'align'}{'motif'} = $4;
12	    $data{$1}{'align'}{'end'}   = $5;
13	    $data{$1}{'align'}{'score'} = $6;
14                                                                    }
15               }
16    close IN;
17 }

What we’ve done here is create a hash, %data, where the key is the UniProt ID. We create a second key, ‘align’, to specify that these are the alignment data, then store the variables that we’ll use later on to create our sequence feature. Each time we see ‘Score:’ we increment a counter and we only store the hash values if the count is 1 – i.e. if it’s the first alignment.
I like complex data structures (in this case a hash of hashes of hashes). You can build them up gradually as you work through the various input files, then loop through them at the end to display the result. And at any time, you can use the Data::Dumper module and ‘print Dumper %data’, to make sure that things are being stored as you expect.

So much for the local alignments. Now we turn to the phosphoELM database. This is distributed as a tab-delimited text file and a row of data looks like this:

P16104  (sequence)  139  S  11571274  ATM  LTP  2004-12-31 00:00:00+01

From left to right we’ve got a UniProt AC, the protein sequence (omitted here), position of the phosphorylation site, the phosphoresidue (S, T or Y), a PubMed UID, the name of the kinase (ATM), a method (LTP) which doesn’t concern us and a timestamp, also of no concern.
This is easy to parse, but there’s a small problem. Our local alignment data contains UniProt IDs (e.g. MET_MOUSE), whereas this file contains UniProt ACs (MET_MOUSE for instance is P16056). There are many ways to relate these identifiers – I chose to keep a local copy of the UniProt database, index it using the BioPerl tool and then grab the appropriate sequence with a second tool, This allows you to map the ID to the AC and also supplies you with a sequence object which you need to create the protein sequence feature. Let’s put all of that together.

First, we parse the phosphoELM data:

1 sub parse_elm {
2 my %modres = ('S' => 'Phosphoserine',
3 	        'T' => 'Phosphothreonine',
4	        'Y' => 'Phosphotyrosine');
6    if(-e $elm)  {
7	open IN, $elm;
8	while(<IN>)   {
9	    my @line = split("\t", $_);
10	    if($line[0] ne "acc")   {
11		my $accn = $line[0];
12		my $posn = $line[2];
13		my $code = $line[3];
14		if($accn && $posn && $modres{$code}) {
15		push @{$elm{$accn}{$modres{$code}}}, $posn;
16	                                             }
17                                  }
18                    }
19	close IN;
20                }
21 }

Here, we’re reading in a local file of the phosphoELM data which we’ve assigned previously to $elm. We split on tab and if it’s not the first line (which is a header line beginning ‘acc’), we get the UniProt AC ($accn), phosphoresidue position ($posn) and residue code (S, T or Y, $code). A quick check that we have all we need, then we use a hash, %elm, with $accn as the first key and one of ‘Phosphoserine’, ‘Phosphothreonine’ or ‘Phosphotyrosine’ as the second key to hold an array of positions for each type of phosphoresidue.
Next, we use our UniProt AC to pull out the UniProt ID from our local UniProt copy.

1 sub mod_res {
2    for my $record (sort keys %data)          {
3	my $swiss = ` --fmt swiss swiss:$record`;
4	my $seqio = Bio::SeqIO->new('-fh'     => IO::String->new($swiss),
5				    '-format' => 'swiss');
6	my $seq = $seqio->next_seq;
7          $data{$record}{'CDS'}{'accn'} = $seq->accession;
8	   $data{$record}{'CDS'}{'start'} = 1;
9	   $data{$record}{'CDS'}{'end'} = $seq->length;
11	for my $key (sort keys %{$elm{$seq->accession}})   {
12	    @{$data{$record}{$key}} = @{$elm{$seq->accession}{$key}};
13                                                         }
15                                             }
16 }

Don’t forget that perl needs to know where your sequence indices live. You can specify this as an environment variable like so:
$ENV{‘BIOPERL_INDEX’} = ‘/path/to/your/indices’;
Lines 2 and 3 get the UniProt ID ($record) and pull the sequence in SwissProt format out of the database ($swiss). In lines 4-6, there’s a bit of BioPerl trickery. We create a sequence object directly from $swiss using a filehandle and IO::String to pretend that $swiss is a file, which saves us having to write out the sequence and read it in again. Whilst we have the sequence, we get its start and end to use later on in a sequence feature, storing those values under the ‘CDS’ key. Lines 11-13 loop through %elm and map the array of phosphoresidue positions into our main hash, %data, under the UniProt ID, rather than the UniProt AC.

Now for the big one. Wouldn’t it be nice to take our UniProt sequence, BLAST it against sequences from the PDB database, parse the BLAST output and display the best HSP as a sequence feature on our graphic, more or less in one step?
Well if BioPerl can do it (and it can), then so can you.

1 sub blast_pdb {
2 my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => "blastp",
3						      'database' => "pdbaa",
4						      'F'        => "F",
5						      'g'        => "F");
7   for my $record (sort keys %data) {
8	my $swiss = ` swiss:$record`;
9	my $seqio = Bio::SeqIO->new('-fh'     => IO::String->new($swiss),
10				    '-format' => 'fasta');
11	my $seq = $seqio->next_seq;
12	$data{$record}{'blast'} = ();
13	my $search = $factory->blastall($seq);
15	while(my $result = $search->next_result) {
16	    while(my $hit = $result->next_hit) {
17		while(my $hsp = $hit->next_hsp) {
18		    next if($hsp->evalue > 1e-10);
19	if($hsp->start('query') end('query')   >= $data{$record}{'align'}{'end'}) {
21 ## existing record
22		if($data{$record}{'blast'}{'hit_name'}) {
23		    if($hsp->evalue name;
25			$data{$record}{'blast'}{'evalue'} = $hsp->evalue;
26			$data{$record}{'blast'}{'score'} = $hsp->score;
27			$data{$record}{'blast'}{'start'} = $hsp->start('query');
28			$data{$record}{'blast'}{'end'} = $hsp->end('query');
29                                                                       }
30                                                       }
31 ## first record
32			else {
33			$data{$record}{'blast'}{'hit_name'} = $hit->name;
34			$data{$record}{'blast'}{'evalue'} = $hsp->evalue;
35			$data{$record}{'blast'}{'score'} = $hsp->score;
36			$data{$record}{'blast'}{'start'} = $hsp->start('query');
37			$data{$record}{'blast'}{'end'} = $hsp->end('query');
39                           }
40                                                                }
41                                              }
42                                             }
43                                               }
44                                   }
45 }

Phew, there’s a lot going on in there. In line 2 we define our BLAST parameters using Bio::Tools::Run::StandAloneBlast. You need to have local BLAST set up correctly and formatted databases (in this case pdbaa) for this to work. We get our UniProt sequence into a sequence object again, then run the BLAST (line 13) to create a Search::IO object ($search). In lines 15-43, we use the Search::IO methods to get HSPs, ignoring those with an E-value of > 1e-10. We’re only interested in those HSPs which span our alignment motif (lines 19-20). Then we store the variables that we’ll use to create the HSP sequence feature in $data{$record}{‘blast’}. The first time around (lines 31-37), we store values from the first HSP. Subsequently (lines 21-28), we only store values if the HSP E-value is less than the one already stored. In this way, we end up with only the ‘best’ HSP.

Let’s summarise what we’ve done. We parse 3 types of data – local alignments, phosphoELM data and a BLAST report. We’ve extracted the variables that we need to create sequence features (start, end, score, names) and stored them in a complex hash. Remember Data::Dumper? If we do a ‘print Dumper %data’, our hash of stored values, for one UniProt protein, looks like this:

$VAR2 = {
          'Phosphotyrosine' => [
          'CDS' => {
                     'accn' => 'P16056',
                     'end' => 1379,
                     'start' => 1
          'blast' => {
                       'evalue' => '0.0',
                       'score' => '1617',
                       'hit_name' => 'pdb|1R1W|A',
                       'end' => 1358,
                       'start' => 1047
          'align' => {
                      'score' => '13.5',
                      'end' => 1296,
                      'motif' => 'DVNTFDITIYLLQ',
                      'start' => 1284

So, we have everything that we need to create sequence features from phosphoresidues, BLAST HSPs and local alignments, mapped onto our protein.
Without further ado, here’s the render() subroutine.

1 sub render {
2 for my $record(sort keys %data) {
3     my %phospho;
4     my $panel = Bio::Graphics::Panel->new(-width     => 760,
5  				            -key_style => 'between',
6				            -length    => $data{$record}{'CDS'}{'end'},
7					    -pad_left  => 20,
8					    -pad_right => 20
9                                          );
11  my $full_length = Bio::SeqFeature::Generic->new(-start        => 1,
12						    -end          => $data{$record}{'CDS'}{'end'},
13						    -display_name => $record
14                                                 );
16 ## ruler track
17	   $panel->add_track($full_length,
18                 	     -glyph   => 'arrow',
19 	                     -tick    => 2,
20	                     -fgcolor => 'black',
21	                     -double  => 1
22                          );
24 ## CDS track
25	   $panel->add_track($full_length,
26			     -key     => 'CDS',
27			     -glyph   => 'transcript2',
28			     -fgcolor => 'black',
29			     -bgcolor => 'orange',
30			     -height  => 10,
31			     -label   => 1
32                          );
34 ## ALIGN track	
35 my $align = Bio::SeqFeature::Generic->new(-start => $data{$record}{'align'}{'start'},
36			         -end               => $data{$record}{'align'}{'end'},
37			         -score             => $data{$record}{'align'}{'score'},
38			         -display_name      => $data{$record}{'align'}{'motif'}
39                                          );
41  $panel->add_track($align,
42		    -key   => sub{
43                                return 'ALIGN: motif ='.$align->display_name.
44                                       ', posn = '.
45                                       $align->start.
46                                       ' - '.
47                                       $align->end;
48                               },
49                  -glyph => 'generic'
50                   );
52 ## phosphosites track
53	for my $key (sort keys %{$data{$record}}) {
54	  if($key =~/(Phosphoserine|Phosphothreonine|Phosphotyrosine)/) {
55	    for(@{$data{$record}{$key}}) {
56		my $f = Bio::SeqFeature::Generic->new(-start => $_,
57						      -end   => $_);
58		push @{$phospho{$1}}, $f;
59                                       }
60                                                                      }
61                                                }
63  for my $tag (sort keys %phospho) {
64    my $features = $phospho{$tag};
65     $panel->add_track($features,
66		         -key         => $tag,
67		         -glyph       => 'diamond',
68		         -fgcolor     => 'black',
69		         -bgcolor     => 'red',
70		         -font2color  => 'red',
71		         -description => sub{
72                                           my $feat = shift;
73                                           return $feat->start;
74                                          }
75                      );
76                                   }
78 ## blast track
79	if($data{$record}{'blast'}{'hit_name'}) {
80 my $f = Bio::SeqFeature::Generic->new(-start        => $data{$record}{'blast'}{'start'},
81				         -end          => $data{$record}{'blast'}{'end'},
82				         -display_name => $data{$record}{'blast'}{'hit_name'},
83				         -score        => $data{$record}{'blast'}{'evalue'}
84                                      );
86 $panel->add_track($f,
87		     -key       => sub{return 'PDB BLAST: E ='.$f->score;},
88		     -glyph     => 'generic',
89		     -fgcolor   => 'black',
90		     -bgcolor   => 'blue',
91		     -label     => 1
92                  );
93                                                     }
95 ## print PNG
96    open OUT, ">$record.png";
97    print OUT $panel->png;
98    close OUT;
99                                        }
100 }

Again, that looks like a lot of code but it’s fairly simple. We loop through %data and for each key ($record), set up a new panel (lines 4-8). The panels are 800 pixels wide (760 for drawing and 20 of padding each side). Panel length is defined by the full length of our protein.
At line 11, we create our first sequence feature, $full_length. We use the UniProt ID ($record) to define display_name. Next, at line 17, we create our first track. This is a scale for the image (glyph ‘arrow’), with tick marks.
At line 25, we draw our first real sequence feature which is the full-length protein (‘CDS track’). The ‘transcript2’ glyph has a pointy end indicating the C-terminus.
Next comes the ‘ALIGN’ track (lines 34-50). First we create a sequence feature with start, end, score and display_name – in this case, the motif itself. Then we add the feature to the ‘ALIGN’ track. This add_track() routine illustrates the use of a subroutine to determine what will be displayed for the key – it will show the motif and the start/end coordinates.
The ‘phosphosites track’ is a little more complex still. A protein may contain >1 phosphosite of 3 types – phosphoserine, phosphothreonine or phosphotyrosine. We want a track for all 3 types and each phosphosite feature to be placed in the correct track. So we use a hash, %phospho, where the keys are each type (lines 52-61). Then we create a sequence feature for each phosphosite and push them onto an array, under the key of the right type. In lines 63-76, we create the 3 types of track and just drop the array of sequence features into it. This is a nice feature of Bio::Graphics and Bio::SeqFeature::Generic – you can use arrays of features (or even references to arrays of features) and just drop them straight into your graphics. The phosphosites are shown as a red diamond and again, we use a subroutine to define the glyph ‘description’. In this case, the position of the site will be shown underneath the diamond.
Finally we add the ‘BLAST track’. Again, it’s just a case of looping through %data, getting the HSP information for each protein, creating the sequence feature and passing it to add_track with some information about how the track should look.
With our track and panels complete, all that remains is to generate the PNG graphic file (lines 95-98). This will create files with names like ‘MET_MOUSE.png’.

And the end results look like this:


In this example, the only phosphosites are phosphotyrosines. The local alignment motif (cyan box) lies towards the C-terminus of the protein and this region has a good BLAST hit to 1R1W, chain A.


In this case, the motif lies near the N-terminus of the protein. There are no significant PDB blast hits in the region, but the protein contains both phosphoserine and phosphotyrosines.

Going further
I’ve only touched on the many Bio::Graphics methods and variables here, as the main emphasis is on creating features from various data. A much better tutorial is the BioPerl Graphics HowTo. Also useful is the BioPerl Feature Annotation HowTo, which explains rich sequence feature objects very well. You should also consult the BioPerl documentation for more information on how to use Bio::Graphics, Bio::SeqFeature::Generic and the other modules mentioned here.

One thought on “An introduction to Bio::Graphics

Comments are closed.