Parse the PDB using Bioperl

I think it’s time for a quick Perl tutorial. This week: how to extract PDB entries based on 3 factors: a domain that interests you, the number of chains and their length.

Here’s my problem. I’d like a list of structures where a protein kinase has been crystallised with a peptide substrate. Unfortunately, a term like “peptide substrate” isn’t a standard annotation that we can use to query a structural or sequence database, so we do things somewhat indirectly:

  • First, identify all protein kinase catalytic domains in the PDB
  • Second, keep only those structures with > 1 chain
  • Third, keep only those chains that (a) don’t contain the kinase domain and (b) are less than a specified length

On a Linux box or similar, we can download our input file from the NCBI FTP server and uncompress it like so:

gunzip pdbaa.gz

The file, pdbaa, contains sequences in fasta format for all protein chains represented in the PDB.

We also need a HMM, in this case for the protein kinase catalytic domain. Quickest way to grab that is to visit the relevant Pfam page, look for the “download HMM” link at the bottom of the page and grab the one called Pkinase_ls, saving it as “Pkinase.hmm”.

The perl script is called It requires the Bioperl core and run libraries and uses the HMMER package. There are undoubtedly smaller, more elegant ways to write this code but for demonstration purposes, we’re going to use a great big hash. Off we go then, starting with the preamble:

#!/usr/bin/perl -w
# use the pdbaa file as input
# finds PDBs with > 1 chain and a specific HMM
# prints out associated chains <= to specified length

use strict;
use Bio::SeqIO;
use Bio::Tools::Run::Hmmer;
use Getopt::Long;
use Data::Dumper;

my($infile, $hmm, $len);
GetOptions('seqfile=s' => \$infile, 'hmm=s' => \$hmm, 'chainlen=i' => \$len);
die("Usage = --seqfile  --hmm  --chainlen \n")
    if(!$infile || !$hmm || !$len);

my $factory = Bio::Tools::Run::Hmmer->new('program' => 'hmmsearch',
                                             'hmm'     => $hmm,
                                             'cut_tc'  => '');

## get chains
my %seqs = get_chains();
## get hmm hits
## print

First, we load up the required modules. Next we declare some variables – $infile, $hmm and $len. Each of these has to be specified on the command line and if they aren’t found, the script will die with an error message. In our example to find chains of 30 residues or less associated with a protein kinase, we’d type:

./ --seqfile pdbaa --hmm Pkinase.hmm --chainlen 30

On the last line we create a Bio::Tools::Run::Hmmer object. This tells the script that we want to run the program hmmsearch, using the input HMM and an option called “cut_tc”; this stands for trusted cutoff and means that hits below a certain threshold are rejected.

Next, we call 3 subroutines called get_chains(), get_hmmhit() and print_output(). Here’s get_chains():

sub get_chains {
my %seqs;
my $seqio = Bio::SeqIO->new('-file' => $infile, '-format' => 'fasta');

while(my $seqobj = $seqio->next_seq) {
    if($seqobj->id =~/pdb\|(.*?)\|(\w+)/) {
        $seqs{$1}{$2} = $seqobj->seq;
return %seqs;

What we do here is create a Bio::SeqIO object from the fasta file and loop through each sequence object in turn, looking at the ID. All of the IDs look something like this:


In this case, our regex will assign the PDB ID, 1IHO, to $1 and the chain, A, to $2. All PDB IDs have at least one chain, A. We set up a hash, %seqs where the first key is the PDB ID, the second is the chain and the value of the second key is the chain sequence. Hash keys have to be unique so in this way, we store all chain IDs under the PDB ID key and all sequences under the chain ID key.

When we run the script it spews warnings like this one:

-------------------- WARNING ---------------------
MSG: Got a sequence with no letters in it cannot guess alphabet []

It means exactly what it says and so far as I know, can’t be turned off. Don’t worry about it – you can always tack “2>STDERR” on the end of the command line to redirect the message.

Next, we want to search each sequence with the Pkinase HMM using hmmsearch. Here’s subroutine get_hmmhit():

sub get_hmmhit {
    for my $pdb (keys %seqs) {
	if(keys %{$seqs{$pdb}} >= 2) {
	for my $chain (keys %{$seqs{$pdb}}) {
    my $seq = Bio::Seq->new('-display_id' => $pdb.$chain,
 				    '-seq' => $seqs{$pdb}{$chain});
	    my $search = $factory->run($seq);
	    while(my $result = $search->next_result) {
		while(my $hit = $result->next_hit) {
		    $seqs{$pdb}{'hmm'}  = 'hit';
		    $seqs{$pdb}{$chain} = 'kinase';
	delete $seqs{$pdb} unless(exists($seqs{$pdb}{'hmm'}));

Here, we’re looping through the keys of %seqs. If we find >1 keys for the chains, we create a Bio::Seq object from the chain and run our HMMER search. This is a very nice Bio::Tools::Run feature – you can use a sequence object as input and get a Bio::Search object as output, without reading/writing files. If hmmsearch gave a hit, we tag the PDB ID by creating a hash key called ‘hmm’ with the value ‘hit’ and replacing the kinase sequence with the value ‘kinase’.

If there is only 1 chain or none of the chains contained a kinase catalytic domain, we delete the entire PDB entry from the hash because later on, we’re only interested in “a kinase chain plus something else”.

At this point we can “print Dumper %seqs” to see that all is well; here’s a portion of %seqs:

$VAR5 = '1CMK';
$VAR6 = {
          'hmm' => 'hit',
          'E' => 'kinase'
$VAR7 = '1UKH';
$VAR8 = {
          'A' => 'kinase',
          'hmm' => 'hit',
          'B' => 'RPKRPTTLNLF'

Nearly there. Now we just have to loop though again and print some results using the last subroutine, print_output():

sub print_output {
    my $out = "";
    for my $pdb (sort keys %seqs) {
	delete $seqs{$pdb}{'hmm'};
	    for my $chain (sort keys %{$seqs{$pdb}}) {
		if($seqs{$pdb}{$chain} ne 'kinase') {
		    my $length = length($seqs{$pdb}{$chain});
				     if($length <= $len) {
                                       $out .= "$pdb\t$chain\t$length\n";
    print $out;

We’re looping through %seqs again, looking at the PDB ID keys. First we delete the ‘hmm’ tag, to make sifting through the keys easier. Next, we look at the value of the chain key. If it’s ‘kinase’, we ignore it. If it’s not ‘kinase’ then it’s the sequence of the non-kinase chain(s). We get the length and if it’s less than or equal to what we specified as interesting the PDB ID, chain ID and chain length are appended to a variable, $out, which we print out at the end.

Final output looks like this:

1CMK	I	22
1DS5	E	23
1IR3	B	18
1JBP	S	20
1K3A	B	14
1LEW	B	12
1O6K	C	10
1O9U	B	18
1QMZ	E	7
1RQQ	E	18
1UKH	B	11
1WBP	B	9
2B9H	C	14
2BIL	A	14
2CPK	I	20
2G1T	E	13
2G2F	C	11
2GPH	B	16
2JAM	D	6
2PHK	B	7

We now have a list of chains less than 31 residues long which are associated with a protein kinase chain. We don’t know what they are – they could be substrate analogues, inhibitors or fragments of other subunits or proteins. However, at least we’ve cut the original fasta file down from over 30 000 sequences to a list of 20 candidates. It’s now easy enough to visit the PDB website and see what they are.

2 thoughts on “Parse the PDB using Bioperl

  1. Odd that you got the “MSG: Got a sequence with no letters in it cannot guess alphabet []” warning when parsing the fasta output. I’ll have to look into that!

    If you want to turn warnings off for any bioperl class you can set the object to $obj->verbose(-1), or pass it to the constructor:

    my $seqio = Bio::SeqIO->new(‘-file’ => $infile, ‘-format’ => ‘fasta’, -verbose => -1);

    You can also have it throw on warnings for ‘strictness’:

    my $seqio = Bio::SeqIO->new(‘-file’ => $infile, ‘-format’ => ‘fasta’, -verbose => 2);

    or spit out debugging code if it is present:

    my $seqio = Bio::SeqIO->new(‘-file’ => $infile, ‘-format’ => ‘fasta’, -verbose => 1);

  2. Thanks Chris. I’ve been happily catching thrown exceptions for years, but was never aware of the -verbose switch!

    It seems some of the entries in pdbaa really are devoid of letters for some reason, so don’t worry about that too much.

Comments are closed.