I often find that the most satisfying problems in bioinformatics are not related to a specific research project. They arise when someone has a query and you have a neat way to send them an answer, quickly. Take this email that I received yesterday:
I’m compiling a table of serine-threonine protein kinase structures. Can you take a look and see if anything is missing?
Strangely, I don’t carry a list of current PDB accessions for every protein kinase in my head. On the other hand if you ask me “can I extract all structures that are protein kinases from the PDB?”, the answer is yes, you can. So here’s what I did.
First of course, we need to get the sequences of all chains from the PDB database. Using Linux, as easy as:
wget ftp://ftp.ncbi.nih.gov/blast/db/FASTA/pdbaa.gz gunzip pdbaa.gz
hmmsearch --cut_tc Pkinase_ls pdbaa > pdb.pkinase.hmmer
That “cut_tc” switch is “trusted cutoff” – to return only hits most likely to represent the protein kinase catalytic domain.
This generates a long HMMER report. It would be nice if we could run hmmsearch, parse the report, extract the sequence IDs and generate a new fasta file of hit sequences. We can, using BioPerl and a script that I wrote named hmmer2fa.pl:
This creates a file named pdbaa.hits.fa. It’s a bit messy because the pdbaa fasta file contains huge descriptions in the header line that we don’t really need. Time for Perl script fasclean.pl, which rewrites the header to contain only the ID and reformats the sequence to 60 characters per line:
fasclean.pl pdbaa.hits.fa > kin.hmmer.fa
So we now have a file of chains from the PDB, in fasta format, containing only protein kinase sequences. There are 329 of them. Wait though – a lot of PDB chains are essentially the same sequence with slight variations in the structures. Let’s remove the redundancy using CD-HIT:
cd-hit -i kin.hmmer.fa -o kin90 -c 0.9 -n 5 -d 0 mv kin90 kin90.fa
That clusters the sequences into groups with at least 90% identity and outputs a cluster file (kin90.clstr) and a new fasta file, kin90.fa, containing representatives of only the non-redundant sequences. There are now 113 of them. That -d 0 switch is important; it specifies that sequences in the kin90.clstr file are identified using their complete fasta ID. We need to match that with the fasta file later on.
We’re almost there. All we have to do is write one more Perl script to do the following things: (1) take the files kin90.clstr and kin90.fa as input, (2) classify each of the 113 sequences into a kinase type and family, (3) apply the same classifications to the other members of each cluster and (4) write out a nice summary table containing each cluster with links to the NCBI sequence and PDB structure files. The Perl that does this is called “cdhit2table.pl” and I won’t share it with you just now.
Given that this is all achieved using the CLI and scripting, we could automate the process to update our HTML table at set intervals, providing an up-to-date resource on the Web. Now isn’t that more useful than a printed table in a musty old journal which becomes obsolete even as it’s being compiled?