This year, I’ve experienced what bloggers call – um, not blogging very much. One reason is that much of our conversation has moved to other services – notably FriendFeed. However, the main reason is that I have a day job: develop bioinformatics applications, perform research, publish articles, present talks and keep the boss happy. Read on for some “notes from the day job” – especially if protein kinases and their substrates are your thing.
When I arrived at UQ a couple of years ago, I inherited a project named Predikin. The goal of Predikin is to predict substrates for protein kinases, using structural features in the kinase catalytic domain. It’s a simple but effective idea: you look at the available structures of protein kinase-substrate peptide complexes and locate the kinase amino acid residues that determine how the peptide (XXX[STY]XXX) “fits” into the binding pocket. We call these substrate-determining residues, or SDRs. With enough examples of kinase sequences and known substrates, you can create a database that links SDRs to substrate peptides. You can then take a query kinase sequence, identify its SDRs, make a list of substrate peptides for kinases with similar SDRs and use that list to build a scoring matrix. Finally, you use the matrix to scan and score putative substrates for your query kinase.
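To make the lookup step concrete, here is a minimal sketch in Python (not the Perl that Predikin itself uses). Everything in it is an assumption for illustration: the toy `SDR_DB` entries, the function names, the idea of treating the SDRs as one fixed-length string, and the 0.8 identity cut-off. It only shows the general shape of "find kinases with similar SDRs and pool their substrate peptides".

```python
# Hypothetical toy database: SDR string -> known substrate heptapeptides.
# The SDR strings and peptides are made up for illustration only.
SDR_DB = {
    "ELKRDE": ["RRASLPG", "KRKSVEA"],
    "ELKRDQ": ["RRGSFDA"],
    "GFTYPW": ["EDEYVNA"],
}


def sdr_identity(a, b):
    """Fraction of positions at which two equal-length SDR strings agree."""
    if len(a) != len(b):
        raise ValueError("SDR strings must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def pool_substrates(query_sdrs, db, min_identity=0.8):
    """Collect peptides from every database entry whose SDRs are
    at least `min_identity` identical to the query SDRs."""
    peptides = []
    for sdrs, known_peptides in db.items():
        if sdr_identity(query_sdrs, sdrs) >= min_identity:
            peptides.extend(known_peptides)
    return peptides
```

Calling `pool_substrates("ELKRDE", SDR_DB)` would pool the peptides from the first two entries and skip the third. The pooled list is what would then be turned into a scoring matrix.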
My main task, then, was to make things better. This involved:
- Rewriting Predikin: I chose Perl, mainly for the BioPerl libraries, and wrote my first module, Predikin.pm, with a bunch of methods for analysing protein kinases and substrates
- Extensive use of HMMs: from SMART, for alignment of a query kinase with a model of the kinase catalytic domain to identify SDRs; from PANTHER and/or the Kinase Sequence Database, to classify a query kinase into a family
- Lots of MySQL: to build the backend database of kinases, substrates and peptides
- Introducing various filters (based, for example, on predicted disorder) to eliminate less likely substrate peptides from prediction
- Figuring out the best way to implement position frequency matrices (PFMs) and position weight matrices (PWMs), and to use them to score putative substrate sequences
- Building a new web interface – which explains why “running a background process using PHP” is consistently the most-viewed post on this blog
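For the PFM/PWM item in the list above, a rough Python sketch of the general technique may help. This is not the code in Predikin.pm: the uniform background frequency, the pseudocount of 1, the log2-odds weighting and all function names are assumptions chosen to keep the example short.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pfm(peptides):
    """Position frequency matrix: one Counter of residues per column."""
    length = len(peptides[0])
    if any(len(p) != length for p in peptides):
        raise ValueError("all peptides must have the same length")
    return [Counter(p[i] for p in peptides) for i in range(length)]


def build_pwm(pfm, n_peptides, pseudocount=1.0):
    """Position weight matrix as log2 odds against a uniform background.
    The pseudocount stops unseen residues from scoring minus infinity."""
    background = 1.0 / len(AMINO_ACIDS)
    total = n_peptides + pseudocount * len(AMINO_ACIDS)
    return [
        {aa: math.log2(((counts[aa] + pseudocount) / total) / background)
         for aa in AMINO_ACIDS}
        for counts in pfm
    ]


def score_peptide(pwm, peptide):
    """Sum the column weights; non-standard residues contribute 0."""
    return sum(col.get(aa, 0.0) for col, aa in zip(pwm, peptide))


def scan(pwm, sequence, flank=3):
    """Score every XXX[STY]XXX window in a protein sequence.
    Returns (1-based site position, peptide, score), best score first."""
    hits = []
    for i in range(flank, len(sequence) - flank):
        if sequence[i] in "STY":
            peptide = sequence[i - flank:i + flank + 1]
            hits.append((i + 1, peptide, score_peptide(pwm, peptide)))
    return sorted(hits, key=lambda hit: hit[2], reverse=True)
```

Given a list of pooled heptapeptides, you would call `build_pfm`, pass the result to `build_pwm`, and then run `scan` over a candidate substrate sequence. Sites closer than `flank` residues to either end of the sequence are skipped in this sketch, and a real implementation would probably want a better background model than the uniform one assumed here.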
Having done all that, it was time to try and publish the work.
1. The webserver
This went quite smoothly. Am I happy with the webserver? Well, it works. If I were designing it again today, I’d do it quite differently – but that’s always the way. When the boss says “can we have the webserver ready for my India meeting in 2 weeks’ time”, you don’t get great software design. The combination of PHP talking to Perl via a PECL module is not something of which I’m proud – it seemed clever at the time, but makes maintenance a nightmare.
2. The details
Saunders, N.F.W., Brinkworth, R.I., Huber, T., Kemp, B.E. and Kobe, B. (2008).
Predikin and PredikinDB: a computational framework for the prediction of protein kinase peptide specificity and an associated database of phosphorylation sites.
BMC Bioinformatics 9:245. [Open Access] | [PubMed]
This did not go quite so smoothly. It was in review for the best part of a year, due to a variety of incidents with which I won’t bore you. However, I did learn an awful lot about ROC curves, cross-validation and R during the process.
You’ll note that it ended up as a Software article, which troubles me because Predikin is not something that you could easily download, set up and run. Once again, the pressures of “publish or perish” lead to sub-optimal design and documentation. However, the paper is really about how Predikin works in terms of algorithms and components. I hope it succeeds from that point of view.
So there it is. If protein kinases and their substrates are your thing, give Predikin a go. If your needs go beyond the web interface, let us know – we’re always on the lookout for interesting collaborations and, in particular, for experimental data on which to test Predikin. I’ll be uploading some talks from the past couple of years to Slideshare in the near future.
Next stop ISMB 2008 – but that’s another post.