Notes from the day job: published #3, #4, 2008

This year, I’ve experienced what bloggers call – um, not blogging very much. One reason is that much of our conversation has moved to other services – notably FriendFeed. However, the main reason is that I have a day job: develop bioinformatics applications, perform research, publish articles, present talks and keep the boss happy. Read on for some “notes from the day job” – especially if protein kinases and their substrates are your thing.

When I arrived at UQ a couple of years ago, I inherited a project named Predikin. The goal of Predikin is to predict substrates for protein kinases, using structural features in the kinase catalytic domain. It’s a simple but effective idea: you look at the available structures of protein kinase-substrate peptide complexes and locate the kinase amino acid residues that determine how the peptide (XXX[STY]XXX) “fits” into the binding pocket. We call these residues SDRs. With enough examples of kinase sequences and known substrates, you can create a database that links SDRs to substrate peptides. You can then take a query kinase sequence, identify the SDRs, make a list of substrate peptides for kinases with similar SDRs and use that list to build a scoring matrix, which you then use to scan and score putative substrates for your query kinase.
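
To make the lookup step concrete, here is a minimal Python sketch. It is not Predikin code (Predikin is written in Perl); the toy database, the SDR strings, the peptides, the similarity measure and the 0.75 cutoff are all invented for illustration.

```python
from collections import Counter

# Hypothetical toy database: SDR string -> known substrate heptapeptides
# (XXX[STY]XXX). Every sequence here is made up, not taken from PredikinDB.
TOY_DB = {
    "LEKT": ["RRASVAG", "KRQSLEE"],
    "LEKS": ["RKASFDD"],
    "DDWG": ["EEDYAAP"],
}


def sdr_similarity(a, b):
    """Fraction of positions at which two equal-length SDR strings agree.

    This is an assumed, deliberately crude similarity measure.
    """
    if len(a) != len(b):
        raise ValueError("SDR strings must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def peptides_for_query(query_sdrs, db, min_similarity=0.75):
    """Collect substrate peptides from kinases whose SDRs resemble the query."""
    peptides = []
    for sdrs, known_peptides in db.items():
        if sdr_similarity(query_sdrs, sdrs) >= min_similarity:
            peptides.extend(known_peptides)
    return peptides


def position_counts(peptides):
    """Count residues at each peptide position (the raw material for a matrix)."""
    if not peptides:
        return []
    length = len(peptides[0])
    return [Counter(p[i] for p in peptides) for i in range(length)]


if __name__ == "__main__":
    hits = peptides_for_query("LEKT", TOY_DB)
    for position, counts in enumerate(position_counts(hits)):
        print(position, dict(counts))
```

The per-position counts are what a scoring matrix would be built from; how similar two sets of SDRs need to be before their peptides are pooled is a design decision that this sketch simply hard-codes.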

What I actually inherited was ~1800 lines of JavaScript and a “WWW circa 1996” web interface. The JavaScript employed some rather simple “heuristic” rules of the form: if (kinase residue A = B) then (substrate residue C = D). Main problem: multiple cases where these rules did not hold. Hence the need for a database of known peptides, from which we can assign scores based on probability, rather than the old “if then else” rules.
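
The difference between the two approaches can be sketched in a few lines of Python. The rule, the residue pairs and the function names below are hypothetical; they only illustrate why counting observed cases copes with exceptions where a fixed rule does not.

```python
from collections import Counter, defaultdict

# Invented observations: (kinase residue at one SDR position,
# substrate residue seen at the peptide position it contacts).
OBSERVATIONS = [("E", "R"), ("E", "R"), ("E", "K"), ("E", "R"), ("L", "F")]


def rule_based(kinase_residue):
    """Old style: one fixed answer per kinase residue (hypothetical rule)."""
    return "R" if kinase_residue == "E" else None


def conditional_probabilities(observations):
    """Estimate P(substrate residue | kinase residue) from observed pairs."""
    counts = defaultdict(Counter)
    for kinase_residue, substrate_residue in observations:
        counts[kinase_residue][substrate_residue] += 1
    return {
        kinase_residue: {
            residue: n / sum(counter.values())
            for residue, n in counter.items()
        }
        for kinase_residue, counter in counts.items()
    }


if __name__ == "__main__":
    print(rule_based("E"))
    print(conditional_probabilities(OBSERVATIONS)["E"])
```

With these toy data the rule insists on R whenever the kinase residue is E, while the counts record that K was also observed in one case out of four, so the exception lowers a score instead of breaking the prediction.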

My main task then, was to make things better. This involved:

  • Rewriting Predikin: I chose Perl, mainly for the BioPerl libraries and wrote my first module, Predikin.pm, with a bunch of methods for analysing protein kinases and substrates
  • Extensive use of HMMs: from SMART, for alignment of a query kinase with a model of the kinase catalytic domain to identify SDRs; from PANTHER and/or the Kinase Sequence Database, to classify a query kinase into a family
  • Lots of MySQL: to build the backend database of kinases, substrates and peptides
  • Introducing various filters (based, for example, on predicted disorder) to eliminate less likely substrate peptides from prediction
  • Figuring out the best way to implement PFMs and PWMs (position frequency and position weight matrices) and to use them to score putative substrate sequences
  • Building a new web interface – which explains why “running a background process using PHP” is consistently the most-viewed post on this blog
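
For readers who have not met PFMs and PWMs, here is a minimal Python sketch of one common way to build and use them. It is not how Predikin does it: the uniform background, the pseudocount of 1, the log-odds scoring and the heptapeptide window are all assumptions, and the sequences are invented.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # standard residues only (assumption)


def build_pfm(peptides):
    """Position frequency matrix: residue counts per peptide position."""
    length = len(peptides[0])
    pfm = [{aa: 0 for aa in AMINO_ACIDS} for _ in range(length)]
    for peptide in peptides:
        for i, aa in enumerate(peptide):
            pfm[i][aa] += 1
    return pfm


def pfm_to_pwm(pfm, pseudocount=1.0):
    """Position weight matrix: log2 odds against a uniform background."""
    background = 1.0 / len(AMINO_ACIDS)
    pwm = []
    for column in pfm:
        total = sum(column.values()) + pseudocount * len(AMINO_ACIDS)
        pwm.append({
            aa: math.log2(((n + pseudocount) / total) / background)
            for aa, n in column.items()
        })
    return pwm


def score(pwm, peptide):
    """Sum the per-position weights for one peptide."""
    return sum(column[aa] for column, aa in zip(pwm, peptide))


def scan(pwm, sequence):
    """Score every full-length window centred on S, T or Y, best first.

    Returns (1-based site position, window, score) tuples.
    """
    flank = len(pwm) // 2
    hits = []
    for i in range(flank, len(sequence) - flank):
        if sequence[i] in "STY":
            window = sequence[i - flank:i + flank + 1]
            hits.append((i + 1, window, score(pwm, window)))
    return sorted(hits, key=lambda hit: hit[2], reverse=True)


if __name__ == "__main__":
    training = ["RRASVAG", "KRQSLEE", "RKASFDD"]  # invented peptides
    pwm = pfm_to_pwm(build_pfm(training))
    for position, window, value in scan(pwm, "MGGRRASVAGKLTPEE"):
        print(position, window, round(value, 2))
```

The pseudocount stops a residue that was never observed at a position from scoring minus infinity, which matters when the matrix is built from only a handful of peptides.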

Having done all that, it was time to try and publish the work.

1. The webserver

Saunders, N.F.W. and Kobe, B. (2008).
The Predikin webserver: improved prediction of protein kinase peptide specificity using structural information.
Nucleic Acids Res. [Open Access] | [PubMed]

This went quite smoothly. Am I happy with the webserver? Well, it works. If I were designing it again today, I’d do it quite differently – but that’s always the way. When the boss says “can we have the webserver ready for my India meeting in 2 weeks’ time”, you don’t get great software design. The combination of PHP talking to Perl via a PECL module is not something of which I’m proud – it seemed clever at the time, but makes maintenance a nightmare.

2. The details

Saunders, N.F.W., Brinkworth, R.I., Huber, T., Kemp, B.E. and Kobe, B. (2008).
Predikin and PredikinDB: a computational framework for the prediction of protein kinase peptide specificity and an associated database of phosphorylation sites.
BMC Bioinformatics 9:245. [Open Access] | [PubMed]

This did not go quite so smoothly. It was in review for the best part of a year, due to a variety of incidents with which I won’t bore you. However, I did learn an awful lot about ROC curves, using R and cross-validation during the process.
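
As a reminder of what a ROC curve summarises, here is a minimal Python sketch of the area under the curve (AUC), computed as the probability that a randomly chosen true substrate outscores a randomly chosen non-substrate. The scores are invented and this is not the evaluation from the paper, which was done in R.

```python
def roc_auc(positive_scores, negative_scores):
    """Area under the ROC curve via pairwise comparison; ties count half."""
    if not positive_scores or not negative_scores:
        raise ValueError("need at least one positive and one negative score")
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


if __name__ == "__main__":
    # Invented scores for known substrates and for non-substrates.
    print(roc_auc([3.0, 2.0], [1.0, 2.0]))
```

An AUC of 0.5 means the scores separate the two classes no better than chance and 1.0 means perfect separation. In cross-validation, the positive scores would come from peptides held out when the scoring matrix was built.
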
You’ll note that it ended up as a Software article, which troubles me because Predikin is not something that you could easily download, set up and run. Once again, the pressures of “publish or perish” lead to sub-optimal design and documentation. However, the paper is really about how Predikin works in terms of algorithms and components. I hope it succeeds from that point of view.

So there it is. If protein kinases and their substrates are your thing, give Predikin a go. If your needs go beyond the web interface, let us know – we’re always on the lookout for interesting collaborations and particularly, experimental data on which to test Predikin. I’ll be uploading some talks from the past couple of years to Slideshare in the near future.

Next stop ISMB 2008 – but that’s another post.

2 thoughts on “Notes from the day job: published #3, #4, 2008”

  1. agbiotec

    Bad software design caused by the rush to get things ready is evident all over the bioinformatics field. I have experienced it personally, and in addition to your “publish or perish” reason, with which I completely agree, I would add one more: many (probably the greater part) of the P.I.s on bioinformatics grants are biologists. This translates to a boss who knows little or nothing about implementing software and has no notion of software usability, or that a clean implementation needs time, testing, user feedback etc. We also have to assign a portion of the responsibility to the funding agencies, which again put mainly biologists on their committees, who again have no idea about software; they set a timeline by which the funded research has to be out, but no standards for how the software should be implemented. This, in my humble opinion, is the whole reason we see so much replicated effort in bioinformatics development. A little bit of a corporate approach, where the software is a product and if it’s bad it will not sell, would not hurt, I think. Having interoperable software in bioinformatics is a whole ‘nother story – even commercial packages do not work well with each other, because development is closed behind each company’s walls. But the difference with commercial software is the quality of the code written, primarily because of the procedures that test it, but also because there isn’t a single grad student who has to build the whole thing, take classes, write a thesis and, oh, not forget to please the boss with the biological insights that he or she gets from using the software…

  2. Pingback: software quality in bioinformatics… « Web 2.0 and Semantic Web for Bioinformatics
