Managing structural genomics data

I recently discovered the website of the TB Structural Genomics Consortium. It’s very nicely designed and cross-referenced, allowing users to navigate the TB genome and rapidly locate information about protein structures.
The reason that I mention it is that the lab where I work is involved with a structural genomics project and they desperately need a way to manage their data. At the moment it’s a bunch of files in the group Wiki, which is clearly inappropriate for structured data of this nature. I’m resisting the urge to code up a MySQL schema and PHP frontend in the hope that someone else has already done so. I wonder if the software that drives the TB website, or other SG sites, is available? There seems to be a market for this kind of application.
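
For illustration only, here is a minimal sketch of the kind of schema I have in mind. It uses Python with SQLite rather than MySQL and PHP, purely so that it is self-contained. The table names, column names, stage names and helper functions are all hypothetical and are not taken from the TB website or any other SG site. The idea is simply one row per target per pipeline stage, carrying a yes/no status and free-text notes.

```python
import sqlite3

# Hypothetical stage names for a structural genomics pipeline (an assumption,
# not an established standard).
STAGES = ["bioinformatics", "pcr", "cloning", "expression", "crystallisation"]

# Hypothetical schema: one table of targets, one table of per-stage results.
SCHEMA = """
CREATE TABLE target (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE stage_result (
    target_id INTEGER NOT NULL REFERENCES target(id),
    stage     TEXT NOT NULL,
    passed    INTEGER NOT NULL CHECK (passed IN (0, 1)),
    notes     TEXT,
    PRIMARY KEY (target_id, stage)
);
"""


def open_db(path=":memory:"):
    """Open a database (in memory by default) and create the tables."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_target(conn, name):
    """Insert a target and return its integer id."""
    cur = conn.execute("INSERT INTO target (name) VALUES (?)", (name,))
    return cur.lastrowid


def record(conn, target_id, stage, passed, notes=None):
    """Store (or overwrite) the yes/no status of one stage for one target."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage}")
    conn.execute(
        "INSERT OR REPLACE INTO stage_result (target_id, stage, passed, notes) "
        "VALUES (?, ?, ?, ?)",
        (target_id, stage, int(passed), notes),
    )


def passed_counts(conn):
    """Return {stage: number of targets that passed}, in pipeline order."""
    rows = conn.execute(
        "SELECT stage, SUM(passed) FROM stage_result GROUP BY stage"
    ).fetchall()
    counts = dict(rows)
    return {stage: counts.get(stage, 0) for stage in STAGES}
```

Usage would look something like `conn = open_db()`, then `t = add_target(conn, "target_A")` (a placeholder name), `record(conn, t, "pcr", True)`, and finally `passed_counts(conn)` to see how many targets survived each stage. A real system would also need to store the data files themselves (sequences, chromatograms, yields), which this sketch leaves out.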

12 thoughts on “Managing structural genomics data”

  1. I am surprised that the PSI is not driving some kind of standardization approach in collaboration with RCSB. If I recall correctly, publishing those structures at RCSB is one of the requirements for PSI centers (at least it used to be).

  2. Yes, you would think that there would be some standardisation here. There is a logical workflow to SG projects: (1) characterise protein using bioinformatics, (2) PCR, (3) clone and sequence, (4) express and purify, (5) crystallise and structure. At each stage we obtain certain types of data (sequences, chromatograms, yields) and a yes/no status (sequence is correct, protein is soluble, crystals obtained and so on). It’s very simple, structured data that could easily be managed in a standard way with SQL schemas and queries.

  3. Pingback: Memomics » Blog Archive » Managing structural genomics data

  4. I am no crystallographer, but steps 4 and 5 are historically considered more art than science. I am sure you’ve heard the “I sneezed into the sample and got crystals after having no luck for 2 years” stories. Given that, one would think that storing the procedures that worked in an appropriate way would yield greater reproducibility. What do your structural biology colleagues have to say about this?

  5. “steps 4 and 5 are historically considered more art than science”

    Well, yes and no. In terms of SG projects, the key things are target selection and standardisation. Bioinformatics can’t really predict likelihood of success regarding expression or crystallisation, but it can provide some clues. For example, we can eliminate putative membrane proteins as targets and look for things like rare codons which might lead to low expression.
    Once targets are selected, they’re cloned, expressed and purified using a few variations of standard methods and if sufficient pure protein is obtained, standard crystallisation screens are employed. The key thing in SG projects is high throughput so basically, anything that proves recalcitrant is discarded (unless the target is deemed so important that it justifies extra effort). The hope would be that the standards employed will work for a reasonable number of proteins (perhaps tens or hundreds out of an initial target set of thousands).

    There are a few databases around which contain experimental data such as purification and crystallisation methods, to help other people design their experiments. I’m not sure how much use they get or if their existence is widely known.

  6. I might add that after the series of inevitable failures along each step of the structural genomics pipeline (expression, purification, crystallization, data collection…) the “throughput” of structural genomics is not spectacularly “high”: around 2 structures per 100 clones tried. (You can arrive at this statistic by looking at public data on SG targets maintained by the PDB.)

    This reflects the fact that, on the whole, each protein has individual requirements for these procedures. They are a different kettle of fish in high-throughput biology, compared to, say, sequencing of cDNAs.

  7. “the “throughput” of structural genomics is not spectacularly “high””

    Yes indeed. Also, the cost per protein is much higher than we would like. I was told yesterday that at the current average cost/solved protein from SG projects, a protein crystal is about 12 000x the cost of an equivalent flawless diamond crystal.

    A notable example of efficiency seems to be the Joint Center for Structural Genomics, who have apparently crystallised 24% of the Thermotoga maritima proteome. Admittedly this organism is a thermophilic bacterium, so perhaps less challenging than a complex eukaryote, and I don’t have figures on cost/protein.

  8. Even that number (24%) is much lower than the original lofty goals of the PSI. The other part that hasn’t quite worked, somewhat validated by the recent decision by the PDB not to include computational models, was the plan to fill out the structure space using homology models.

    From the discussions I have been part of on predicting crystallization conditions using informatics approaches, the general consensus has always been that, given what we know, the error rates are still too high for computation to provide any kind of predictivity that would be practical.

    One more note: while the SG projects have often been focussed on throughput, over the years the reality moved towards solving those structures that could be crystallized, and that has resulted in fewer unique folds being solved than the original goals called for.

    Despite all that, the PSI has been a boon. Without it, I suspect we would be some years away from the kind of structural coverage that we see today, and without many of the methodologies that have arisen in the effort to solve structures fast.

  9. I agree with all of the last points. I was told that for the T. maritima proteome, they use an in-house homology modelling system which can deal with identities as low as 11% and has modelled 72% of the proteome. How “reliable” those models are, I couldn’t say. My impression is that even “good” models are only of any use for giving you an idea of backbone conformation.
    Bioinformatics certainly can’t predict crystallisation conditions. I guess there’s plenty of data but you don’t really have a negative dataset – the assumption being that everything could in principle crystallise, if only you had the right conditions, rather than that some proteins cannot be crystallised. I think there is some progress in indicating the likelihood of crystallisation based on simple parameters (size, pI, oligomeric state) but again, probably not practical yet.
    And as you say, PSI and SG projects have driven methodology and certainly increased the size of the PDB considerably in the last few years, which is a good thing.

  10. The best models are certainly usable, but you are quite right. One should not even try to use anything but the “best” models for something like structure-based drug design.

  11. Pingback: Structure prediction has a long way to go - The PDB says “no” to computational models at business|bytes|genes|molecules

  12. I would just add that good models can be used for predictions. Here in the lab we have been working on structure-based prediction of protein interactions, and we have managed to use (good) homology models for some of the predictions with the same success as with crystal structures. Also, even in cases where there is a known crystal structure, it is useful to try to model the sequence on similar structures to get a feel for the possible backbone variations.
