Not as many structures as you might think

I’m in the midst of preparing a talk for next Monday, and it occurred to me that perhaps we don’t see more protein structure-based prediction in bioinformatics because there aren’t enough structures.



Sure, the PDB has grown a lot in the past 5 years or so and 53 103 structures (as of now) looks impressive. However, if you’re interested in protein-protein interaction, you want at least 2 chains, which more or less halves the dataset. If you want two different protein chains, you lose almost another 75%. Let’s specify a reasonable minimum resolution for X-ray diffraction data and there go ~ 3 000 entries. We probably don’t want multiple, similar proteins, so let’s remove redundancy at 90% sequence identity. We’re left with about 2% of the original PDB, which might be useable for looking at interactions.
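To make the arithmetic concrete, here is a rough sketch of that filtering funnel in Python. It only uses the approximate fractions quoted above; reading "another 75%" as 75% of the multi-chain entries is an assumption, and the step labels and the `run_funnel` helper are made up for illustration. No real PDB query is involved.

```python
# Back-of-the-envelope funnel using the approximate figures from the post.
# Assumption: "lose almost another 75%" applies to the entries that remain
# after the multi-chain filter, not to the whole PDB.
TOTAL_PDB = 53_103  # entries in the PDB at the time of writing

STEPS = [
    ("at least 2 chains (roughly halves)", lambda n: n * 0.5),
    ("two different protein chains (lose ~75%)", lambda n: n * 0.25),
    ("minimum X-ray resolution (lose ~3 000)", lambda n: n - 3_000),
]


def run_funnel(total, steps):
    """Apply each filter in turn and record the approximate count left."""
    rows = [("all PDB entries", total)]
    remaining = float(total)
    for label, apply_filter in steps:
        remaining = apply_filter(remaining)
        rows.append((label, round(remaining)))
    return rows


rows = run_funnel(TOTAL_PDB, STEPS)

# The post only gives the end point after redundancy removal: about 2%.
final_estimate = round(TOTAL_PDB * 0.02)

for label, count in rows:
    print(f"{label:42s} {count:6d}  ({100 * count / TOTAL_PDB:5.1f}%)")
print(f"{'non-redundant at 90% identity (~2%)':42s} {final_estimate:6d}")
```

On these numbers, roughly 3 600 entries survive the first three filters and about 1 000 remain after redundancy removal. The size of that last drop is inferred from the quoted 2%, not a figure stated directly.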

No wonder that most bioinformatics focuses on sequences and high-throughput interaction data.

4 thoughts on “Not as many structures as you might think”

  1. But don’t you remember? According to your “favorite” scientific editorial writer Gregory Petsko, structural biology has “jumped the shark”:

    “The goal of filling in the fold catalog was quickly abandoned, not only because it was too difficult but also because it was certainly true that no one except perhaps a few bioinformaticists cared.”

  2. Abandoned? Surely not. The burgeoning number of crystallization robots, synchrotron sources and neutron facilities says otherwise. Most of G. Petsko’s opinions are valid, and his observations reflect the heady days at the beginning of the structure explosion, but I’ve not met a biologist yet who, when discovering that a structure is finally available for their favourite protein, doesn’t suddenly get a lot more interested in the PDB.

    As for your post, Nick, IMHO the problem you highlight is really another case of the curse of dimensionality or, from an alternative viewpoint, is due simply to the combinatoric nature of interactions. Notwithstanding the large number of biologically relevant transient interactions that may have to be trapped with crosslinks, it’s always going to be harder to search for and observe all stable(ish) complexes than it is to deal with a single protein chain with its known cofactors/ligands.

    However, there’s sufficient data now to analyse at least one or two families… we (ok – the ‘bioinformaticists’) just need a few more families covered to model what might be going on at their peptide interfaces (see Stein and Aloy, PLOS One 2008, for example).

  3. I’ve got to say I would be more impressed by structural genomics efforts if they had stayed the course with tackling the tough ones. Filling in the catalogue is exactly what is needed to make the kinds of thing Neil is talking about possible.

    More widely there is a problem that many of the techniques involved in tackling protein-protein complex structure are relatively low resolution (such as small angle scattering and/or EM) and so do not go into the PDB by default, even when a high resolution model can be built, which is a shame.

  4. It’s an interesting situation!

    A recent look at a large set of non-redundant proteins, oligomers and protein complexes showed that there are approximately 5 times as many homo-oligomers as there are hetero-oligomers or complexes in the PDB. In my set there are 134 protein structures that contain 2 or more chains with at least one chain more than 30% different from the others and 683 protein structures that contain 2 or more chains which are mutually within 30% similarity. This is a stricter criterion for defining ‘different protein chains’ than is used above (if I follow correctly), which serves to emphasise the discrepancy.

    I think Jim is right. The implication is that we need to merge structural genomics with functional genomics (interactomics) to form the new field of ‘structural interactomics’.

    This type of work is being pioneered by people like Kiyoshi Nagai by combining targeted Y2H experiments with expression / purification protocols for protein structure prediction.

    Not only will this give us vastly more high quality data about PPIs at the atomic level, it should also make all steps of the structure determination pipeline easier!

    All hail ‘structural interactomics’! And please don’t forget my 10% fee in all grants that use that term ;-)
