Not as many structures as you might think

I’m in the midst of preparing a talk for next Monday, and it occurred to me that perhaps we don’t see more protein structure-based prediction in bioinformatics because there aren’t enough structures.



Sure, the PDB has grown a lot in the past 5 years or so, and 53 103 structures (as of now) looks impressive. However, if you’re interested in protein-protein interactions, you want at least 2 chains, which more or less halves the dataset. If you want two different protein chains, you lose almost 75% of what remains. Specify a reasonable resolution cutoff for X-ray diffraction data and there go another ~3 000 entries. We probably don’t want multiple, similar proteins, so let’s remove redundancy at 90% sequence identity. We’re left with about 2% of the original PDB that might be usable for looking at interactions.
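
To make the shrinkage concrete, here is a rough back-of-the-envelope sketch in Python. It treats the approximate figures quoted above as exact multipliers, which is an assumption: "halves" becomes 0.5, "almost 75%" becomes 0.75, and the final count is taken as 2% of the starting total. The real numbers from a PDB query will differ somewhat.

```python
# Rough filtering funnel using the approximate figures quoted in the post.
# Assumption: the quoted percentages are treated as exact multipliers.
total = 53103                      # PDB entries at the time of writing

multi_chain = total * 0.5          # at least 2 chains: roughly half remain
hetero = multi_chain * 0.25        # 2 different chains: ~75% of those are lost
good_resolution = hetero - 3000    # resolution cutoff: ~3 000 more entries go
final = total * 0.02               # after 90% identity filter: ~2% of the PDB

steps = [
    ("all entries", total),
    ("at least 2 chains", multi_chain),
    ("2 different chains", hetero),
    ("resolution cutoff", good_resolution),
    ("non-redundant at 90% identity", final),
]

for label, count in steps:
    print(f"{label:32s} {count:8.0f}  ({count / total:6.1%} of PDB)")
```

On these rounded figures, the final set is on the order of a thousand entries.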

No wonder that most bioinformatics focuses on sequences and high-throughput interaction data.

Rapid command-line access to the PDB

This is hardly earth-shattering stuff, but just for reference.

There are multiple ways to grab PDB files from the RCSB PDB servers. If you know the accession code of a structure, the simplest way is wget (or similar) straight off the FTP or HTTP server:



where XXXX is the 4-character PDB accession code.
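
If you would rather script the download than type it by hand, the same idea can be sketched in Python. The URL pattern below is an assumption on my part (the RCSB file download service at files.rcsb.org), not the archive URL discussed in this post, and the names `pdb_url` and `fetch_pdb` are made up for illustration.

```python
import re
import urllib.request
from typing import Optional

# Assumed URL pattern (not taken from the original post): RCSB file download service.
BASE_URL = "https://files.rcsb.org/download/{code}.pdb"


def pdb_url(code: str) -> str:
    """Build the download URL for a 4-character PDB accession code."""
    code = code.strip().upper()
    if not re.fullmatch(r"[0-9A-Z]{4}", code):
        raise ValueError(f"not a 4-character PDB code: {code!r}")
    return BASE_URL.format(code=code)


def fetch_pdb(code: str, dest: Optional[str] = None) -> str:
    """Download one entry over HTTP and return the local file name."""
    url = pdb_url(code)
    # Default to the file name at the end of the URL, e.g. "1CRN.pdb".
    dest = dest or url.rsplit("/", 1)[-1]
    urllib.request.urlretrieve(url, dest)
    return dest
```

Calling `fetch_pdb("1crn")` should then save the entry as `1CRN.pdb` in the current directory, provided the assumed URL pattern is still valid.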

Note the recent change of URL for the PDB archive. Note also the confusing two, not three, “w”s in the URL.

The Web as science communication platform: two more signs

  1. People are finding many outlets for their work. Pierre maintains a repository of tools where you can find IBDStatus, his latest software for genetic analysis.
  2. Spotted in Nature this week:


Makes perfect sense, doesn’t it? If you publish an article on a structure, include a link to the PDB resource. Yet so far as I can tell this is a new feature, since it jumped out at me. Given that the WWW is such a rich publishing platform, simply because of hyperlinks that connect data, how long before paper copies of all journals are considered quaint and obsolete?