Proteins in the PDB that differ by one amino acid

February 2, 2012 / nsaunders

A question at BioStar: how to “return all pdb ids to a given one that differ only by one amino acid”?

My answer began: “I think it is not too much work to craft a solution using a few tools”, followed by some incomplete ideas. Let’s see if I was right.
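One piece of such a solution is the comparison itself. The following is a minimal sketch in Python, not the approach from the original answer: it assumes "differ by one amino acid" means a single substitution between sequences of equal length (insertions and deletions are ignored), and that chain sequences have already been fetched into a dictionary. The function names and the toy identifiers and sequences are hypothetical.

```python
def differs_by_one_substitution(seq_a: str, seq_b: str) -> bool:
    """Return True if the sequences have equal length and exactly one mismatch."""
    if len(seq_a) != len(seq_b):
        return False
    mismatches = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return mismatches == 1


def single_residue_variants(query_id: str, sequences: dict) -> list:
    """List the ids whose sequence differs from the query by one substitution."""
    query = sequences[query_id]
    return sorted(
        other_id
        for other_id, seq in sequences.items()
        if other_id != query_id and differs_by_one_substitution(query, seq)
    )


# Toy data: made-up identifiers and sequences, not real PDB entries.
toy_sequences = {
    "qry": "MKTAYIAK",    # the query
    "sub": "MKTAYLAK",    # one substitution (I -> L)
    "same": "MKTAYIAK",   # identical to the query
    "two": "MKSAYLAK",    # two substitutions
    "short": "MKTAYIA",   # different length
}

print(single_residue_variants("qry", toy_sequences))
```

With real data, the dictionary would be built from sequences downloaded for each PDB chain; how to obtain them is left open here.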

Interacting with bioinformatics webservers using R

September 8, 2011 / nsaunders / 7 Comments

In an ideal world, all bioinformatics tools would be made available via the Web as a web service with an API, as well as a standalone package to download for local use. This is rarely the case and sometimes, even where one or the other is available, factors such as cost come into play. So we resort to web scraping: writing code that interacts with the code behind a web server to submit queries, then retrieve and parse the results.

Normally, I’d use something like Ruby’s Mechanize library for this purpose. However, where the purpose is to retrieve delimited data for analysis using R, I figured it was time to try and achieve the entire process within R. So here’s how I used the RCurl and XML packages to interact with the WHAT IF server, which provides tools for the analysis of protein structure.

Not as many structures as you might think

September 18, 2008 / September 23, 2008 / nsaunders / 4 Comments

In the midst of preparing a talk for next Monday. It occurred to me that perhaps we don’t see more protein structure-based prediction in bioinformatics because – there aren’t enough structures.

[Figure: pdbstats]

Sure, the PDB has grown a lot in the past 5 years or so and 53 103 structures (as of now) looks impressive. However, if you’re interested in protein-protein interaction, you want at least 2 chains, which more or less halves the dataset. If you want two different protein chains, you lose almost another 75%. Let’s specify a reasonable minimum resolution for X-ray diffraction data and there go ~ 3 000 entries. We probably don’t want multiple, similar proteins so let’s remove redundancy at 90% sequence identity. We’re left with about 2% of the original PDB, which might be usable for looking at interactions.
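The filtering steps can be followed as rough arithmetic. The sketch below uses only the figures quoted in the paragraph and makes two simplifying assumptions: "more or less halves" is taken as exactly 50%, and "almost another 75%" as exactly 75% of what remains. The intermediate counts are therefore approximations, not real PDB query results.

```python
total = 53_103                      # structures in the PDB, as quoted above

multi_chain = total * 0.5           # at least 2 chains: roughly half remain
different_chains = multi_chain * 0.25   # two different chains: ~75% of those lost
good_resolution = different_chains - 3_000  # resolution cutoff: ~3 000 entries go
final = total * 0.02                # about 2% of the original PDB is left

# What the 90% identity filter would have to remove under these assumptions.
removed_as_redundant = good_resolution - final

for label, count in [
    ("all structures", total),
    ("at least 2 chains", multi_chain),
    ("two different chains", different_chains),
    ("after resolution cutoff", good_resolution),
    ("about 2% of the PDB", final),
]:
    print(f"{label:>24}: {count:8.0f}")
```

Under these assumptions, the final set is on the order of a thousand structures.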

No wonder that most bioinformatics focuses on sequences and high-throughput interaction data.

What You're Doing Is Rather Desperate

Notes from the life of a [data] scientist

protein structure
