Proteins in the PDB that differ by one amino acid

A question at BioStar: how to “return all pdb ids to a given one that differ only by one amino acid”?

My answer began: “I think it is not too much work to craft a solution using a few tools”, followed by some incomplete ideas. Let’s see if I was right.

I’m not sure that differences of one amino acid between PDB files makes sense, so we’ll look at differences between PDB chains.

1. Retrieve sequences of PDB chains in FASTA format from NCBI
This step is easy enough:

wget ftp://ftp.ncbi.nih.gov/blast/db/FASTA/pdbaa.gz
gunzip pdbaa.gz

2. Index the sequences
We want to index the PDB sequence file, so as we can quickly extract subsets of sequences from it later on. I am very old-fashioned, so I index using formatdb in the old BLAST (not BLAST+) package and retrieve using fastacmd:

formatdb -i pdbaa -o T
# Formatted 56014 sequences in volume 0

3. Cluster the sequences using CD-HIT
Using CD-HIT is also easy:

cd-hit -i pdbaa -o pdb90 -c 0.9 -n 5

4. Parse the cluster file to extract IDs for sequences in each cluster
The previous step generates the file of clusters, pdb90.clstr. Time for some quick and dirty Perl to parse it:

#!/usr/bin/perl -w 

use strict;
my %clusters;
my $key = "";

open IN, "pdb90.clstr";
while(<IN>) {
    chomp;
    if(/^>(Cluster.*?)$/) {
	$key = $1;
    }
    if(/gi\|(\d+)\|/) {
	push(@{$clusters{$key}}, $1);
    }
}
close IN;

foreach $key(keys %clusters) {
    if($#{$clusters{$key}} > 0) {
	my $f = $key;
	$f =~s/\s+/_/g;
	open OUT, ">$f.txt";
	foreach(@{$clusters{$key}}) {
	    print OUT "$_\n";
	}
	close OUT;
    }
}

That writes out files with names like Cluster_25185.txt, containing the GI for the sequence of each chain in the cluster, one per line. There are 9 025 of them. Example:

294979455
294979456

5. Generate new FASTA format files for the sequences in each cluster
Now we can use the files generated in the previous step to extract sequences for each cluster from the original, indexed PDB file:

find ./ -name "Cluster_*.txt" -exec fastacmd -d pdbaa -i {} -o {}.fa \;

9 025 FASTA files. Example:

>gi|294979455|pdb|3H0F|A Chain A, Crystal Structure Of The Human Fyn Sh3 R96w Mutant
GSMTGTLRTRGGTGVTLFVALYDYEAWTEDDLSFHKGEKFQILNSSEGDWWEARSLTTGETGYIPSNYVAPVD
>gi|294979456|pdb|3H0H|A Chain A, Human Fyn Sh3 Domain R96i Mutant, Crystal Form I >gi|294979457|pdb|3H0I|A Chain A, Human Fyn Sh3 Domain R96i Mutant, Crystal Form Ii >gi|294979458|pdb|3H0I|B Chain B, Human Fyn Sh3 Domain R96i Mutant, Crystal Form Ii
GSMTGTLRTRGGTGVTLFVALYDYEAITEDDLSFHKGEKFQILNSSEGDWWEARSLTTGETGYIPSNYVAPVD

6. Align sequences in each cluster to each other
In my original answer, I suggested needleall from EMBOSS. Not the smartest idea since when you align a sequence file to itself, every pairwise alignment is duplicated in the output. We’ll press on regardless and deal with the doubling-up later:

find ./ -name "Cluster*.fa" -exec needleall -asequence {} -bsequence {} -auto -aformat3 pair -sprotein1 1 -sprotein2 1 -outfile {}.aln \;

7. Parse the alignment files to find pairs that differ by 1 amino acid
Here’s a portion of the alignment output file Cluster_1630.txt.fa.aln, showing what we want to see – an identity of (length – 1) / length:

# Aligned_sequences: 2
# 1: 2BHPA
# 2: 1UZBA
# Matrix: EBLOSUM62
# Gap_penalty: 10.0
# Extend_penalty: 0.5
#
# Length: 516
# Identity:     515/516 (99.8%)
# Similarity:   516/516 (100.0%)
# Gaps:           0/516 ( 0.0%)
# Score: 2650.0

Bioperl has some nice modules to handle EMBOSS alignment formats, but we’ll stick with the kind of dirty, nasty Perl one writes when first starting out in bioinformatics. This code pulls out, from each alignment, the pair of PDB IDs, length, identity and number of gaps. If identity = (length – 1) / length, it prints out that information. Remember that a gap could create the difference of one amino acid. Also, there’s a check using a hash key to get rid of the duplicate pairs; i.e. if we’ve already seen pair A-B, we don’t print pair B-A.

#!/usr/bin/perl -w 

use strict;
my $file = shift;
my @vals;
my %pairs;

open IN, $file;
while(<IN>) {
    chomp;
    if(/^#\s+1:\s+(.*?)$/) {
	$vals[0] = $1;
    }
    if(/^#\s+2:\s+(.*?)$/) {
	$vals[1] = $1;
	$pairs{$vals[1].$vals[0]} += 1;
    }
    if(/^#\s+Length:\s+(.*?)$/) {
	$vals[2] = $1;
    }
    if(/^#\s+Identity:\s+(\d+)\/.*?$/) {
	$vals[3] = $1;
    }
    if(/^#\s+Gaps:\s+(\d+)\/.*?$/) {
	$vals[4] = $1;
    }
    if($#vals == 4) {
	if($vals[2] - $vals[3] == 1) {
	    unless(exists $pairs{$vals[0].$vals[1]}) {
		print join(",", @vals), "\n";
	    }
	}
	@vals = ();
    }
}
close IN;

We can run that over all alignment files:

find ./ -name "*.aln" -exec ./aln.pl {} > oneaa.csv \;

Are we there yet? Arguably yes – we can “return all PDB ids to a given one that differ only by one amino acid.” Just use grep:

grep 3BES oneaa.csv 
# an example with 1 gap
# 3BESL,1HIGA,139,138,1

grep 1QT8 oneaa.csv
# an example with no gaps, 1 amino acid different
# 131LA,1QT8A,164,163,0
# 1QT4A,1QT8A,164,163,0
# 148LE,1QT8A,164,163,0
# 1QT3A,1QT8A,164,163,0
# 1C6PA,1QT8A,164,163,0

For the interested, there are currently 12 912 pairs of PDB chains that differ by 1 residue. Of those, 1 914 pairs differ due to one gap in the alignment; the rest are due to 1 amino acid change.