How to: create bioinformatics code from a published article

It’s been a while since I posted a “how to” article – and now is a good time, as Bio::Blogs #11 is going to feature a special tips and tricks section.

So today, I’d like to illustrate how to create some useful code for bioinformatics analysis by reading a scientific paper.

Our starting point is:

Uversky, V.N., Gillespie, J.R. and Fink A.L. (2000)
Why are “natively unfolded” proteins unstructured under physiologic conditions?
Proteins 41(3): 415-427
PubMed | Full text

Not open access unfortunately, so you might not be able to read it. Here’s a quick summary: the authors defined natively unfolded proteins as those that lack ordered structure at pH 7 in vitro. They analysed 91 such proteins and found that they tend to have a low overall hydrophobicity and large net charge, compared with folded proteins. In fact, they even derived a simple linear equation that divides proteins into natively unfolded or folded based on hydrophobicity and charge.

It’s an interesting paper but doesn’t give us many clues as to precisely how their analyses were carried out – there’s no software to download, for example. So there’s a nice little project – figure out what they did, implement it in the language of your choice, apply it to lots of proteins in various biological contexts.

Step 1: figure out what they did

First, we read the materials and methods. Here are the key sentences:

  • “The hydrophobicity of individual residues was normalised to a scale of 0 to 1 in these calculations”
  • “The mean hydrophobicity is defined as the sum of the normalized hydrophobicities of all residues divided by the number of residues in the polypeptide”
  • “The mean net charge is defined as the net charge at pH 7.0, divided by the total number of residues”

So far, so good. We can calculate the charge at pH 7.0 using an EMBOSS program called iep. Everything else we can code in whatever we like – I’ll be using Perl. Hang on though – let’s risk the wrath of Wiley’s lawyers and take a look at Figure 4 from the paper. We might expect some proteins to have a negative mean net charge – but we only see values greater than zero in this paper. Eventually I figured out that what they are referring to is absolute net charge – for instance, a charge of -10 becomes 10.

Step 2: Start coding!
We’ll be writing a Perl script that uses a bit of Bioperl. I’ve called it “” and the complete script is online for your inspection.

Define modules and variables
First, we define the modules that we need and a few variables:

use strict;
use Bio::SeqIO;
use IO::String
use Data::Dumper;

my $infile = shift || die("Usage =  <protein fasta file>\n");

my %hydro = ('A' =>  1.800, 'C' =>  2.500, 'D' => -3.500, 'E' => -3.500, 'F' =>  2.800,
	     'G' => -0.400, 'H' => -3.200, 'I' =>  4.500, 'K' => -3.900, 'L' =>  3.800,
	     'M' =>  1.900, 'N' => -3.500, 'P' => -1.600, 'Q' => -3.500, 'R' => -4.500,
	     'S' => -0.800, 'T' => -0.700, 'V' =>  4.200, 'W' => -0.900, 'Y' => -1.300

Always use strict. Bio::SeqIO is for handling sequence data. IO::String allows us to read strings as though they were files – we’ll see why we need this a little later. Data::Dumper is used only for debugging – I find it useful to store variables in data structures and print them out for inspection as I work. We also have a hash, %hydro, that defines the hydrophobicity value for each amino acid. These are standard biochemical values – if you have the NCBI data directory somewhere on your machine, you can find them in a file called “KSkyte.flt”. More negative = less hydrophobic and vice versa.

We’ll run our script by typing “perl myfile.fa”, where “myfile.fa” contains protein sequences in fasta format. If we omit the file name, the script dies and tells us the correct syntax.

OK. Now we’re going to define 5 subroutines, which we’ll call in order like this:

## subroutines
# normalise hydrophobicity values +ve
my %normal = normalise_hydro();
# create hash for sequence data
my %seqs   = get_seqs();
# calculate mean hydrophobicity of protein
# calculate mean charge of protein
# print output

1. Normalise the hydrophobicity values
Here’s the first, named normalise_hydro():

sub normalise_hydro {
my %normal;

foreach (sort keys %hydro) {
    my $norm = ($hydro{$_} + 4.5)/9;
       $normal{$_} = $norm;
return %normal;

Recall from the paper that hydrophobicity values were normalised to lie between 0 and 1. That’s what this code does – the minimum value in %hydro is -4.5, so we add 4.5 to get to zero. The range is -4.5 to 4.5 = 9, so that’s what we divide by to get things between 0 and 1. We store the results in a new hash called %normal and return it. OK – we could have just precalculated the corrected values and put them into %hydro – but this is a good illustration of converting from the paper to code.

2. Process the input sequences
Next up, we have get_seqs():

sub get_seqs {
my %seqs;
my $seqio = Bio::SeqIO->new('-file' => $infile, '-format' => 'fasta');
my ($id, $seq, $len);
while(my $seqobj = $seqio->next_seq)  {
    eval {
	$id  = $seqobj->id;
	$seq = $seqobj->seq;
	$len = $seqobj->length;
    if($@) {
	print "Problem with an input sequence - check fasta file\nException was:\n$@";
    elsif($seq =~/[BJOUXZ]/) {
	print "Problem with sequence $id: bad alphabet\n";
else {
     $seqs{$id}{'seq'} = $seq;
     $seqs{$id}{'len'} = $len;
return %seqs;

This is a little more involved. We read in the fasta file specified on the command line and convert it to a Bio::SeqIO object. Next, we loop through each sequence object and use Bioperl methods to get the sequence ID, sequence and length. We use eval{} to check that everything is OK. If it isn’t, the Perl special variable $@ catches it and tells us the problem. Otherwise we store the sequence and its length in a hash, %seqs, using the ID as the hash key. The thing to note here is the declaration of the variables outside the eval{} so as they are available to it.
We also use a regular expression to check that the sequence contains only the 20 standard amino acid alphabet. This is because the iep program used later on will throw errors otherwise.

3. Calculate mean hydrophobicity for each protein
Our next subroutine is named mean_hydro() and as the name suggests, calculates mean hydrophobicity for each protein as defined in the paper:

sub mean_hydro {
    for my $id (keys %seqs) {
	my $sum_hydro  = 0;

for my $key (sort keys %normal) {
    while($seqs{$id}{'seq'} =~/$key/g) {
	  $sum_hydro += $normal{$key};
	$seqs{$id}{'meanHydro'} = $sum_hydro/$seqs{$id}{'len'};

For each protein in %seqs, we get the sequence, $seqs{$id}{‘seq’}. Then, we loop through %normal, matching the amino acid residue one-letter codes to the sequence wherever they occur, using a regular expression match. We sum up the hydrophobicity values over the whole sequence, $sum_hydro and finally, divide that by the number of residues, $seqs{$id}{‘len’} to get the mean hydrophobicity of the protein. That value is stored in %seqs, under the appropriate ID key with its own key, meanHydro.

It’s not coffee time just yet. We still need to calculate the mean net charge.

4. Calculate mean net charge for each protein
Now, Bioperl does contain modules that run EMBOSS applications. They are not very sophisticated in that they generally require files as input, unlike other Bioperl modules that can use sequence objects. We don’t want to be bothered writing our sequences to a file, running iep, writing the output to another file and reading it in, so we’ll use a little trick. Our next subroutine is called mean_charge():

sub mean_charge {
for my $id (keys %seqs)  {
    my $charge = 0;
    my $iep = `echo $seqs{$id}{'seq'} | iep -filter`;
    my $io  = IO::String->new($iep);
    while(<$io>) {
    if(/^\s+(7\.00)\s+(.*?)\s+(.*?)$/) {
	$charge = abs($3);
	$seqs{$id}{'meanCharge'} = $charge/$seqs{$id}{'len'};

Once again, we start by looping through the ID keys in %seqs. The trick is in line 4 of this subroutine. The iep program has a switch named “-filter” which means rather than read/write files, use STDIN as input and write STDOUT as output. So we can just pipe our sequence, $seqs{$id}{‘seq’}, to iep then capture the output using the Perl backtick operator and assign it to a variable, $iep.
Our next trick is to convert $iep to an IO::String object, called $io. This enables us to read the output one line at a time, as though the string were read from a file and match each line to a regular expression. There are other ways to do this – for instance we could redefine the input record separator, $/, so as to read all of $iep in one go – but this is a nice demo of IO::String.

The iep program generates lines of output showing the calculated charge on the protein from pH 0 to 14 at 0.5 unit intervals, like this:

   pH     Bound    Charge
  6.50    66.50    10.50
  7.00    62.29     6.29
  7.50    59.35     3.35

We want to know the charge at pH 7.0, so we use a regular expression to look for that line. The last column, $3, is assigned to a variable, $charge. Actually – we use abs($3) – remember, we want absolute charge. Finally, we divide by sequence length to get mean net charge and store that in our hash, %seqs, under the key meanCharge.

5. Print output
All that remains is to loop through %seqs and print out the sequence ID, mean hydrophobicity and mean net charge. Bring on the last subroutine, print_output():

sub print_output {
my $out = "";
    for my $id (sort keys %seqs) {
	$out .= "$id\t".
	        sprintf("%.3f", $seqs{$id}{'meanHydro'})."\t".
	        sprintf("%.3f", $seqs{$id}{'meanCharge'})."\n";
open OUT, ">uversky.txt";
print OUT $out;
close OUT;

print "Output in file uversky.txt\n";

Not much to talk about here – we use sprintf() to round values to three decimal places and print out the data in a 3-column tab-delimited format: sequence ID, mean hydrophobicity, mean net charge. We print to a filehandle OUT so as any error messages don’t get mixed in with the output data.

It looks like this (for a few test cases from UniProt):

MERTK_HUMAN     0.486     0.019
GSK3B_MOUSE     0.465     0.025
ARBK1_HUMAN     0.448     0.004
ARBK2_RAT       0.452     0.008
CDC2_HUMAN      0.469     0.013
CDC2_MOUSE      0.473     0.014

So – what do we do with it now?
There are plenty of things that we can do with the output of this script. Recall from the paper that a linear function separates natively unfolded from folded proteins. The equation derived by the authors is:

R = 2.785 * H – 1.151

where R is mean net charge and H is mean hydrophobicity. So for each of our values of H, we can calculate a predicted value of R. If the real value of R is greater (above the line), we predict that the protein is natively unfolded, if less we predict that it is folded. Here’s a modified print_output() subroutine to demonstrate:

sub print {
my $out = "";
    for my $id (sort keys %seqs) {
	my $predCharge = 2.785*$seqs{$id}{'meanHydro'} - 1.151;
	$out .= "$id\t".
	        sprintf("%.3f", $seqs{$id}{'meanHydro'})."\t".
	        sprintf("%.3f", $seqs{$id}{'meanCharge'})."\t";

	if($seqs{$id}{'meanCharge'} - $predCharge > 0) {
	    $out .= "nu\n";
	else {
	    $out .= "f\n";
open OUT, ">uversky.txt";
print OUT $out;
close OUT;

print "Output in file uversky.txt\n";

uversky.pngTab- (or other) delimited data is great for lots of applications – particularly plotting. We can create something like Figure 4 from the original paper using the output file uversky.txt – it’s all of two commands using gnuplot:

gnuplot> plot 'uversky.txt' using 2:3
gnuplot> replot f(x) = x*a-1.151, a=2.785, f(x) with lines

All that remains now is to think of some interesting test cases. Comparison of homologous sequences from a range of organisms? Proteins from bacteria that grow at different temperatures? Whole genome analyses? It’s up to you – and more importantly, your biologist colleagues, to come up with research ideas.

I think this little script nicely demonstrates several key ideas:

  • How to extract data from a journal article and convert it to code
  • Using a toolkit or library (Bioperl)
  • Understanding the pitfalls of input, processing and output (parsing, potential errors, command-line switches, regex)
  • Generating output that can be used in many ways (delimited plain text)
  • The power and speed of open-source tools such as gnuplot

Hope you find it useful – and let me know if you spotted any errors.

7 thoughts on “How to: create bioinformatics code from a published article

  1. Nice demonstration, thanks !
    Does your interpretation of hydrophobicity and charge lead to the same conclusions as those used in the paper? Your figure looks quite different from their figure 4, but maybe you did not include so many natively unfolded proteins in your example set.

  2. Your figure looks quite different from their figure 4

    Ah – what I failed to make clear is that I wasn’t trying to recreate Figure 4 – just something that looks like it. I just grabbed a random fasta file (209 protein kinases) from my hard drive and ran it through the script. The data in the paper was different in that whether the proteins are natively unfolded or not is known experimentally.

    Which raises an interesting point – validating the findings of a paper can be time-consuming. Without supplementary files such as the sequences used, you normally have to go and get them manually from a database.

  3. One thing to consider when validating results is that there really is no one “hydrophobicity” measurement — unless the paper (which I can’t access) specifically said they used Kyte-Doolittle, I wouldn’t make the assumption that they did.

    Back when I was studying thermostability, I used to use hydrophobicity scales a lot — AAINDEX includes practically a dozen hydrophobicity/hydropathicity scales and many aren’t correlated (not even negatively).

  4. there really is no one “hydrophobicity” measurement
    Yes, I made hydrophobicity scales sound simpler and more standard than they are in reality. From Materials and Methods: “The hydrophobicity of each amino acid sequence was calculated by the Kyte and Doolittle approximation”. Remind me to use an open access paper for the next demonstration.
    AAINDEX is a good resource – the values used in the post correspond to entry KYTJ820101.

  5. Pingback: Bio::blogs 11 - special edition

Comments are closed.