A Perl trick that I always forget

Found this buried away in the back of my mind. Let’s say you want to create an array with 20 elements, set to the same initial value. You might do this:

my @array = ();
for(my $i = 0; $i <= 19; $i++) {
    $array[$i] = "0.00";
}

Or you might recall the very useful repetition operator:

my @array = ("0.00") x 20;

As they say, TMTOWTDI – but some ways are better than others.
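One thing worth remembering about the repetition operator: in list context it repeats the whole list, and repeating a reference repeats the same reference. A quick sketch (the values are illustrative):

```perl
use strict;
use warnings;

# In list context, x repeats the entire list, not just a scalar:
my @pair = ("A", "B") x 3;
print scalar(@pair), " elements\n";    # 6 elements: A B A B A B

# Caveat: repeating a reference copies the *same* reference,
# so every "row" of this matrix is one shared anonymous array:
my @matrix = ([0]) x 3;
$matrix[0][0] = 99;
print $matrix[2][0], "\n";             # 99, not 0
```

So `("0.00") x 20` is perfect for a flat array of initial values, but don't use it to build arrays of arrays.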

Missing links: using SwissProt IDtracker in your code

The BioPerl Bio::DB::SwissProt module lets you fetch sequences from SwissProt by ID or AC and store them as sequence objects:

use Bio::DB::SwissProt;
my $sp = Bio::DB::SwissProt->new('-servertype' => 'expasy', '-hostlocation' => 'australia');
my $seq = $sp->get_Seq_by_id('myod1_pig');

If you obtained SwissProt identifiers from a database that hasn’t been updated for some time, you may find that the ID or AC has since changed. For example, NLSdb lists the ID from the example above as “myod_pig”. In this case, BioPerl will throw an error like this:

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: id does not exist
STACK: Error::throw
STACK: Bio::Root::Root::throw /usr/local/share/perl/5.8.8/Bio/Root/Root.pm:359
STACK: Bio::DB::WebDBSeqI::get_Seq_by_id /usr/local/share/perl/5.8.8/Bio/DB/WebDBSeqI.pm:154
STACK: test.pl:3

SwissProt provides a web page named IDtracker to help you find new identifiers using old ones. Here’s how we can integrate the service into Perl.
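Before reaching for IDtracker, it helps to trap the exception so one stale ID doesn’t kill a whole batch run. BioPerl’s throw() ultimately calls die, so a plain eval block catches it. A minimal sketch of the pattern, using a die in place of the real get_Seq_by_id call so it runs without a network connection:

```perl
use strict;
use warnings;

# Stand-in for $sp->get_Seq_by_id($id); BioPerl's throw() dies,
# so eval {} catches it like any other die.
sub get_seq { die "MSG: id does not exist\n" }

my $seq = eval { get_seq('myod1_pig') };
if ($@) {
    warn "lookup failed: $@";
    # ...fall back to an IDtracker query here, mapping old ID to new...
} else {
    # use $seq as normal
}
```

With the failure trapped, the script can collect the dead identifiers and resolve them in a second pass.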
Read the rest…

When scripts fail, check databases

Note to self: how many biological databases provide their release notes as an RSS feed?

One of my database mining scripts failed today; on reading the SwissProt What’s New page, I discovered why:

Changes concerning keywords (KW line)
Modified keywords:
    * Phosphorylation -> Phosphoprotein

More notes:

  • Is keyword a good way to mine SwissProt records?
  • The MOD_RES line changed recently too; clearly these changes can happen anytime, anywhere in a record
  • SwissProt PTMs also have AC and ID; to be incorporated in all records eventually?
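On the first note: keywords live on the KW lines of a flatfile record, semicolon-delimited with a terminal period, which makes them easy to pull out but fragile when a term is renamed. A minimal parsing sketch, using a trimmed, illustrative record (the keywords shown are examples, not a real entry):

```perl
use strict;
use warnings;

# Extract keywords from the KW lines of a SwissProt flatfile record.
my @kw;
while (my $line = <DATA>) {
    if ($line =~ /^KW\s+(.+)$/) {
        my $text = $1;
        $text =~ s/\.\s*$//;          # strip the trailing period
        push @kw, split /;\s*/, $text;
    }
}
print join(",", @kw), "\n";

__DATA__
ID   EXAMPLE_PIG             Reviewed;         319 AA.
KW   Activator; Developmental protein; Differentiation; Myogenesis;
KW   Nucleus; Phosphoprotein.
```

A script that then greps @kw for “Phosphorylation” silently finds nothing once the keyword becomes “Phosphoprotein”, which is exactly the failure mode above.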


Update 18/8/08: the code in this post is corrupted with smilies…working on it

Bioinformaticians like tabular data; plain ASCII text delimited by tabs, commas or whatever. In the past, I’ve written an awful lot of scripts that begin something like this:

open IN, $file;
while( <IN> ) {
    my @line = split("\t", $_);
    ## count fields in a line and check for header
    if( $#line == 6 && $line[0] ne "ID" ) {
        ## do something with fields
    }
}
close IN;

Bad programmer! Why not use one of the many Perl modules for handling CSV files, such as Tie::Handle::CSV?
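With Tie::Handle::CSV, each line read from the tied filehandle comes back as a hash reference keyed by the header row, so the split() and header bookkeeping disappear. A sketch, assuming the module (and its sep_char option for tab-delimited files) is installed; the file and its contents are made up for illustration:

```perl
use strict;
use warnings;
use Tie::Handle::CSV;

# Write a small tab-delimited file to read back (illustrative data).
my $file = "/tmp/scores.tsv";
open my $out, '>', $file or die "can't write $file: $!";
print $out "ID\tscore\n";
print $out "P08153\t61.74\n";
close $out;

# Each <$fh> read returns a hash ref keyed by the header fields.
my $fh = Tie::Handle::CSV->new($file, header => 1, sep_char => "\t")
    or die "can't open $file";
while (my $row = <$fh>) {
    print $row->{'ID'}, " scored ", $row->{'score'}, "\n";
}
close $fh;
```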
Read the rest…

Handy Perl variables: @- and @+

Regular expression matching is one of the key features that makes Perl a widely used language in bioinformatics. For example, we can match a peptide to a longer protein sequence like this:

my $seq = "MKVLAMAGTAMSD";    # an example sequence to search
my $pep = "MAGTAM";

while($seq =~ /$pep/g) {
    print "Found a match\n";
}

Perl also has a bunch of “special variables” that are automatically assigned values when you perform certain operations. Two of these, @- and @+, contain indices representing the start and end of a regular expression match. Very handy when you want to know the matching positions in the longer sequence. Note that you need to add one to the start position, since string indices start at 0 whereas sequence coordinates start at 1; the end position needs no adjustment, because $+[0] already points one past the last matched character:

while($seq =~ /$pep/g) {
    print "Found a match from ".($-[0]+1)." to ".($+[0])."\n";
}

See Perl regular expressions tutorial for more.
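The same arrays also index capture groups: $-[1] and $+[1] bracket whatever the first set of parentheses matched, $-[2] and $+[2] the second, and so on. A small sketch (the sequence is illustrative):

```perl
use strict;
use warnings;

my $seq = "MKVLAMAGTAMSD";    # an example sequence

# $-[0]/$+[0] bracket the whole match; $-[1]/$+[1] bracket group 1.
if ($seq =~ /MA(GTA)M/) {
    printf "match %d-%d, group 1 (%s) at %d-%d\n",
        $-[0] + 1, $+[0], $1, $-[1] + 1, $+[1];
}
```

Running this prints `match 6-11, group 1 (GTA) at 8-10`, the same 1-based coordinates you’d use for a sequence feature.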

Where N is an arbitrarily large fraction approaching one

What’s N? It’s the fraction of time that bioinformaticians spend obtaining, formatting and getting raw data ready to use, as opposed to analysing it.

There’ll be a longer post on this topic soon. Suffice to say, I’ve spent the last month evaluating the performance of 5 predictive tools that are available on the web. To do this, a test dataset of 200 or so sequences had to be submitted to each one. Each tool generates a score for particular residues in the sequence. The final output, which is what I require to do some statistical analysis, looks something like this:

P08153  114     method     61.74   0
P08153  522     method     82.10   1

where we have a sequence UniProt accession number, a sequence position, the name of the tool used (method), a score and either 1 (a positive instance) or 0 (a negative instance).

Doesn’t look too hard, does it? Except that:

  • None of the web servers provide web services or APIs
  • None of them provide standalone software for download
  • Most of them don’t generate easily-parsed output (delimited plain text)
  • Most of them have limited batch upload and processing capabilities

The solution, as always, is to hack together several hundred lines of poorly-written Perl (using HTML::Form in my case) to send each sequence to the server, grab the HTML that comes back, parse it and write out text files in the format shown above.
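The HTML::Form end of that hack looks roughly like this. A minimal sketch against a made-up submission page (the example.org URL and the field names are placeholders; the request is built but not actually sent):

```perl
use strict;
use warnings;
use HTML::Form;

# A hypothetical server form; real servers differ in action and field names.
my $html = <<'HTML';
<form action="/predict" method="POST">
  <textarea name="sequence"></textarea>
  <input type="submit" name="go" value="Submit" />
</form>
HTML

my $form = HTML::Form->parse($html, 'http://predictor.example.org/');
$form->value('sequence', 'MAGTAMSKL');

# click() returns an HTTP::Request; hand it to LWP::UserAgent->request()
# and then parse the response HTML for the scores.
my $req = $form->click;
print $req->method, " ", $req->uri, "\n";
print $req->content, "\n";
```

Multiply by a few hundred sequences, five servers and five different output formats, and the weeks add up.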

That’s 3-4 weeks and 500 lines of throwaway code just to get the raw data in the right state for analysis.

When I started out in bioinformatics, I used to joke that at least 50% of my time was spent just obtaining raw data and formatting files. Over the years, I’ve revised my estimate. It’s currently at around 80-90% and I’m not sure that it’s still a joke.

Why is this trend in the wrong direction? When does it become untenable? I’m starting to think that my job title should be “data munger”, not “research officer”. I wouldn’t mind if data munging was perceived as a skill in academia but when funding is results-based, it will only ever be seen as the means to an end. Which it is, of course.