A nasty MOD_RES surprise

A lot of bioinformatics consists of fetching files in various formats from databases and writing parsers to extract features. What to do when one of your trusty parsers unexpectedly fails?

  1. Don’t panic
  2. Make sure that you haven’t done something silly:
    • did you inadvertently alter the code recently?
    • did you run a different version of the code by mistake?
    • did you use the correct file(s) as input?
    • does the machine that you’re using have the libraries and software that the parser requires?
  3. If your code and machine setup haven’t changed, then the culprit is most likely the input file

Take a look at the file: use something like grep, if possible, to examine specific lines and see whether their format has changed.
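If grep isn't to hand, the same inspection can be sketched in a few lines of Python. This is only a rough illustration: the function name feature_lines is made up, the ID and CHAIN lines in the sample are invented for the example, and it assumes that the feature key is the second whitespace-separated field on an FT line.

```python
def feature_lines(lines, key="MOD_RES"):
    """Return the FT lines whose feature key equals `key`."""
    matches = []
    for line in lines:
        fields = line.split()
        # Assumption: an FT line starts with "FT" followed by the feature key.
        if len(fields) >= 2 and fields[0] == "FT" and fields[1] == key:
            matches.append(line.rstrip("\n"))
    return matches


# Sample input: the MOD_RES line is the one quoted in this post,
# the other two lines are invented for illustration.
sample = [
    "ID   EXAMPLE_HUMAN\n",
    "FT   MOD_RES     353    353       Phosphoserine; by MAPK12 and MAPK9.\n",
    "FT   CHAIN         1    400       Example protein.\n",
]

for line in feature_lines(sample):
    print(line)
```

For a real file you would pass an open file handle instead of the sample list, then eyeball the first few lines that come back.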

One of my more robust Perl scripts examines the MOD_RES lines in the feature table section of a SwissProt file for protein kinase names. Imagine my surprise when, out of the blue, not a single name appeared in the ~50 000 line output file. A quick “grep MOD_RES file.dat | less” revealed that the format had changed, from the first line below to the second:

FT   MOD_RES     353    353       Phosphoserine (by MAPK12 and MAPK9)

FT   MOD_RES     353    353       Phosphoserine; by MAPK12 and MAPK9.

Might be time to fix up your regexes if you have code that parses SwissProt format.
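As a rough idea of what a more tolerant parser could look like, here is a minimal Python sketch (not my Perl script) that accepts both of the MOD_RES formats shown above. The names MOD_RES, BY_OLD, BY_NEW and kinase_names are made up for this example, and the patterns are based only on the two lines quoted in this post, so other variants of the description field may well need extra handling.

```python
import re

# FT line with a MOD_RES key, start and end positions, then the description.
MOD_RES = re.compile(r"^FT\s+MOD_RES\s+(\d+)\s+(\d+)\s+(.*)$")
# Old style: "Phosphoserine (by MAPK12 and MAPK9)"
BY_OLD = re.compile(r"\(by ([^)]+)\)")
# New style: "Phosphoserine; by MAPK12 and MAPK9."
BY_NEW = re.compile(r";\s*by ([^.;]+)")


def kinase_names(line):
    """Return the names listed after 'by' on a MOD_RES line, or []."""
    m = MOD_RES.match(line.rstrip())
    if not m:
        return []
    description = m.group(3)
    by = BY_OLD.search(description) or BY_NEW.search(description)
    if not by:
        return []
    # Names are separated by commas and/or the word "and".
    return [n.strip() for n in re.split(r",|\band\b", by.group(1)) if n.strip()]


old = "FT   MOD_RES     353    353       Phosphoserine (by MAPK12 and MAPK9)"
new = "FT   MOD_RES     353    353       Phosphoserine; by MAPK12 and MAPK9."
print(kinase_names(old))  # ['MAPK12', 'MAPK9']
print(kinase_names(new))  # ['MAPK12', 'MAPK9']
```

Whatever follows "by" is returned as-is, so anything that isn't a kinase name would still need filtering afterwards.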