How open source and BioStar saved a project

This is the story of how an open source project and a science communication tool combined to save the day.

1. November 2nd 2010
I receive an email from colleagues at my previous workplace. They are trying to publish some proteomics data and the journal has stipulated that raw data and “annotated peptide mass fingerprint spectra” must be made available.

The data that they have come from a machine called a Voyager-DE STR MALDI-TOF mass spectrometer. They are binary files with the suffix “.dat”. No-one is quite sure what to do with them. To plot the spectra we need a file that contains, as a minimum, the intensity and m/z ratio for each peak. Oh, that we had simple CSV files, or at least something in plain ASCII text.

2. November 5th 2010
I get to work. Some web searching leads me to this wiki from the Seattle Proteome Center. It seems that the Voyager format is especially problematic. There is only one available conversion tool, named PyMsXML. It’s written in Python and whilst open source, requires the proprietary vendor libraries to work. Furthermore, it has not been updated since 2007.

A colleague sends me the required, Windows-only software, named Data Explorer. With a heavy heart and a slight pang of nausea, I start up my rarely-used Windows XP installation as a virtual machine in VirtualBox. Installation only takes a couple of attempts, with a break in-between to run a registry cleaner…

3. November 6-8th 2010
I follow the instructions at the PyMsXML website. First, I need to install ActivePython from ActiveState. The latest 2.x version is 2.7. However, later instructions refer to a menu option named “COM Make py”, which I don’t see. Ah – the instructions seem to be based on Python 2.4. Let’s grab that then. Oh wait – it’s only available as part of the $999 Business Package. Pay for the oldest version? That doesn’t make sense. Well, there’s a free version 2.5 here, let’s try that. OK, that does seem to have the required utility.

Next – open the Python editor, go to “COM Make py” and look for libraries named “ExploreDataObjects 1.0 Type Library (1.0)” and “IDAExplorer 1.0 Type Library (1.0)” – the latter is for .dat files. Neither of those exists. However, there is a library named “Data Explorer 4.2 Type Library (4.2)”. Hmm – I guess Data Explorer has probably moved on since 2007. Well, I install the interface for that library without incident. There are a couple of tests to run; the first fails but is for Analyst files (yet another format, same vendor), so I don’t care about that. The second appears to pass.

OK – I grab the PyMsXML installation and do a bit of editing to get the PATH right. Where’s $PATH on Windows again? Buried away somewhere really bloody obscure, that’s where. The moment of truth: I get my Voyager file, myfile.dat and run:

pymsxml -R voyager -o myfile.mzXML myfile.dat

And I see the error message:

Traceback (most recent call last):
  File "C:\bin\pymsxml.py", line 1796, in <module>
    x.write(debug=opts.debug)
  File "C:\bin\pymsxml.py", line 83, in write
    self.write_scans(tmpFile,debug)
  File "C:\bin\pymsxml.py", line 300, in write_scans
    for (s,d) in self.reader.spectra():
  File "C:\bin\pymsxml.py", line 1528, in spectra
    (tf,fixedMass) = doc.InstrumentSettings.GetSetting(self.delib.constants.dePreCursorIon,i-1,None)
AttributeError: class constants has no attribute 'dePreCursorIon'

I know just enough to realise that “dePreCursorIon” is probably a field in the older Data Explorer format that no longer exists, but I have no idea how to go about fixing it. So I muck around, briefly, with another tool named ProteoWizard, before realising that I’m completely wasting my time.

4. November 8th 2010 – late at night
I decide to ask for help at BioStar. It’s a rather technical issue, but I know that there are some great Python experts there and people who have worked with proteomic data. I spell out my rather long question as best I can and wait.

5. November 9th 2010
Next morning, I’ve received two very helpful answers. The first is from Brad Chapman, who I know as an excellent bioinformatician and Pythonista. He suggests a debug line to list the known data attributes, one of which I might be able to substitute for the obsolete “dePreCursorIon”.
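The exact debug line isn't reproduced here, but the idea is plain Python introspection. Here is a minimal sketch, assuming the generated constants object behaves like an ordinary Python class. `FakeConstants`, its made-up values and the `list_constants` helper are hypothetical stand-ins, since the real `delib.constants` only exists with the vendor library installed.

```python
class FakeConstants:
    """Hypothetical stand-in for the vendor-generated constants class."""
    deExtractedIonMass = 1  # values are made up for illustration
    deLaserIntensity = 2    # hypothetical name
    dePolarity = 3          # hypothetical name


def list_constants(constants, contains=""):
    """Return sorted public attribute names, optionally filtered by substring."""
    needle = contains.lower()
    return sorted(
        name for name in dir(constants)
        if not name.startswith("_") and needle in name.lower()
    )


# Look for anything mass-related that might replace a missing constant.
candidates = list_constants(FakeConstants, contains="mass")
print(candidates)
```

In the real script the same call would be pointed at `self.delib.constants` just before the failing line.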

Following his advice, I see a constant that looks promising, named deExtractedIonMass. I edit the line at the end of the error traceback to read:

(tf,fixedMass) = doc.InstrumentSettings.GetSetting(self.delib.constants.deExtractedIonMass,i-1,None)

Fingers crossed, I run pymsxml once more and…it exits without error. What’s more, there’s an output file, myfile.mzXML, which looks to be an appropriate size and – when I copy it back to an Ubuntu directory and examine it using less, because I just can’t stand to be in Windows any more – has the appearance of a valid mzXML file.
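For anyone who would rather script that sanity check than eyeball the file in less, here is a minimal sketch using only the Python standard library. It checks that the file parses as XML, reports the root element and counts the `scan` elements. The `SAMPLE` document is made up for illustration (its namespace string may not match what pymsxml writes), and `summarize_mzxml` is a hypothetical helper name.

```python
import io
import xml.etree.ElementTree as ET

# Made-up, minimal mzXML-like document used only to exercise the helper.
SAMPLE = (
    '<mzXML xmlns="http://sashimi.sourceforge.net/schema_revision/mzXML_2.1">'
    '<msRun><scan num="1" peaksCount="0"/></msRun></mzXML>'
)


def local_name(tag):
    """Strip any '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def summarize_mzxml(source):
    """Parse a path or file object; return (root element name, number of scans)."""
    root = ET.parse(source).getroot()
    scans = [el for el in root.iter() if local_name(el.tag) == "scan"]
    return local_name(root.tag), len(scans)


print(summarize_mzxml(io.StringIO(SAMPLE)))
```

Passing a path such as "myfile.mzXML" instead of the in-memory sample would run the same check on the converted file. A parse error here means the file is not even well-formed XML.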

One last check. More web searching (for something as simple as a tool that will actually view spectra in mzXML format). Installation of something named Insilicos (whatever). Load original .dat file in Data Explorer, load mzXML file in Insilicos and compare. Result:

Spectrum, .dat file, Data Explorer

Spectrum, mzXML file, Insilicos

Rejoice. Have a cup of tea. Know that you will never be able to explain fully to your colleagues just how ridiculous all of this is. You’ll just send them a short email that begins: “I’ve converted the files to mzXML…”

I’m not quite home yet; there’s still the matter of plotting the spectra and caMassClass is throwing errors. However, that should be child’s play, compared with the week-long saga required to – get this – convert a file format.
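As a rough idea of what getting at the peaks involves: mzXML stores each scan's peak list as base64-encoded binary, so the m/z and intensity values have to be decoded before anything can be plotted. The sketch below is a Python illustration of that step, not the caMassClass route mentioned above. It assumes uncompressed data, network (big-endian) byte order and alternating m/z and intensity values, and it should be checked against the `precision` and related attributes in the actual file. `decode_peaks` is a hypothetical helper name and the peak values are made up.

```python
import base64
import struct


def decode_peaks(encoded, precision=32):
    """Decode an mzXML <peaks> payload into a list of (m/z, intensity) tuples.

    Assumes uncompressed values in network (big-endian) byte order,
    stored as alternating m/z and intensity.
    """
    raw = base64.b64decode(encoded)
    width, code = (4, "f") if precision == 32 else (8, "d")
    count = len(raw) // width
    values = struct.unpack(">%d%s" % (count, code), raw[:count * width])
    return list(zip(values[0::2], values[1::2]))


# Round trip on made-up peaks (values chosen to be exact as 32-bit floats).
peaks = [(1000.5, 250.0), (1500.25, 75.0)]
flat = [value for pair in peaks for value in pair]
encoded = base64.b64encode(struct.pack(">%df" % len(flat), *flat)).decode("ascii")
print(decode_peaks(encoded))
```

The decoded pairs are exactly the "intensity and m/z for each peak" that a plotting tool needs.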

And that is why I’m an open source, online science kind of guy.

8 thoughts on “How open source and BioStar saved a project”

  1. Really love the “story” behind this project Neil. Lots of drama but also a happy end. Sometimes I feel like one could write a small book about the odyssey of converting between different file formats.

    Additional “open-source points” for making the updated code available (GitHub?) in case other people also run into a similar predicament.

  2. Do you think it might have helped if you’d posted a sample data file and explanation here when you started? My first thought is that a Stack Exchange for “convert x to y”-type problems might be useful.

    • Generally, I don’t use this blog for questions. I turned to BioStar quite soon after “hitting the wall”.

      Agree, some sort of “conversion how-to” resource could be helpful.

  3. I used to work in proteomics about ten years ago and never could understand why every mass-spec had its own proprietary binary format. You couldn’t even count on machines by the same company to use the same format. Would it really be that hard to use text files?

    • I have never understood it either. I mean, you’ve already bought their machine and software, it’s not as though they have to lock you in to their system any further! Proteomics vendors are, I think, by far the worst offenders in bioinformatics when it comes to data formats.
