This is the story of how an open source project and a science communication tool combined to save the day.
1. November 2nd 2010
I receive an email from colleagues at my previous workplace. They are trying to publish some proteomics data and the journal has stipulated that raw data and “annotated peptide mass fingerprint spectra” must be made available.
The data that they have come from a machine called a Voyager-DE STR MALDI-TOF mass spectrometer. They are binary files with the suffix “.dat”. No-one is quite sure what to do with them. To plot the spectra we need a file that contains, as a minimum, the intensity and m/z ratio for each peak. Oh, that we had simple CSV files, or at least something in plain ASCII text.
2. November 5th 2010
I get to work. Some web searching leads me to this wiki from the Seattle Proteome Center. It seems that the Voyager format is especially problematic. There is only one available conversion tool, named PyMsXML. It’s written in Python and whilst open source, requires the proprietary vendor libraries to work. Furthermore, it has not been updated since 2007.
A colleague sends me the required, Windows-only software, named Data Explorer. With a heavy heart and a slight pang of nausea, I start up my rarely-used Windows XP installation as a virtual machine in VirtualBox. Installation only takes a couple of attempts, with a break in-between to run a registry cleaner…
3. November 6-8th 2010
I follow the instructions at the PyMsXML website. First, I need to install ActivePython from ActiveState. The latest 2.x version is 2.7. However, later instructions refer to a menu option named “COM Make py”, which I don’t see. Ah – the instructions seem to be based on Python 2.4. Let’s grab that then. Oh wait – it’s only available as part of the $999 Business Package. Pay for the oldest version? That doesn’t make sense. Well, there’s a free version 2.5 here, let’s try that. OK, that does seem to have the required utility.
Next – open the Python editor, go to “COM Make py” and look for libraries named “ExploreDataObjects 1.0 Type Library (1.0)” and “IDAExplorer 1.0 Type Library (1.0)” – the latter is for .dat files. Neither of those exists. However, there is a library named “Data Explorer 4.2 Type Library (4.2)”. Hmm – I guess Data Explorer has probably moved on since 2007. Well, I install the interface for that library without incident. There are a couple of tests to run; the first fails but is for Analyst files (yet another format, same vendor), so I don’t care about that. The second appears to pass.
OK – I grab the PyMsXML installation and do a bit of editing to get the PATH right. Where’s $PATH on Windows again? Buried away somewhere really bloody obscure, that’s where. The moment of truth: I get my Voyager file, myfile.dat and run:
pymsxml -R voyager -o myfile.mzXML myfile.dat
And I see the error message:
Traceback (most recent call last):
  File "C:\bin\pymsxml.py", line 1796, in <module>
    x.write(debug=opts.debug)
  File "C:\bin\pymsxml.py", line 83, in write
    self.write_scans(tmpFile,debug)
  File "C:\bin\pymsxml.py", line 300, in write_scans
    for (s,d) in self.reader.spectra():
  File "C:\bin\pymsxml.py", line 1528, in spectra
    (tf,fixedMass) = doc.InstrumentSettings.GetSetting(self.delib.constants.dePreCursorIon,i-1,None)
AttributeError: class constants has no attribute 'dePreCursorIon'
I know just enough to realise that “dePreCursorIon” is probably a field in the older Data Explorer format that no longer exists, but I have no idea how to go about fixing it. So I muck around, briefly, with another tool named ProteoWizard, before realising that I’m completely wasting my time.
4. November 8th 2010 – late at night
I decide to ask for help at BioStar. It’s a rather technical issue, but I know that there are some great Python experts there and people who have worked with proteomic data. I spell out my rather long question as best I can and wait.
5. November 9th 2010
Next morning, I’ve received two very helpful answers. The first is from Brad Chapman, who I know as an excellent bioinformatician and Pythonista. He suggests a debug line to list the known data attributes, one of which I might be able to substitute for the obsolete “dePreCursorIon”.
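For anyone wondering what such a debug line might look like, here is a minimal sketch. It is not Brad's actual suggestion, which isn't reproduced here. It assumes that the constants object (self.delib.constants in pymsxml.py) is an ordinary Python class whose attributes show up in dir(), which is what the "class constants has no attribute" error message hints at. The Constants class below is a hypothetical stand-in so that the example runs anywhere, and the two "deExample..." names are invented purely for illustration.

```python
class Constants:
    """Hypothetical stand-in for the vendor constants class."""
    deExtractedIonMass = 1   # name taken from the post; the value is made up
    deExampleLaserPower = 2  # invented name, for illustration only
    deExampleShotCount = 3   # invented name, for illustration only


def list_constants(obj, contains=""):
    """Return the public attribute names of obj, sorted alphabetically.

    If `contains` is given, keep only names that include it
    (case-insensitive), which helps when hunting for a replacement
    for a constant that no longer exists.
    """
    needle = contains.lower()
    return sorted(
        name
        for name in dir(obj)
        if not name.startswith("_") and needle in name.lower()
    )


# Print every constant whose name mentions "ion".
for name in list_constants(Constants, "ion"):
    print(name)
```

In the real script the equivalent would presumably be a temporary print of dir(self.delib.constants) just above the failing line, removed again once a candidate name has been found.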
Following his advice, I see a constant that looks promising, named deExtractedIonMass. I edit the line at the end of the error traceback to read:
(tf,fixedMass) = doc.InstrumentSettings.GetSetting(self.delib.constants.deExtractedIonMass,i-1,None)
Fingers crossed, I run pymsxml once more and…it exits without error. What’s more, there’s an output file, myfile.mzXML, which looks to be an appropriate size and – when I copy it back to an Ubuntu directory and examine using less, because I just can’t stand to be in Windows any more – has the appearance of a valid mzXML file.
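Eyeballing the file in less is one check; a slightly stronger one is to parse it and decode a few peaks. The sketch below is a rough illustration, not part of PyMsXML. It assumes the usual mzXML layout: scan elements that each contain a peaks element, with the peak list stored as base64-encoded, big-endian ("network" order) floats in alternating m/z and intensity pairs, and no compression. The function names are my own, and files that use zlib compression or a different pair order would need extra handling.

```python
import base64
import struct
import xml.etree.ElementTree as ET


def local(tag):
    """Strip any XML namespace prefix: '{uri}scan' becomes 'scan'."""
    return tag.rsplit("}", 1)[-1]


def read_peaks(source):
    """Yield (scan number, [(m/z, intensity), ...]) for each scan.

    `source` is a filename or a binary file object. Assumes
    uncompressed, network-byte-order peak data (see the note above).
    """
    for _, elem in ET.iterparse(source, events=("end",)):
        if local(elem.tag) != "scan":
            continue
        # Look only at direct children, so nested scans are not mixed up.
        peaks = next((c for c in elem if local(c.tag) == "peaks"), None)
        if peaks is None:
            continue
        raw = base64.b64decode(peaks.text or "")
        # precision="64" means doubles; otherwise assume 32-bit floats.
        code = "d" if peaks.get("precision") == "64" else "f"
        count = len(raw) // struct.calcsize(">" + code)
        values = struct.unpack(">%d%s" % (count, code), raw[: count * struct.calcsize(">" + code)])
        # Even positions are m/z values, odd positions are intensities.
        yield elem.get("num"), list(zip(values[0::2], values[1::2]))
```

Looping over read_peaks("myfile.mzXML") and printing the first few pairs from each scan should give numbers that can be compared against what the vendor software shows. The same pairs are also the m/z and intensity columns needed for plotting.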
One last check. More web searching (for something as simple as a tool that will actually view spectra in mzXML format). Installation of something named Insilicos (whatever). Load original .dat file in Data Explorer, load mzXML file in Insilicos and compare. Result: [screenshots: the spectrum in Data Explorer and the same spectrum in Insilicos]
Rejoice. Have a cup of tea. Know that you will never be able to explain fully to your colleagues just how ridiculous all of this is. You’ll just send them a short email that begins: “I’ve converted the files to mzXML…”
I’m not quite home yet; there’s still the matter of plotting the spectra and caMassClass is throwing errors. However, that should be child’s play, compared with the week-long saga required to – get this – convert a file format.
And that is why I’m an open source, online science kind of guy.