Parsing, the act of ripping through a file, pulling out the relevant parts and doing something useful with them, is an integral part of bioinformatics. It can be a dull procedure. It can also be challenging, requiring creativity and imagination. Frequently as a bioinformatician, you will generate output from an unfamiliar program, or a colleague will bring you a file that you haven’t encountered. Your task is to figure out how the file is structured, which regular expressions are required to parse it, what kind of output to produce and, most importantly, how to handle those rogue files which don’t obey the rules.
Here are my top ten (language-agnostic) parsing tips, focusing only on non-XML text files.
- Search for an existing parser…
- …but don’t trust it!
- Obtain a small sample output file
- Examine the file contents
- Visualise the desired output
- Simplify the file where possible
- Design regular expressions
- Imagine the exceptions
- Test, then scale up
- Make your parser reusable – by you and others
Most of us don’t have time to reinvent the wheel. First step: see if anyone else has solved the problem. The Bio* projects (BioPerl, Biopython, BioRuby, BioJava), for example, all include libraries to read and parse many common file formats: sequences, BLAST output and so on. There may be some effort involved in reading the documentation and learning to employ the methods, but it’s generally worth it.
On the other hand – don’t rely blindly on third-party code. You may find that there isn’t a method to do exactly what you want. Or there may be bugs, known or otherwise, in the library. Again, the solution is to know your library: search bug trackers, mailing lists and forums for known issues and submit any bugs that you find to the developers.
Start with the smallest complete example output file that you can find. You don’t want to wait minutes/hours to test your code just because 1 GB of output is being read into memory.
Open up the file in an editor or even a pager and just have a good look at it. Where is the information that you want to extract? Is it in any way delimited: tabs, commas, spaces? Hint: on Linux, the command “hexdump -c filename | less” will show you delimiters and line endings. Is there anything unique about what you want to extract: all letters, all numbers, preceded by a colon? Can it occur zero, one or more than one times in the file? Start to jot down a few regular expressions.
In the ideal case, your file format will have some kind of described schema. Example: the fasta sequence format, according to NCBI.
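If you’d rather count than eyeball, here is a minimal Python sketch of the same inspection. The function name `sniff` and the list of candidate delimiters are my own assumptions, not part of any library; it simply tallies line-ending styles and likely delimiters in a chunk of raw bytes.

```python
def sniff(raw: bytes, candidates=(b"\t", b",", b" ", b";", b":")):
    """Count line-ending styles and candidate delimiters in a bytes sample."""
    crlf = raw.count(b"\r\n")
    endings = {
        "CRLF": crlf,                    # Windows-style
        "LF": raw.count(b"\n") - crlf,   # Unix-style, excluding those in CRLF
        "CR": raw.count(b"\r") - crlf,   # bare carriage returns
    }
    delimiters = {c.decode(): raw.count(c) for c in candidates}
    return endings, delimiters


# Made-up sample; for a real file, something like open(path, "rb").read(4096)
sample = b"id\tlength\tmass\r\nP12345\t210\t23456.7\r\n"
endings, delimiters = sniff(sample)
print(endings)
print(delimiters)
```

A high tab count alongside CRLF endings, as in this sample, suggests a tab-delimited file that has passed through Windows at some point.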
Draw out a simple flowchart: inputs, processes, outputs, then figure out what your final output should look like. It may be something simple: “I want the sequence ID, sequence length, molecular mass and isoelectric point as comma-separated values”. Or more complex: you may want to pass the values on to a second procedure; generating an image, or updating a database table. In summary: figure out what you require from the input to get to the output.
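As a rough illustration of the simple case, the sketch below writes sequence ID and length as comma-separated values using Python’s standard `csv` module. The function name `records_to_csv` and the example sequences are invented for illustration; molecular mass and isoelectric point are left out because they need calculations that are beyond a short sketch.

```python
import csv
import io


def records_to_csv(records):
    """Render (sequence_id, sequence) pairs as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["id", "length"])
    for seq_id, sequence in records:
        writer.writerow([seq_id, len(sequence)])
    return buffer.getvalue()


# Invented example records: (ID, amino acid sequence)
example = [("seq1", "MKTAYIAKQR"), ("seq2", "MSE")]
print(records_to_csv(example), end="")
```

Deciding on the columns first, as here, tells you exactly which values the parser has to pull out of the input.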
It’s often the case that whilst a lot of a file consists of free, unstructured text, the relevant parts have some structure to them. An example would be the output from programs in the CCP4 package, which often contains space-delimited fields that describe protein chains, residues, atoms and interactions. Often, a file can be simplified using tools such as grep, awk and sed to extract the relevant lines. In fact, you may even find that these tools completely satisfy your parsing requirements.
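The same kind of filtering can be sketched in Python. Everything about the input here is hypothetical: the `CONTACT` keyword and the field layout are made up for the example and are not taken from any real CCP4 program.

```python
def extract_fields(lines, keyword):
    """Yield whitespace-split fields from lines whose first field is keyword."""
    for line in lines:
        fields = line.split()
        if fields and fields[0] == keyword:
            yield fields[1:]


# Hypothetical log: free text mixed with space-delimited records
log = """\
Program started, reading coordinates
CONTACT A 12 CA B 45 O
some free text that we do not need
CONTACT A 13 N B 47 OD1
Normal termination
""".splitlines()

for row in extract_fields(log, "CONTACT"):
    print(row)
```

This does roughly what `awk '$1 == "CONTACT"'` would do on the command line, which is a reminder that the command-line tools may already be enough.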
Goes without saying really, but mastery of regular expressions in your language of choice is a real time saver.
This is perhaps the most difficult step in the procedure. It’s tempting to assume that all output from a given program will look the same, but there are multiple reasons why it may not; not least of which is human intervention between file generation and passing it on to you. So look at your code and ask: what if? What if the header line in a fasta sequence did not start with “>”? What if an amino acid residue is not one of the 20 standard 1-letter abbreviations? What if these fields were not separated by a space? If your code looks for a pattern, it should also fail gracefully and informatively should that pattern not be found.
Here, I raise my hand to admit that this is an area in which I need to improve.
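To make the “what if?” questions concrete, here is a minimal sketch of a protein FASTA reader that checks two of them and complains with a line number instead of failing silently. The names `ParseError` and `parse_fasta` are my own, and rejecting anything outside the 20 standard residues is an assumption made for the example: real files may legitimately contain codes such as X or *, so you might prefer to warn rather than stop.

```python
import re

HEADER = re.compile(r"^>(\S+)")                       # ">" followed by an ID
RESIDUES = re.compile(r"^[ACDEFGHIKLMNPQRSTVWY]+$")   # 20 standard amino acids


class ParseError(ValueError):
    """Raised with a line number when the input breaks our assumptions."""


def parse_fasta(lines):
    """Return a list of (id, sequence) tuples from protein FASTA lines."""
    records = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            match = HEADER.match(line)
            if match is None:
                raise ParseError(f"line {number}: header has no ID: {line!r}")
            records.append([match.group(1), ""])
        elif not records:
            # Sequence data before any header: the ">" is missing
            raise ParseError(f"line {number}: expected '>' header, got {line!r}")
        elif RESIDUES.match(line.upper()) is None:
            raise ParseError(f"line {number}: non-standard residue in {line!r}")
        else:
            records[-1][1] += line.upper()
    return [tuple(record) for record in records]


print(parse_fasta([">seq1 example protein", "MKTAYI", "AKQR"]))
```

Feeding it a file without a header line, or a sequence containing an unexpected letter, raises a `ParseError` that names the offending line.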
Get your code working on your small test file. Next, obtain more output files and try 10…then 50…then 100. Don’t get discouraged when your code (inevitably) breaks. Just make sure that you have a debugging procedure which, at the very least, prints the name of the rogue file and, if possible, the line number, so you can figure out quickly what’s gone wrong.
Assuming (in Linux) that you have a directory named input full of input files, you can feed the first 10 to your script like so:
for i in $(ls input | head -n 10); do my_script "input/$i"; done
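If the batch run lives inside your own code, the “name of the rogue file plus line number” idea might look something like the sketch below. The function names, the `limit` argument and the convention that a per-line parser signals trouble by raising `ValueError` are all assumptions made for illustration.

```python
import sys
from pathlib import Path


def check_file(path, parse_line):
    """Apply parse_line to each line; return None or a 'file:line: message' string."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            try:
                parse_line(line)
            except ValueError as error:
                return f"{path}:{number}: {error}"
    return None


def run_batch(directory, parse_line, limit):
    """Check the first `limit` files (sorted by name) and collect failure reports."""
    paths = sorted(p for p in Path(directory).iterdir() if p.is_file())[:limit]
    failures = []
    for path in paths:
        report = check_file(path, parse_line)
        if report is not None:
            failures.append(report)
            print(report, file=sys.stderr)  # rogue file and line number
    return failures


# Hypothetical usage: every line of every file should be an integer
# failures = run_batch("input", lambda line: int(line), limit=10)
```

Raising `limit` from 10 to 50 to 100 follows the scaling-up routine described above, and the run carries on past a bad file rather than stopping at the first one.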
Parsing a particular file type is almost certainly something you’ll need to do again and again, so make the code reusable. This means packaging it up as a class, module or whatever is applicable to the language. Better still, make it publicly available on the web, so that others can start at step 1 and build upon your work.