On parsing

Parsing, the act of ripping through a file, pulling out the relevant parts and doing something useful with them, is an integral part of bioinformatics. It can be a dull procedure. It can also be challenging, requiring creativity and imagination. Frequently, as a bioinformatician, you will generate output from an unfamiliar program, or a colleague will bring you a file in a format that you haven’t encountered. Your task is to figure out how the file is structured, which regular expressions are required to parse it, what kind of output to produce and, most importantly, how to handle those rogue files which don’t obey the rules.

Here are my top ten (language-agnostic) parsing tips, focusing only on non-XML text files.

Search for an existing parser…

Most of us don’t have time to reinvent the wheel. First step: see if anyone else has solved the problem. The Bio* projects (bioperl, biopython, bioruby, biojava), for example, all include libraries to read and parse many common file formats: sequences, BLAST output and so on. There may be some effort involved in reading the documentation and learning to employ the methods, but it’s generally worth it.

…but don’t trust it!

On the other hand – don’t rely blindly on third-party code. You may find that there isn’t a method to do exactly what you want. Or there may be bugs, known or otherwise, in the library. Again, the solution is to know your library: search bug trackers, mailing lists and forums for known issues and submit any bugs that you find to the developers.

Obtain a small sample output file

Start with the smallest complete example output file that you can find. You don’t want to wait minutes/hours to test your code just because 1 GB of output is being read into memory.

Examine the file contents

Open up the file in an editor or even a pager and just have a good look at it. Where is the information that you want to extract? Is it in any way delimited: tabs, commas, spaces? Hint: on Linux, the command “hexdump -c filename | less” will show you delimiters and line endings. Is there anything unique about what you want to extract: all letters, all numbers, preceded by a colon? Can it occur zero, one or more than one times in the file? Start to jot down a few regular expressions.
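If hexdump isn’t to hand, the same first look can be sketched in a few lines of Python. This is only a rough illustration: the list of candidate delimiters and the function name describe_lines are my own assumptions, not part of any library.

```python
from collections import Counter

# Characters that commonly act as field separators (an assumed shortlist).
CANDIDATE_DELIMITERS = ["\t", ",", ";", " ", ":", "|"]


def describe_lines(raw: bytes, max_lines: int = 5):
    """Report the line ending and delimiter counts for the first few lines."""
    report = []
    lines = raw.splitlines(keepends=True)[:max_lines]
    for number, line in enumerate(lines, start=1):
        # Work on bytes so that Windows (CRLF) and Unix (LF) endings stay visible.
        if line.endswith(b"\r\n"):
            ending = "CRLF"
        elif line.endswith(b"\n"):
            ending = "LF"
        elif line.endswith(b"\r"):
            ending = "CR"
        else:
            ending = "none"
        text = line.decode("utf-8", errors="replace").rstrip("\r\n")
        counts = Counter(ch for ch in text if ch in CANDIDATE_DELIMITERS)
        report.append((number, ending, dict(counts)))
    return report


if __name__ == "__main__":
    sample = b"id\tlength\tmass\r\nP1\t120\t13.2\r\n"
    for number, ending, counts in describe_lines(sample):
        print(number, ending, counts)
```

For a real file, pass in the first few kilobytes read in binary mode, e.g. open(path, "rb").read(4096). A delimiter that turns up the same number of times on every line is a good candidate for the field separator.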

In the ideal case, your file format will have some kind of described schema. Example: the fasta sequence format, according to NCBI.

Visualise the desired output

Draw out a simple flowchart: inputs, processes, outputs, then figure out what your final output should look like. It may be something simple: “I want the sequence ID, sequence length, molecular mass and isoelectric point as comma-separated values”. Or more complex: you may want to pass the values on to a second procedure; generating an image, or updating a database table. In summary: figure out what you require from the input to get to the output.
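As a rough sketch of working backwards from the output, here is the simple case reduced to two columns: sequence ID and length as comma-separated values. Molecular mass and isoelectric point are left out because they need residue tables. The function name records_to_csv and the column headers are illustrative assumptions.

```python
import csv
import io


def records_to_csv(records):
    """Turn (sequence_id, sequence) pairs into CSV text with id and length columns."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["id", "length"])
    for seq_id, sequence in records:
        writer.writerow([seq_id, len(sequence)])
    return buffer.getvalue()


if __name__ == "__main__":
    print(records_to_csv([("seq1", "MKV"), ("seq2", "ACDEFG")]), end="")
```

Writing the output function first makes it obvious which fields the parser has to deliver: here, just an identifier and the sequence itself.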

Simplify the file where possible

It’s often the case that whilst a lot of a file consists of free, unstructured text, the relevant parts have some structure to them. An example would be the output from programs in the CCP4 package, which often contains space-delimited fields that describe protein chains, residues, atoms and interactions. Often, a file can be simplified using tools such as grep, awk and sed to extract the relevant lines. In fact, you may even find that these tools completely satisfy your parsing requirements.
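The same grep-then-split idea can be sketched in Python. The CONTACT line format below is made up for illustration and is not real CCP4 output; substitute whatever pattern marks the structured lines in your own file.

```python
import re


def extract_fields(lines, pattern=r"^CONTACT\s"):
    """Yield whitespace-split fields from lines matching pattern (grep, then split)."""
    regex = re.compile(pattern)
    for line in lines:
        if regex.search(line):
            yield line.split()


if __name__ == "__main__":
    # Hypothetical program output: free text mixed with structured lines.
    output = [
        "Program banner v1.0",
        "CONTACT A 12 GLY  B 45 SER 3.2",
        "",
        "Some free text that we do not care about",
        "CONTACT A 13 ALA  B 46 THR 2.9",
    ]
    for fields in extract_fields(output):
        print(fields)
```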

Design regular expressions

Goes without saying really, but mastery of regular expressions in your language of choice is a real time saver.
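As a small illustration in Python, named groups keep an expression readable and make the extracted values self-describing. The line format ("ID: …; length: …") is invented for this sketch, as are the names LINE_RE and parse_line.

```python
import re

# Anchored pattern with named groups for a hypothetical "ID: x; length: n" line.
LINE_RE = re.compile(r"^ID:\s*(?P<id>[^;\s]+);\s*length:\s*(?P<length>\d+)\s*$")


def parse_line(line):
    """Return (id, length) for a matching line, or None if the line does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    return match.group("id"), int(match.group("length"))


if __name__ == "__main__":
    print(parse_line("ID: seq1; length: 342"))
    print(parse_line("something else entirely"))
```

Anchoring the pattern at both ends means that a line with unexpected trailing content fails to match, rather than matching by accident.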

Imagine the exceptions

This is perhaps the most difficult step in the procedure. It’s tempting to assume that all output from a given program will look the same, but there are multiple reasons why it may not; not least of which is human intervention between file generation and passing it on to you. So look at your code and ask: what if? What if the header line in a fasta sequence did not start with “>”? What if an amino acid residue is not one of the 20 standard 1-letter abbreviations? What if these fields were not separated by a space? If your code looks for a pattern, it should also fail gracefully and informatively should that pattern not be found.
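Two of those “what if” questions can be turned into code. The minimal FASTA reader sketched below raises an error naming the source and line number when sequence data appears before any “>” header, or when a residue is not one of the 20 standard letters. That strictness is an assumption made for illustration: real files often contain legitimate extra symbols such as X or *, so you may prefer to warn rather than stop. The names FastaFormatError and parse_fasta are hypothetical.

```python
class FastaFormatError(ValueError):
    """Raised when a line does not fit the expected FASTA layout."""


STANDARD_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def parse_fasta(lines, source="<input>"):
    """Return a list of (header, sequence) pairs, failing informatively on bad input."""
    records = []
    header, chunks = None, []
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
            continue
        if header is None:
            raise FastaFormatError(
                f"{source}, line {number}: sequence data before any '>' header"
            )
        residues = line.upper()
        unknown = set(residues) - STANDARD_RESIDUES
        if unknown:
            raise FastaFormatError(
                f"{source}, line {number}: unexpected residue(s) {sorted(unknown)}"
            )
        chunks.append(residues)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


if __name__ == "__main__":
    print(parse_fasta([">seq1 example", "MKV", "LA"]))
    try:
        parse_fasta(["MKVLA"], source="broken.fa")
    except FastaFormatError as error:
        print(error)
```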

Here, I raise my hand to admit that this is an area in which I need to improve.

Test, then scale up

Get your code working on your small test file. Next, obtain more output files and try 10…then 50…then 100. Don’t get discouraged when your code (inevitably) breaks. Just make sure that you have a debugging procedure which, at the very least, prints the name of the rogue file and, if possible, the line number, so you can figure out quickly what’s gone wrong.

Assuming (in Linux) that you have a directory named input full of input files, you can feed the first 10 to your script like so:

for i in $(ls input | head -n 10); do my_script "input/$i"; done
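If the parser itself is written in Python, the same scaling-up loop can live inside a small test harness that keeps going after a failure and records which file broke. This is a sketch under assumptions: run_on_sample is a hypothetical helper, and parse_file stands for whatever parsing function you have written (ideally one that puts the line number in its error message).

```python
from pathlib import Path


def run_on_sample(directory, parse_file, limit=10):
    """Run parse_file on the first `limit` files (sorted by name) and collect failures."""
    failures = []
    paths = sorted(p for p in Path(directory).iterdir() if p.is_file())[:limit]
    for path in paths:
        try:
            parse_file(path)
        except Exception as error:
            # Record the rogue file and the parser's message instead of stopping.
            failures.append((path.name, str(error)))
    return len(paths), failures
```

Calling run_on_sample("input", my_parser, limit=10), then again with limit=50 and limit=100, gives the number of files tried plus a list of (file name, error message) pairs to work through.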

Make your parser reusable – by you and others

Parsing a particular file type is almost certainly something you’ll need to do again and again, so make the code reusable. This means packaging it up as a class, module or whatever is applicable to the language. Better still – make it publicly available on the web, so that others can start at step (1) and build upon your work.
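One possible shape for a reusable parser in Python is a small class that accepts any iterable of lines, so the same code works on an open file, on standard input or on a list of strings in a test. The “key = value” format and the class name KeyValueParser are made up for this sketch.

```python
class KeyValueParser:
    """Iterate over (key, value) pairs from lines of the form 'key = value'."""

    def __init__(self, lines, comment="#"):
        self.lines = lines
        self.comment = comment

    def __iter__(self):
        for number, raw in enumerate(self.lines, start=1):
            line = raw.strip()
            if not line or line.startswith(self.comment):
                continue  # skip blank lines and comments
            key, separator, value = line.partition("=")
            if not separator:
                raise ValueError(
                    f"line {number}: expected 'key = value', got {line!r}"
                )
            yield key.strip(), value.strip()


if __name__ == "__main__":
    example = ["# hypothetical output", "name = seq1", "length = 342"]
    for key, value in KeyValueParser(example):
        print(key, value, sep="\t")
```

Saved as a module, it can be imported by other scripts; because the class only needs something that yields lines, it never has to know where the data came from.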

8 thoughts on “On parsing”

Duncan

September 9, 2008 at 08:39

Hi Neil, interesting post. What about using a parser generator? E.g. write a grammar, let the parser generator do the hard work for you?
Andrew Clegg

September 9, 2008 at 20:10

“Here’s my top ten (language-agnostic) parsing tips, focusing only on unstructured (non-XML) text files.”

Err… Maybe this is needlessly pedantic, but most non-XML text files in use by bioinformatics software are very much structured. If you want to parse genuinely unstructured text files, you need a natural language parser. See here for some tips on picking one:

http://www.biomedcentral.com/1471-2105/8/24

Andrew.
nsaunders

September 9, 2008 at 20:14

Yeah, that is pretty pedantic :)
You’re right though, unstructured isn’t a synonym for non-XML. Amended accordingly.
dkoboldt

September 10, 2008 at 02:36

Neil, great post. I might add two tips that I’ve found useful.

First, expanding on #6, under Linux you can use the “cut” command to isolate certain columns from a data file. The syntax:
cut -d [delimiter] --fields=2,4,6,etc. [input] >[output]

Secondly, for quick command-line scripting with Perl, the perl -pe command will read through a file one line at a time and perform your perl operations of choice on that line before outputting it. For example, this command would convert a comma-separated file to a tab-delimited file:
perl -pe '$_ =~ s/\,/\t/g;' myFile.csv >myFile.tsv

See also “perl -pi -e” which allows you to manipulate files in-place.

Cheers,

Dan Koboldt
Massgenomics
http://www.massgenomics.org
James Casbon

September 11, 2008 at 02:12

Why the hell has no-one written an interactive parser generator? Just feed it the files and help it out by marking the regions of interest, field boundaries, etc. Like those image colouring apps where you mark out the regions of colour you expect.

This would be a godsend for the bioinformatics community.
max

September 11, 2008 at 08:00

why is no one starting a parser-repository (any language) for bioinformatics parsers? like a debian-repository. just a bunch of files with an index…
DWarner

September 18, 2008 at 02:22

I have large text files. ~16,000 lines.
They’re output each time I analyze my structures.
I’m a bridge engineer.
I’m trying to find a program to “surf the text” and grab what I want.
Sounds like grep or other #6 programs could be all I need.
Any Input?
nsaunders

September 18, 2008 at 08:55

DWarner – it depends mostly on what you want to match and whether it’s on one line. For a quick match to a line, grep is good. For a more complex match – e.g. match several patterns and print them out, you’d want awk or some other scripting language. If the data are delimited in some way (spaces, tabs, commas), tools to process columns e.g. cut/paste can be useful. Have a Google for those keywords and examples.