On parsing

Parsing, the act of ripping through a file, pulling out the relevant parts and doing something useful with them, is an integral part of bioinformatics. It can be a dull procedure. It can also be challenging, requiring creativity and imagination. Frequently, as a bioinformatician, you will generate output from an unfamiliar program, or a colleague will bring you a file in a format that you haven’t encountered before. Your task is to figure out how the file is structured, which regular expressions are required to parse it, what kind of output to produce and, most importantly, how to handle those rogue files which don’t obey the rules.

Here’s my top ten (language-agnostic) parsing tips, focusing only on non-XML text files.

  1. Search for an existing parser…

    Most of us don’t have time to reinvent the wheel. First step: see if anyone else has solved the problem. The Bio* projects (BioPerl, Biopython, BioRuby, BioJava), for example, all include libraries to read and parse many common file formats: sequences, BLAST output and so on. There may be some effort involved in reading the documentation and learning to employ the methods, but it’s generally worth it.

  2. …but don’t trust it!

    On the other hand, don’t rely blindly on third-party code. You may find that there isn’t a method to do exactly what you want. Or there may be bugs, known or otherwise, in the library. Again, the solution is to know your library: search bug trackers, mailing lists and forums for known issues and submit any bugs that you find to the developers.

  3. Obtain a small sample output file

    Start with the smallest complete example output file that you can find. You don’t want to wait minutes or hours to test your code just because 1 GB of output is being read into memory.

  4. Examine the file contents

    Open up the file in an editor or even a pager and just have a good look at it. Where is the information that you want to extract? Is it in any way delimited: tabs, commas, spaces? Hint: on Linux, the command “hexdump -c filename | less” will show you delimiters and line endings. Is there anything unique about what you want to extract: all letters, all numbers, preceded by a colon? Can it occur zero, one or many times in the file? Start to jot down a few regular expressions.

    In the ideal case, your file format will have some kind of described schema. Example: the FASTA sequence format, according to NCBI.

  5. Visualise the desired output

    Draw out a simple flowchart: inputs, processes, outputs, then figure out what your final output should look like. It may be something simple: “I want the sequence ID, sequence length, molecular mass and isoelectric point as comma-separated values”. Or more complex: you may want to pass the values on to a second procedure, such as generating an image or updating a database table. In summary: figure out what you require from the input to get to the output.

  6. Simplify the file where possible

    It’s often the case that whilst a lot of a file consists of free, unstructured text, the relevant parts have some structure to them. An example would be the output from programs in the CCP4 package, which often contains space-delimited fields that describe protein chains, residues, atoms and interactions. Often, a file can be simplified using tools such as grep, awk and sed to extract the relevant lines. In fact, you may even find that these tools completely satisfy your parsing requirements.

  7. Design regular expressions

    Goes without saying really, but mastery of regular expressions in your language of choice is a real time-saver.

  8. Imagine the exceptions

    This is perhaps the most difficult step in the procedure. It’s tempting to assume that all output from a given program will look the same, but there are multiple reasons why it may not, not least of which is human intervention between the file being generated and being passed on to you. So look at your code and ask: what if? What if the header line in a FASTA sequence did not start with “>”? What if an amino acid residue is not one of the 20 standard 1-letter abbreviations? What if these fields were not separated by a space? If your code looks for a pattern, it should also fail gracefully and informatively should that pattern not be found.

    Here, I raise my hand to admit that this is an area in which I need to improve.

  9. Test, then scale up

    Get your code working on your small test file. Next, obtain more output files and try 10…then 50…then 100. Don’t get discouraged when your code (inevitably) breaks. Just make sure that you have a debugging procedure which, at the very least, prints the name of the rogue file and, if possible, the line number, so you can figure out quickly what’s gone wrong.

    Assuming (in Linux) that you have a directory named input full of input files, you can feed the first 10 to your script like so:

    ls input | head -n 10 | while IFS= read -r f; do my_script "input/$f"; done

  10. Make your parser reusable – by you and others

    Parsing a particular file type is almost certainly something you’ll need to do again and again, so make the code reusable. This means packaging it up as a class, module or whatever is applicable to the language. Better still, make it publicly available on the web, so that others can start at step 1 and build upon your work.
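
To make the “imagine the exceptions”, “test, then scale up” and “make your parser reusable” tips a little more concrete, here is a minimal sketch in Python. It is not taken from any existing library: the names parse_fasta and FastaFormatError are made up for illustration, and the check against the 20 standard one-letter residue codes is an assumption that would be wrong for nucleotide sequences or for files that use ambiguity codes. The point is only that a rogue line produces an error naming the file and the line number.

```python
import io


class FastaFormatError(ValueError):
    """Hypothetical error type: reports which file and line broke the rules."""

    def __init__(self, source, line_number, message):
        super().__init__(f"{source}, line {line_number}: {message}")
        self.source = source
        self.line_number = line_number


# Assumption: protein sequences restricted to the 20 standard one-letter codes.
STANDARD_RESIDUES = frozenset("ACDEFGHIKLMNPQRSTVWY")


def parse_fasta(handle, source="<input>"):
    """Yield (header, sequence) tuples from an open text handle.

    Raises FastaFormatError instead of silently skipping unexpected lines.
    """
    header, chunks = None, []
    for line_number, raw in enumerate(handle, start=1):
        line = raw.strip()
        if not line:
            continue  # blank lines are tolerated
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is None:
            raise FastaFormatError(
                source, line_number, "sequence data before any '>' header")
        else:
            unexpected = set(line.upper()) - STANDARD_RESIDUES
            if unexpected:
                raise FastaFormatError(
                    source, line_number,
                    f"non-standard residue(s): {''.join(sorted(unexpected))}")
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


# A well-behaved input (made-up sequences, held in memory for the example).
good = io.StringIO(">seq1 example\nMKV\nLAA\n>seq2\nGG\n")
records = list(parse_fasta(good, source="good.fa"))

# A rogue input: 'X' is not one of the 20 standard residues.
bad = io.StringIO(">seq1\nMKVX\n")
try:
    list(parse_fasta(bad, source="bad.fa"))
    error_message = None
except FastaFormatError as err:
    error_message = str(err)
```

Because the function takes any open handle and yields records one at a time, it can be imported from a module and pointed at a real file without reading the whole thing into memory.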

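The “design regular expressions” and “visualise the desired output” tips can be sketched in the same way. The example below assumes a pipe-delimited header of the form “>database|accession|entry_name description”; that layout, the accession ACC001 and the function names are all invented for illustration, so the pattern would need adjusting for whatever your files really contain. It pulls out named fields with one regular expression, writes the accession and sequence length as comma-separated values, and collects any headers that do not match rather than ignoring them.

```python
import csv
import io
import re

# Hypothetical header layout: ">database|accession|entry_name optional description"
HEADER_PATTERN = re.compile(
    r"^>(?P<database>\w+)\|(?P<accession>\w+)\|(?P<entry_name>\w+)"
    r"(?:\s+(?P<description>.*))?$"
)


def header_fields(header_line):
    """Return a dict of named fields, or None if the header does not match."""
    match = HEADER_PATTERN.match(header_line.strip())
    return match.groupdict() if match else None


def write_summary(records, out_handle):
    """Write 'accession,length' rows; return the headers that could not be parsed."""
    writer = csv.writer(out_handle, lineterminator="\n")
    writer.writerow(["accession", "length"])
    skipped = []
    for header_line, sequence in records:
        fields = header_fields(header_line)
        if fields is None:
            skipped.append(header_line)  # keep the evidence for debugging
            continue
        writer.writerow([fields["accession"], len(sequence)])
    return skipped


# Made-up records: one that follows the assumed layout, one that does not.
records = [
    (">sp|ACC001|EXAMPLE_ONE made-up description", "MKVLAA"),
    (">free text header with no pipes", "GG"),
]
out = io.StringIO()
skipped = write_summary(records, out)
```

Named groups keep the pattern readable when you return to it months later, and returning the unmatched headers gives you a list of rogue lines to inspect instead of a silently shorter output file.
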
8 thoughts on “On parsing”

  1. “Here’s my top ten (language-agnostic) parsing tips, focusing only on unstructured (non-XML) text files.”

    Err… Maybe this is needlessly pedantic, but most non-XML text files in use by bioinformatics software are very much structured. If you want to parse genuinely unstructured text files, you need a natural language parser. See here for some tips on picking one:

    http://www.biomedcentral.com/1471-2105/8/24

    Andrew.

  2. Neil, great post. I might add two tips that I’ve found useful.

    First, expanding on #6, under Linux you can use the “cut” command to isolate certain columns from a data file. The syntax:
    cut -d [delimiter] --fields=2,4,6,etc. [input] >[output]

    Secondly, for quick command-line scripting with Perl, the perl -pe command will read through a file one line at a time and perform your perl operations of choice on that line before outputting it. For example, this command would convert a comma-separated file to a tab-delimited file:
    perl -pe 's/,/\t/g' myFile.csv >myFile.tsv

    See also “perl -pi -e” which allows you to manipulate files in-place.

    Cheers,

    Dan Koboldt
    Massgenomics
    http://www.massgenomics.org

  3. Why the hell has no-one written an interactive parser generator? Just feed it the files and help it out by marking the regions of interest, field boundaries, etc. Like those image colouring apps where you mark out the regions of colour you expect.

    This would be a godsend for the bioinformatics community.

  4. Why is no one starting a parser repository (any language) for bioinformatics parsers? Like a Debian repository. Just a bunch of files with an index…

  5. I have large text files. ~16,000 lines.
    They’re output each time I analyze my structures.
    I’m a bridge engineer.
    I’m trying to find a program to “surf the text” and grab what I want.
    Sounds like grep or other #6 programs could be all I need.
    Any Input?

  6. DWarner – it depends mostly on what you want to match and whether it’s on one line. For a quick match to a line, grep is good. For a more complex match – e.g. match several patterns and print them out, you’d want awk or some other scripting language. If the data are delimited in some way (spaces, tabs, commas), tools to process columns e.g. cut/paste can be useful. Have a Google for those keywords and examples.

Comments are closed.