Update 18/8/08: the code in this post is corrupted with smilies…working on it

Bioinformaticians like tabular data: plain ASCII text delimited by tabs, commas or whatever. In the past, I’ve written an awful lot of scripts that begin something like this:

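open my $in, '<', $file or die "Cannot open $file: $!";
while ( my $row = <$in> ) {
    chomp $row;
    ## -1 keeps trailing empty fields so the count stays honest
    my @line = split /\t/, $row, -1;
    ## count fields in a line and check for header
    if ( $#line == 6 && $line[0] ne "ID" ) {
        ## do something with fields
    }
}
close $in;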

Bad programmer! Why not use one of the many Perl modules for handling CSV files, such as Tie::Handle::CSV?

First, create your new object:

use Tie::Handle::CSV;
my $csv = Tie::Handle::CSV->new($file, header => 1, sep_char => "\t");

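Fairly self-explanatory: header => 1 means the first line holds the column names, and sep_char sets the delimiter (here, a tab). Now, loop through each record like so: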

while ( my $record = <$csv> ) {
    ## get the value of column ID
    print $record->{ID}, "\n";
    ## or do whatever with each field
    if ( $record->{ID} eq "YAL001C" ) {...}
}
close $csv;
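
Putting the constructor and the loop together, here is a minimal sketch of a complete script, assuming Tie::Handle::CSV is installed and the input is tab-delimited with a header line. The helper name count_by_column and the script name in the usage comment are made up for illustration; only the constructor call, the record loop and close come from the snippets above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Tie::Handle::CSV;

# Count how many rows carry each value of one named column.
# count_by_column is a hypothetical helper, not part of the module.
sub count_by_column {
    my ( $file, $column ) = @_;

    # Same constructor call as above: header line present, tab-delimited.
    my $csv = Tie::Handle::CSV->new( $file, header => 1, sep_char => "\t" );

    my %count;
    while ( my $record = <$csv> ) {
        # Each record is addressed by column name, not by position.
        $count{ $record->{$column} }++;
    }
    close $csv;
    return \%count;
}

# Usage (hypothetical script name): perl count_by_column.pl interpro.tsv
if (@ARGV) {
    my $count = count_by_column( $ARGV[0], 'METHOD' );
    print "$_\t$count->{$_}\n" for sort keys %$count;
}
```

Because fields are looked up by name, the script keeps working if columns are reordered or a new one is added, which is exactly where the hand-rolled split version tends to break.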

Here’s a sample delimited record (raw output from InterProScan with a custom header added):

SEQ     CRC     LEN     METHOD  ACC     ID      START   END     SCORE   T       DATE    IPR     IPRDESC
YAL001C E16ACB7893ED5365        1161    HMMPfam PF04182 B-block_TFIIIC  8       411     0       T       14-Sep-2007     IPR007309       B-block binding subunit of TFIIIC

And here’s the corresponding Tie::Handle::CSV object, displayed using Data::Dumper (“print Dumper $record”):

$VAR1 = bless( {
                 'SEQ' => 'YAL001C',
                 'CRC' => 'E16ACB7893ED5365',
                 'LEN' => '1161',
                 'METHOD' => 'HMMPfam',
                 'ACC' => 'PF04182',
                 'ID' => 'B-block_TFIIIC',
                 'START' => '8',
                 'END' => '411',
                 'SCORE' => '0',
                 'T' => 'T',
                 'DATE' => '14-Sep-2007',
                 'IPR' => 'IPR007309',
                 'IPRDESC' => 'B-block binding subunit of TFIIIC'
               }, 'Tie::Handle::CSV::Hash' );

One word of warning for files, like those from InterProScan, where fields can contain quotes:

YPR190C	9F739FA74F4A18BA	655	superfamily	SSF46785	"Winged helix" DNA-binding domain	53	163	9.3e-07	T	25-Sep-2007	

Parsing is handled by Text::CSV_XS, which has quite strict rules for handling quotes. If in doubt, or if your code is throwing errors, you can always remove the quotes from the file before parsing.
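
One way to remove the quotes is to write a cleaned copy of the file first. The following is a rough sketch using only core Perl; the helper names strip_quotes and write_clean_copy and the file names in the usage comment are invented for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Remove every double quote from one line of text.
# strip_quotes is a hypothetical helper, not part of any module.
sub strip_quotes {
    my ($line) = @_;
    $line =~ tr/"//d;
    return $line;
}

# Write a quote-free copy of $in_file to $out_file.
# Returns the number of lines written.
sub write_clean_copy {
    my ( $in_file, $out_file ) = @_;
    open my $in,  '<', $in_file  or die "Cannot open $in_file: $!";
    open my $out, '>', $out_file or die "Cannot open $out_file: $!";
    my $n = 0;
    while ( my $line = <$in> ) {
        print {$out} strip_quotes($line);
        $n++;
    }
    close $in;
    close $out or die "Cannot close $out_file: $!";
    return $n;
}

# Usage (hypothetical file names): perl strip_quotes.pl raw.tsv clean.tsv
write_clean_copy(@ARGV) if @ARGV == 2;
```

Note that this is lossy: a description such as "Winged helix" DNA-binding domain loses its quotes for good. If the quotes matter, the Text::CSV_XS documentation describes a quote_char attribute, and Tie::Handle::CSV documents a csv_parser option for supplying your own parser object; that route is not shown here, so check both sets of documentation before relying on it.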

4 thoughts on “Tie::Handle::CSV”

  1. I have a Perl template which has a timestamp and almost EXACTLY what you have written at the beginning of the post. Bad programmer!

  2. I am not sure if a problem as simple as this really justifies the pains of dealing with external modules. I usually find it far simpler to write Perl than to read it (even with my own scripts, not to mention other people’s code), and understanding what exactly a foreign module does, where the pitfalls are, etc., is often much more tedious than writing a few lines of code. And your script will run on any computer and won’t require other users to install a bunch of modules first. If you are like me (which I hope you are not), you are also happy that by writing the code yourself you can get away without this ugly object-oriented syntax.
    Of course, when it comes to more complex tasks, matters are different.
