BioPerl, wordcount and dimers

In the dying days of my current job, I actually got to write some Perl and do a bit of analysis!
The problem: some colleagues have noticed a high frequency of a certain dinucleotide in a certain group of organisms. Can we provide the frequency of dinucleotides in a couple of our draft genomes?
We sure can, with a few considerations. I used Bio::SeqIO to read in a file of contig sequences and Bio::Factory::EMBOSS to submit each one to the EMBOSS program ‘wordcount’, with the word size set to 2. First problem: if the count for a dinucleotide is zero, wordcount outputs nothing for it. So you have to set up a hash where the keys are all possible dinucleotides with values = 0, then increment the counts as you go. Second problem: this is draft sequence and can contain ‘N’ characters. So on top of the 16 standard dinucleotides you have to allow for AN, CN, GN, NN, TN, NA, NC, NG and NT. Third problem: some of the FASTA headers contain a “|” symbol in the ID. The bash shell treats an unquoted “|” as a pipe, so it gets upset if you try to write to an output filename containing one; a nasty s/// hack replaces them with underscores.
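For anyone who wants to see the bookkeeping without BioPerl or EMBOSS installed, here is a minimal sketch in Python rather than Perl. It is not the script described above: it counts the words itself instead of calling wordcount, it assumes overlapping two-letter windows, and the function names (empty_counts, count_dinucleotides, safe_id) are made up for illustration.

```python
from itertools import product

# Four bases plus N, which turns up in draft sequence.
ALPHABET = "ACGTN"


def empty_counts():
    """Return a dict holding all 25 possible dinucleotides, each set to zero.

    Starting from a full table means words that never occur still show up
    in the output with a count of 0.
    """
    return {a + b: 0 for a, b in product(ALPHABET, repeat=2)}


def count_dinucleotides(seq):
    """Count overlapping two-letter words in seq (assumption: windows overlap).

    Words containing anything outside ACGTN are skipped rather than counted.
    """
    counts = empty_counts()
    seq = seq.upper()
    for i in range(len(seq) - 1):
        word = seq[i:i + 2]
        if word in counts:
            counts[word] += 1
    return counts


def safe_id(seq_id):
    """Replace the shell-unfriendly pipe character with an underscore."""
    return seq_id.replace("|", "_")
```

Calling count_dinucleotides on a contig gives back a table with all 25 keys present whether or not each word occurs, and safe_id gives an ID that is safe to use in an output filename.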
We loop through each sequence, run wordcount, parse the output, add up the counts for that sequence and output a nice plain text CSV file containing the sequence ID, the sequence length and the count for each of the 25 possible dinucleotides. Zip it up, send it off and hope they can import it into their spreadsheet at the other end.
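The overall loop can be sketched the same way. Again this is Python standing in for the Perl, with a hand-rolled FASTA reader in place of Bio::SeqIO and direct counting in place of wordcount. The column names ("id", "length") and the function names are assumptions for illustration, not what the real script uses.

```python
import csv
from itertools import product

# The 25 dinucleotide column names, in a fixed order: AA, AC, AG, AT, AN, CA, ...
WORDS = ["".join(pair) for pair in product("ACGTN", repeat=2)]


def read_fasta(handle):
    """Yield (id, sequence) pairs from a FASTA-formatted text handle.

    The ID is the first whitespace-delimited token after the '>'.
    """
    seq_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            fields = line[1:].split()
            seq_id = fields[0] if fields else ""
            chunks = []
        elif line and seq_id is not None:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def write_counts_csv(fasta_handle, out_handle):
    """Write one CSV row per sequence: ID, length, then 25 dinucleotide counts."""
    writer = csv.writer(out_handle, lineterminator="\n")
    writer.writerow(["id", "length"] + WORDS)
    for seq_id, seq in read_fasta(fasta_handle):
        seq = seq.upper()
        counts = dict.fromkeys(WORDS, 0)  # zero-filled so absent words still appear
        for i in range(len(seq) - 1):     # overlapping windows (assumption)
            word = seq[i:i + 2]
            if word in counts:
                counts[word] += 1
        writer.writerow([seq_id, len(seq)] + [counts[w] for w in WORDS])
```

Both arguments are ordinary text handles, so write_counts_csv can be pointed at open files or at io.StringIO objects for a quick check. Every row has the same 27 columns, which should keep the spreadsheet import painless.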
All in a day's work. Note to self: clean up the script and get it into CVS.