RefSeq vs. GenBank

I’ve noticed occasional posts on the BioPerl lists regarding RefSeq, but have never paid much attention.  After all, RefSeq genbank (the file format) is the same as GenBank genbank (the file format), right?

As ever, the answer is “sort of but not quite”.  Here’s how a standard GenBank entry such as AY627381 might list the genes in an rRNA operon:

source
rRNA
misc_RNA
tRNA
rRNA

In this case, the primary tags correspond, in order, to the whole sequence (source), the 16S rRNA, the 16S-23S ITS (misc_RNA), a tRNA and the 23S rRNA.

Now, here’s an example of a similar region from a ‘newer style’ RefSeq record in genbank format, typically a genome record such as NC_000909:

gene
rRNA
gene
tRNA
gene
tRNA
gene
rRNA

Here we have a 23S, 2 tRNA genes and a 16S.  Spot the difference?  Each gene is now described by 2 features, and so 2 primary tags: gene/tRNA or gene/rRNA (and also gene/CDS for proteins).

What this means is: if I am reading a file, looking for features by primary tag and stepping N genes along, I need to step N features for old-style GenBank and N*2 features for new-style RefSeq.  If I don’t know in advance which type I’m dealing with (because I just grabbed a load of records), or even how consistent this is within a type, I have problems.
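One way around this might be to normalise the feature list before counting, so that stepping N genes means the same thing for both styles. Here is a minimal sketch in plain Python that works only on a list of primary tag strings. The function name collapse_gene_pairs is made up for illustration, and it assumes that a "gene" feature directly followed by a non-gene feature is just the wrapper for that feature. Real records can break that assumption (a gene followed by an unrelated feature, say), so real code should probably compare locations or locus tags as well.

```python
def collapse_gene_pairs(primary_tags):
    """Drop each 'gene' tag that is immediately followed by a non-gene tag.

    Assumption (not guaranteed by the format): such a 'gene' feature is the
    wrapper for the feature that follows it. A 'gene' with no following
    child feature is kept as-is.
    """
    collapsed = []
    for index, tag in enumerate(primary_tags):
        has_next = index + 1 < len(primary_tags)
        following = primary_tags[index + 1] if has_next else None
        if tag == "gene" and following not in (None, "gene"):
            # Skip the wrapper; the child feature (rRNA, tRNA, CDS...) is kept.
            continue
        collapsed.append(tag)
    return collapsed


# Primary tags as listed in the two examples above.
old_style = ["source", "rRNA", "misc_RNA", "tRNA", "rRNA"]
new_style = ["gene", "rRNA", "gene", "tRNA", "gene", "tRNA", "gene", "rRNA"]

old_collapsed = collapse_gene_pairs(old_style)  # unchanged: no 'gene' tags
new_collapsed = collapse_gene_pairs(new_style)  # one tag per gene
```

After collapsing, the new-style list has one entry per gene, so an offset of N features means N genes in either style. The old-style list passes through untouched, which means I don't need to know which type I grabbed.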

Just another example of the daily problems caused by inconsistent primary data formats.