Genome annotation: who’s responsible?

There’s now an enormous number of genome projects – the Genomes Online Database lists 2 175 as of today, of which 429 are complete and published. Yet 10 years after the first completed genome, there are still no standards for storing and annotating genome data. What seems to have happened is that the major sequencing centres and databases have created their own pipelines for genome annotation. These centres are well-funded, possess large-scale computational infrastructure and are known and trusted by the community. Perhaps because of the sheer volume of data and the inability of small institutes to process it, we have come to rely on the output of large centres and assume that by and large, their data are “correct”. I thought I’d highlight a couple of examples which illustrate the danger of these assumptions.

No 23S in the Sargasso Sea?
Ribosomes, the cellular structures that translate mRNA to protein, are large complexes containing ribosomal RNA (rRNA) and proteins. Prokaryotic ribosomes contain 3 major rRNAs named 23S, 16S and 5S in order of size. In many but not all cases, the genes for rRNA occur in an operon, also in the order 23S-16S-5S. The sequence of rRNA genes tends to be very conserved, so any changes in the sequence are a good measure of divergence between species. For this reason, 16S rRNA sequence is widely used in phylogenetic studies of prokaryotes and is the basis for classifying Archaea in a domain separate to Bacteria.
You can also use 23S sequences for phylogeny, which is what a colleague of mine wanted to do. He was interested in a dataset of environmental sequences sampled from the Sargasso sea and was somewhat perplexed to find that there appeared to be no 23S rRNA sequences.
Off we go then, to explore GenBank. Knowing that the sequencing was done by the Institute for Biological Energy Alternatives, who have named their sequences “IBEA”, we do a simple search of the nucleotide database using “IBEA sargasso”. Up come 811 372 hits, all labelled as “IBEA_CTG_XXXXXXX”.


So far, so good. Refining our search to “IBEA sargasso 16S” gets us to 463 sequences. Let’s select one and take a closer look at the GenBank record – specifically the FEATURES section:

FEATURES             Location/Qualifiers
     source          1..912
                     /organism="environmental sequence"
                     /mol_type="genomic DNA"
                     /isolation_source="Sargasso Sea"
     misc_RNA        432..912
                     /product="16S ribosomal RNA"

Alrighty. This tells us that extracting 16S sequences from the dataset shouldn’t be too difficult – we can use something like Bioperl’s Bio::SeqIO module and look for product lines that contain “16S”. Strangely though, there are no “23S” lines. We’d expect to see them in this record because there are 431 bases upstream of the 16S gene, which is where 23S rRNA is often found. What’s going on?
The obvious next step is to take a “generic” 23S rRNA gene and BLAST it against the Sargasso sequences. Any old 23S sequence will do, since they’re quite conserved – we may as well use E. coli. Here’s a shortcut through NCBI.
From the main index page, choose Genomic Biology, then Genome, then Bacteria->Chromosome. Scroll down to Escherichia coli K12 and choose the link named NC_000913. This takes you to the organism genome project page. You’ll note that there are 157 structural RNAs and that the figure 157 is a link to them.
Click there and scroll down to the first 23S ribosomal RNA. The orange diamond to the right is a link to the sequence in fasta format. Copy that, head on back to the NCBI start page and navigate from BLAST to blastn (nucleotide v. nucleotide). Paste your 23S rRNA fasta sequence in the query box, choose “env_nt” as your database and hit the BLAST button.
We wait patiently for a while, then click “Format”. A new window opens and shows our BLAST hits. It’s obvious straight away that the Sargasso IBEA data do contain 23S sequences.


Scrolling down to the hits, we see that the first IBEA contig is IBEA_CTG_2158526 and the match spans positions 1763 – 2907 (just a small portion of this is shown below):


We’re almost at the end of the mystery missing 23S genes. Follow the GI link from that BLAST hit to the GenBank page for the contig. There in the FEATURES section we see a 16S rRNA gene:

     misc_RNA        5162..6316
                     /product="16S ribosomal RNA"

Yet upstream…no 23S. What do we see in the sequence region 1763 – 2907?

     misc_feature    1786..2013
                     hypothetical protein"
     misc_feature    1911..2345
                     hypothetical protein"
     CDS             complement(2329..2457)
                     /note="similar to gi|27363941"
     misc_feature    complement(2423..2545)
                     hypothetical protein"
     misc_feature    2592..2741
                     hypothetical protein"
     CDS             2656..2802
                     /note="similar to gi|25027444"
     misc_feature    2816..3274
                     hypothetical protein"

This is bad news. The region is annotated with several hypothetical proteins. Some have been translated to amino acid sequence and all are very short. Yet from our BLAST, we’re sure that this is the 23S rRNA gene. What’s happened here is that the annotators have translated the 23S gene in 6 frames during their search for protein sequences. As the 23S gene is quite long, by chance it does contain several in-frame translated regions that could be mistaken for proteins. But they aren’t. They’re junk.

Whose to blame here? In this case, I suggest the IBEA are at fault. These data are linked to a Science paper, the emphasis of which is on diversity (hence the focus on 16S) and novel genes/proteins. It seems that through this bias, they’ve (1) inadvertently “forgotten” about 23S and (2) used ORF finding methods based on BLAST which are of very questionable accuracy. They’re a large, well-funded outfit with plenty of computer power and could have done a lot better.

The only discussion of data quality for the Sargasso dataset that I’m aware of is in a BMC Bioinformatics article entitled An analysis of the Sargasso Sea resource and the consequences for database composition. I commented on this at the time but as is the nature of the comments system at the open-access journals, it hardly provoked a raging debate.

Too many pseudogenes
Onto our second example. I was involved with the genome sequencing of Methanococcoides burtonii, an archaeon isolated from an Antarctic lake. The genome consists of one circular chromosome, 2 575 032 bp in length. Sequence is now publicly available and annotated at the NCBI.
A few months ago, a collaborator contacted us with a problem. He had downloaded a fasta file of protein-coding genes for this organism and was unable to find a particular protein, even though it had been identified in earlier draft genome sequence and is known to be expressed from proteomics experiments. What was going on?
Once again, we turn to BLAST as a starting point. From the NCBI genome page, we can navigate to a specific BLAST page for this genome. The protein that we’re interested in is named monomethylamine methyltransferase. We can grab an example sequence in fasta format and run blastp versus M. burtonii proteins. We get a weak hit (E = 0.045) to trimethylamine methytransferase, but nothing else significant.

Well that’s odd, because we know that this protein exists. The next question – does the genome contain the DNA sequence that encodes the protein. Let’s BLAST again but this time we do protein query versus DNA – tblastn.


We get two good hits with of 82% and 81% identity. Looks like we do have genes for our protein. Why isn’t the protein in the protein list?
Our BLAST report gives us chromosomal coordinates – for the first hit, they’re from 883697 – 885070 in frame -3. We can load up the file NC_007955.gbk (large file – save it to disk) and take a look at that region. Here’s what we see:

     gene            complement(883694..885070)

Aha. NCBI have annotated our gene as a pseudogene. However, we know that it’s expressed – so why have they done this? The answer is that it’s a rather unusual gene containing an in-frame stop codon (UAG), which encodes a special amino acid named pyrrolysine. This becomes apparent if we view the genome using a tool such as Artemis or the Generic Genome Browser. Note the red stop codon marker in gene Mbur_0839, frame -3:


So why is the gene annotated incorrectly as a pseudogene? Partly because the NCBI don’t have access to experimental evidence, partly because their annotation pipeline is rather aggressive when it comes to in-frame stop codons – which are normally good indicators for a pseudogene. It’s worth noting that the Integrated Microbial Genomes used by Joint Genome Institute also annotates monomethylamine methyltransferase as a pseudogene. However, their web interface makes it somewhat easier to get to the gene of interest and see it in context.

So who’s at fault here? I would argue noone. You might suggest that computational pipelines for gene annotation could be improved but these are high-throughput methods suited to hundreds of genomes – we can’t expect them to account for every variation in every genome. The misannotation could be avoided if the people with the experimental evidence supplied it to the annotators – but there are few mechanisms which allow for that. The experimentalists and sequencers could work on a more highly-curated annotation themselves before submission – but that’s very time-consuming for them and they lack the expertise and hardware.

In summary

  • Genome annotation by individual groups is on the fade. We tend to rely on large sequencing centres and institutes because they have the expertise and the hardware.
  • We also tend to assume that the data generated are “correct”. This is often an incorrect assumption.
  • Once again, the underlying problem here is inadequate data standards. Life would be much easier if (1) a single standard genome annotation pipeline was adopted, (2) standard formats for data exchange were used and (3) mechanisms existed for researchers to incorporate curated data (such as experimental evidence) into pipelines. 10 years into genome sequencing, these things still do not exist.

3 thoughts on “Genome annotation: who’s responsible?

  1. Hey Neil, Your examples are very good illustrations of annotation errors… Do you come accross these often?
    Biologists tend to use annotations as gold standards to guide their research… these major annotation groups need to pay closer attention to their annotation. People say one genome group hired undergraduates for a summer project and used their Blast skills to annotate genomes and now we are propagating that annotation using yet more Blast…
    any insights on the percentage use of unusual aminoacids that could account for frame-call errors?

  2. Do you come accross these often?
    Oh yes. On the whole though, I’d say large centres with annotation pipelines (NCBI, JGI) do a pretty good job. And really they have to, because there’s too much data for smaller groups to handle. That’s why we’ve come to rely on the large data centres. There are 2 main points. First, always be critical of third-party data. Second, there’d be less of a problem if there were agreed standards for annotation pipelines. I could easily write my own annotation pipeline, but I don’t have the hardware required to deal with multiple genomes, plus what would be the point of yet another non-standard pipeline to add to the mix? So we rely on what’s out there.

    Couldn’t give you an exact figure for frame-call problems. Pyrrolysine is quite rare, some genomes in RefSeq are annotated correctly for selenocysteine. GenBank has adopted the characters “O” and “U” for these amino acids. The problem is that a lot of commonly-used software won’t yet recognise these non-standard characters and most ORF-calling software doesn’t account for coding stop codons. Again, there’s no reason why coding stop codons couldn’t be handled by an annotation pipeline if someone put their mind to it and a standard was adopted.

  3. I had a similar situation with a Mycobacterium tuberculosis gene (Rv1379) when extracting intergenic regions. It is close to a gene (Rv1378c) on the opposite strand; both genes are divergently expressed.

    This time, the other gene was misannotated to incorporate the largest ORF, which happened to overlap with my gene (and hence gobbled up all the intergenic region). If they had compared the predicted ORF to its closest homologues in other bacterial chromosomes they would have found the (likely) correct start codon. The error was reported but hasn’t been corrected yet.

Comments are closed.