#arseniclife: the genome

It’s about one year since the science story dubbed #arseniclife hit the headlines. November 30th saw the release of a draft genome sequence for Halomonas sp. GFAJ-1, the bacterium behind the furore.

As Iddo pointed out on Twitter, sequencing the DNA from GFAJ-1 is itself strong evidence against arsenate in the DNA backbone, since the sequencing chemistry would be highly unlikely to work in that case. However, if like me you think that a new microbial genome provides the most fun to be had in bioinformatics [*], you’ll be excited by the availability of the data.

In this post then: where to get it, some very preliminary analysis and some things that you might like to to with it. Projects for your students, perhaps.

[*] note to self: why, then, am I working on colorectal cancer?

1. The data
Here’s your starting point: the NCBI Whole Genome Shotgun Sequencing Project page for GFAJ-1. It provides a nice summary of the data including: the sequencing technology used (Illumina GAIIx), coverage (170x), assembly software (Mira) and number of contigs (103).

Files for download: annotated contigs in Genbank format (wgs.AHBC.1.gbff.gz), the WGS “master” Genbank file (wgs.AHBC.mstr.gbff.gz – not really necessary) and contig sequences in fasta format (wgs.AHBC.1.fsa_nt.gz).

You can also fetch raw sequencing data from the Short Read Archive, but more about that a little later.

Be aware that this is a draft genome, annotated using an automated computational pipeline. This means that there will be errors in the sequence, the genes and the annotation. It also means plenty of potential improvements can be made by smart, enthusiastic bioinformaticians.

2. Arsenic metabolism
Of course your first instinct is to see whether any of the putative protein products have been annotated as having an “arsenic-related” function. Since this might involve arsenic, arsenate or arsenite, the simplest approach is:

grep arsen wgs.AHBC.1.gbff

And the result:

                     /product="arsenite-activated ATPase ArsA"
                     /note="COG1055 Na+/H+ antiporter NhaD and related arsenite
                     /product="arsenical-resistance protein"

Digging around in the file a little more reveals more information about these two putative products:

CDS             90659..91666
                     /note="COG0003 Oxyanion-translocating ATPase"
                     /product="arsenite-activated ATPase ArsA"
CDS             8835..9863
                     /note="COG0798 Arsenite efflux pump ACR3 and related
                     /product="arsenical-resistance protein"

So it seems we may have a system for pumping arsenite out of the cell. Not unexpected, for an organism known to have high arsenic tolerance. Presumably, if the form of arsenic in the environment is arsenate, there’s a mechanism for reducing it to arsenite (if these annotations are correct). Some interesting bioenergetic possibilities there.

By the way, these findings seem to be at odds with the statement in this article from USA Today that “none of them [the genes] look like identified arsenic resistance genes found in other bugs.”

If we were to postulate that this organism incorporated arsenate into the DNA backbone (I do not believe it does), how might we use the genome sequence data? I guess there are two main possibilities:

  1. The genes for nucleotide metabolism encode modified proteins which can catalyse reactions using arsenate-containing compounds instead of, or as well as, phosphates
  2. The genome contains novel genes for nucleotide metabolism that we might not even recognise as such

Exploring idea (1) a little further – nucleotide biosynthesis is a complicated business. See, for example, this KEGG pathway map for purine metabolism. I suppose in principle, it’s possible to retrieve the sequences for some of the key enzymes, search for GFAJ-1 orthologs using BLAST, then further investigate the GFAJ-1 protein sequences (domains? homology modelling?) to see if any of them are “unusual”. However, this entails all manner of unfounded assumptions about a hypothetical arsenic-based biochemistry: for example, the existence of analogous mono-, di- and tri-arsenate compounds. I wouldn’t go there myself.

3. Phylogeny and genome alignment
How “Halomonas-like” is Halomonas GFAJ-1 anyway? Let’s start by looking for 16S rRNA genes:

grep -P "^LOCUS|ribosomal RNA" wgs.AHBC.1.gbff

Like many bacteria, GFAJ-1 appears to have multiple rRNA operons (unless the assembly has failed to merge them). One of these is on contig AHBC01000086:

LOCUS       AHBC01000086            5885 bp    DNA     linear   BCT 02-DEC-2011                     /product="16S ribosomal RNA"                     
 /product="23S ribosomal RNA"                     
 /product="5S ribosomal RNA"

It’s easy enough to copy/paste the 16S rDNA sequence from the file, or you can use the handy bp_extract_feature_seq script which comes with Bioperl to grab all rDNA sequences. The 16S sequence can be identified by its length (not the shortest which is 5S, or the longest which is 23S):

bp_extract_feature_seq -i wgs.AHBC.1.gbff --format genbank --feature=rRNA -o rrna.fa


NCBI BLAST GFAJ-1 vs. Bacteria

What follows is emphatically not the correct approach to phylogenetic analysis but let’s run with it anyway: a quick BLAST versus bacteria at NCBI and the results as a tree view. Click image at right for larger version.

Halomonas appears to be a rather diverse genus but indeed, GFAJ-1 seems most similar to other Halomonas. Is there genomic information for any of the related organisms? A quick search suggests not; 7 organisms, only 2 of which have available genome sequence, neither of which feature in the top 100 BLAST hits for GFAJ-1 16S rDNA.

Regardless, it might be fun to try aligning the GFAJ-1 contigs with a reference genome, in this case from Halomonas elongata. We can save the chromosome sequence in Fasta format, install MUMmer and run:

nucmer -maxmatch -p nucmer helongata.fa wgs.AHBC.1.fsa_nt
show-coords -r -c -l nucmer.delta > nucmer.coords
mapview -n 1 -p mapview nucmer.coords
xfig mapview_0.fig
And sure, there are a few nicely-aligned segments (click image, right), but nothing spectacular in terms of a good reference genome which we could use to order large numbers of contigs.

Part of MUMmer alignment: GFAJ-1 contigs to H. elongata chromosome

4. Assembly quality and raw data
First, a quick look at the contig statistics, using grep and a little R:

grep ^LOCUS wgs.AHBC.1.gbff > locus.txt
loc <- read.table("locus.txt", header = F, stringsAsFactors = F)
# [1]    545   1385   9163  48993 233806
ggplot(loc) + geom_density(aes(V3), fill = "cornsilk") + theme_bw() + 
opts(title = "Density plot GFAJ-1 contig lengths")

Density plot of GFAJ-1 contig length

This tells us that many of the contigs are rather short: 50% are 9 163 bp or less (the median) and 12 contigs are 100 000 bp or more in length. The long contigs add up to around 1/3 of the total genome size, which seems like a good result.

Good, but might other assembly software perform better? Happily, you’re in a position to find out because the raw sequencing data can be found in the Short Read Archive (SRA). You can grab them in SRA format (danger! 7.5 GB!) like so:

wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR385/SRR385952/SRR385952.sra

You’ll then need to grab the SRA toolkit and dump the SRA to FASTQ format. Things become complex here because the toolkit documentation is not great and we need to figure out what (a) kind of reads we have and (b) how to restore them from the SRA file.

To cut a long story short, this line tells us that we have a paired-end library:

Layout: PAIRED, Orientation: , Nominal length: 150, Nominal Std Dev: 30

and so we need to dump that into 2 fastq files:

fastq-dump --split-files SRR385952.sra

but this has the unfortunate effect of generating identical fastq headers in both files:

head -4 SRR385952_*.fastq
==> SRR385952_1.fastq <==
@SRR385952.1 HWI-ST484_0123:8:1101:1154:2066 length=101
+SRR385952.1 HWI-ST484_0123:8:1101:1154:2066 length=101

==> SRR385952_2.fastq <==
@SRR385952.1 HWI-ST484_0123:8:1101:1154:2066 length=101
+SRR385952.1 HWI-ST484_0123:8:1101:1154:2066 length=101

and so, some sed is required to generate the more standard “/1 or /2” fastq header suffix:

sed -i -e 's/^@SRR385952\.[0-9]*/&\/1/' SRR385952_1.fastq
sed -i -e 's/^@SRR385952\.[0-9]*/&\/2/' SRR385952_2.fastq

After which, you can experiment with the de novo assembler of your choice. I’ll leave the appalling quality and documentation of NGS software in general for another post.

Enjoy – but don’t spend too much time on it. You won’t find the key to “arsenic life” in this genome.