#arseniclife: the genome

It’s about one year since the science story dubbed #arseniclife hit the headlines. November 30th saw the release of a draft genome sequence for Halomonas sp. GFAJ-1, the bacterium behind the furore.

As Iddo pointed out on Twitter, sequencing the DNA from GFAJ-1 is itself strong evidence against arsenate in the DNA backbone, since the sequencing chemistry would be highly unlikely to work in that case. However, if like me you think that a new microbial genome provides the most fun to be had in bioinformatics [*], you’ll be excited by the availability of the data.

In this post then: where to get it, some very preliminary analysis and some things that you might like to do with it. Projects for your students, perhaps.

[*] note to self: why, then, am I working on colorectal cancer?

How to: create a partial UCSC genome MySQL database

File under: simple, but a useful reminder

UCSC Genome Bioinformatics is one of the go-to locations for genomic data. They are also kind enough to provide access to their MySQL database server:

mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A

However, users are given fair warning to “avoid excessive or heavy queries that may impact the server performance.” It’s not clear what constitutes excessive or heavy but if you’re in any doubt, it’s easy to create your own databases locally. It’s also easy to create only the tables that you require, as and when you need them.
As an example, here’s how you could create only the ensGene table for the latest hg19 database. Here, USER and PASSWD represent a local MySQL user and password with full privileges:

# create database
mysql -u USER -pPASSWD -e 'create database hg19'
# obtain table schema
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/ensGene.sql
# create table
mysql -u USER -pPASSWD hg19 < ensGene.sql
# obtain and import table data
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/ensGene.txt.gz
gunzip ensGene.txt.gz
mysqlimport -u USER -pPASSWD --local hg19 ensGene.txt

It’s very easy to automate this kind of process using shell scripts. All you need to know is the base URL for the data, http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/ and that there are two files with the same prefix per table: one for the schema (*.sql) and one with the data (*.txt.gz).
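
The automation described above could be sketched as a pair of shell functions. This is only an illustration under stated assumptions: the function names table_urls and load_table and the environment variables MYSQL_USER and MYSQL_PASSWD are hypothetical, the local MySQL user is assumed to have full privileges, and wget, gunzip, mysql and mysqlimport are assumed to be on the PATH.

```shell
# Base URL for UCSC downloads; each assembly's "database" directory holds
# one <table>.sql (schema) and one <table>.txt.gz (data) file per table.
BASE_URL="http://hgdownload.cse.ucsc.edu/goldenPath"

# Print the schema URL and the data URL for one table, one per line.
# (Hypothetical helper, not part of any UCSC tooling.)
table_urls() {
    local db="$1" table="$2"
    printf '%s/%s/database/%s.sql\n'    "$BASE_URL" "$db" "$table"
    printf '%s/%s/database/%s.txt.gz\n' "$BASE_URL" "$db" "$table"
}

# Download one table and load it into a local database named after the assembly.
# MYSQL_USER and MYSQL_PASSWD are assumed to be set in the environment.
# Each step runs only if the previous one succeeded.
load_table() {
    local db="$1" table="$2"
    mysql -u "$MYSQL_USER" -p"$MYSQL_PASSWD" -e "create database if not exists $db" &&
    table_urls "$db" "$table" | wget -q -i - &&
    mysql -u "$MYSQL_USER" -p"$MYSQL_PASSWD" "$db" < "$table.sql" &&
    gunzip -f "$table.txt.gz" &&
    mysqlimport -u "$MYSQL_USER" -p"$MYSQL_PASSWD" --local "$db" "$table.txt"
}
```

With those variables exported, calling load_table hg19 ensGene would repeat the manual steps shown earlier, and a loop over a list of table names would fetch several tables in one go.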

Conservative (with a small “c”) research

This is really interesting. I’m reading it at work so I can’t tell you if it’s behind the paywall, but I sincerely hope not; it deserves to be read widely:

Edwards, A.M. et al. (2011)
Too many roads not taken.
Nature 470: 163–165

Most protein research focuses on those known before the human genome was mapped. Work on the slew discovered since, urge Aled M. Edwards and his colleagues.

The article includes some nicely-done bibliometric analysis. I’ve lifted a few quotes that illustrate some of the key points.

  • More than 75% of protein research still focuses on the 10% of proteins that were known before the genome was mapped
  • Around 65% of the 20,000 kinase papers published in 2009 focused on the 50 proteins that were the ‘hottest’ in the early 1990s
  • Similarly, 75% of the research activity on nuclear hormone receptors in 2009 focused on the 6 (of 48) receptors that were most studied in the mid 1990s
  • A common assumption is that previous research efforts have preferentially identified the most important proteins – the evidence doesn’t support this
  • Why the reluctance to work on the unknown? […] scientists are wont to “fondle their problems”
  • Funding and peer-review systems are risk-averse
  • The availability of chemical probes for a given receptor dictates the level of research interest in it; the development of these tools is not driven by the importance of the protein

I love the phrase “fondle their problems.”

I’ve long felt that academic research has increasingly little to do with “advancing knowledge” and is more concerned with churning out “more of the same” to consolidate individual careers. However, that’s just me being opinionated and anecdotal. What do you think?

23 and me – yes, me – part 2

Sample journey and arrival


Spitting across the Pacific

My tube of spit arrived at the lab on May 19. Six days door-to-door via Guangzhou, Anchorage and Memphis to LA.


23andMe raw data menu

On arrival, a confirmatory email: “The spit sample you recently submitted to 23andMe for the person listed above has been received by the laboratory and is now pending analysis; the process usually takes 6-8 weeks. You will receive another email notification from us as soon as the data for this sample are ready to be accessed through your 23andMe account.”

In the meantime, there’s plenty to explore at the 23andMe website. Anyone can create a demo account, which allows you to explore anonymous sample data to get a feel for what you’ll see when your own sample is processed. Naturally, I’m most excited by the options to browse and download raw data. You can also participate in around 20 health and genetics surveys which are a good way to kill time, although not many of them provide instant personal gratification.

Next update – some time in July.

23 and me – yes, me – part 1

Until recently, I was not even aware that there is a DNA day. Nor can I tell you exactly when and where I noticed that 23andMe, the personal genomics company, launched a sale to celebrate the day – I imagine it flashed by on Twitter or FriendFeed. I can tell you that like many others I decided that finally, I could justify the expense, signed up (with around 15 minutes to spare – thanks to the 17-hour Sydney/California time difference) and I’m now waiting for sample arrival and processing.

I thought it might be interesting to blog the experience and, provided that I don’t discover anything disturbing, I’ll share some of my data here. Related posts will be tagged with “23andme”; here is part 1, which covers sign-up, delivery, sample collection and return.

Did someone just admit that journal articles don’t communicate science effectively?

A brief article in the latest Journal of Proteome Research, entitled The Structural Genomics Consortium makes its presence known (ACS, subscription-only), begins with a summary of output from the SGC:

Researchers with the Structural Genomics Consortium (SGC) have been toiling away in their labs for ∼4 years now, solving and depositing hundreds of protein structures in public databases. To date, they have deposited 15% of the human protein structures solved so far and have published >100 papers on their findings. Yet, many scientists still don’t know what SGC does.

The next paragraph made me sit upright (my italics):

We haven’t spent a lot of time on communication because we wanted to spend the time on science and scientific publications, but we appreciate that if we want our scientific output to be used to its maximum, we need to let more people know what we’ve been up to.

I may be reading too much into that sentence – perhaps they define communication as outreach via media to a wider community, as opposed to publications which are aimed at a specialist audience. However, I’m tempted to see it as a subconscious confession that the traditional journal article is increasingly ineffective as a communication tool in our big-science, connected-science world.

Giant panda genome: mapped or sequenced?

I’m with Ogden Nash who said:

I love the baby giant panda,
I’d welcome one to my veranda

This week, I learned via Keith that Chinese scientists announced the completion of the giant panda genome. An impressive achievement, given that the project was announced in March this year, but what exactly has been completed? Has the genome been sequenced – that is, there are strings of A, C, G and T covering most chromosomes, or mapped – that is, the approximate chromosomal location of most genes determined? The media seem unsure.

And so on. Here’s a Google News search with more hits.

So what has been achieved – sequencing or mapping? If the former, is it really complete (I doubt this) or draft – and if draft, what kind of quality? And where are the data? Nothing in the genome project section of NCBI as yet.

Genomic analysis of Pseudoalteromonas tunicata

Some years ago, I provided advice and a little analysis for a group at UNSW studying marine bacteria. It’s nice to see that they remembered me:

Thomas, T., Evans, F.F., Schleheck, D., Mai-Prochnow, A., Burke, C., Penesyan, A., Dalisay, D.S., Stelzer-Braid, S., Saunders, N., Johnson, J., Ferriera, S., Kjelleberg, S. and Egan, S. (2008).
Analysis of the Pseudoalteromonas tunicata Genome Reveals Properties of a Surface-Associated Life Style in the Marine Environment.
PLoS ONE 3:e3252.

If correlating genomic features with microbial physiology is your thing, go and check it out. The article is open access, for your pleasure – as are five of my last six efforts, I just noticed.