Category Archives: genomics

BLATting the internet: the most frequent gene?

I enjoyed this story from the OpenHelix blog today, describing a Microsoft Research project to mine DNA sequences from web pages and map them to UCSC genome builds.

Laura DeMare asks: what was the most-hit gene?

Using the Ensembl Variant Effect Predictor with your 23andme data

I subscribe to the Ensembl blog and found, in my feed reader this morning, a post which linked to the Variant Effect Predictor (VEP). The original blog post, strangely, has disappeared.

Not to worry. The VEP takes genotyping data in one of several formats, compares it with the Ensembl variation and core databases, and returns a summary of how the variants affect transcripts and regulatory regions. My first thought: can I apply this to my own 23andme data?

How to: bulk retrieval of archaeal genome sequences from the NCBI FTP site

While we’re on the topic of mistaking Archaea for Bacteria, here’s an issue with the NCBI FTP site that has long annoyed me and one workaround. Warning: I threw this together minutes ago and it’s not fully tested.

Update July 7 2014: NCBI have changed things, so the code in this post no longer works.

#arseniclife: the genome

It’s about one year since the science story dubbed #arseniclife hit the headlines. November 30th saw the release of a draft genome sequence for Halomonas sp. GFAJ-1, the bacterium behind the furore.

As Iddo pointed out on Twitter, sequencing the DNA from GFAJ-1 is itself strong evidence against arsenate in the DNA backbone, since the sequencing chemistry would be highly unlikely to work in that case. However, if like me you think that a new microbial genome provides the most fun to be had in bioinformatics [*], you’ll be excited by the availability of the data.

In this post then: where to get it, some very preliminary analysis and some things that you might like to do with it. Projects for your students, perhaps.

[*] note to self: why, then, am I working on colorectal cancer?

How to: create a partial UCSC genome MySQL database

File under: simple, but a useful reminder

UCSC Genome Bioinformatics is one of the go-to locations for genomic data. They are also kind enough to provide access to their MySQL database server:

mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A

However, users are given fair warning to “avoid excessive or heavy queries that may impact the server performance.” It’s not clear what constitutes excessive or heavy but if you’re in any doubt, it’s easy to create your own databases locally. It’s also easy to create only the tables that you require, as and when you need them.

As an example, here’s how you could create only the ensGene table for the latest hg19 database. Here, USER and PASSWD represent a local MySQL user and password with full privileges:

# create database
mysql -u USER -pPASSWD -e 'create database hg19'
# obtain table schema
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/ensGene.sql
# create table
mysql -u USER -pPASSWD hg19 < ensGene.sql
# obtain and import table data
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/ensGene.txt.gz
gunzip ensGene.txt.gz
mysqlimport -u USER -pPASSWD --local hg19 ensGene.txt

It’s very easy to automate this kind of process using shell scripts. All you need to know is the base URL for the data, http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/ and that there are two files with the same prefix per table: one for the schema (*.sql) and one with the data (*.txt.gz).
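
As a rough illustration of that automation, here is a minimal sketch rather than a tested tool. The base URL and the two-files-per-table convention come from the steps above. The function names, the DRY_RUN switch and the MYSQL_USER/MYSQL_PASSWD variables are my own assumptions for illustration. It assumes the local database already exists and that wget, gunzip and the MySQL client tools are installed.

```shell
# Sketch: fetch one UCSC table and load it into an existing local database.
# Assumed (not from UCSC): function names, DRY_RUN, MYSQL_USER, MYSQL_PASSWD.

BASE_URL="http://hgdownload.cse.ucsc.edu/goldenPath"
MYSQL_USER="${MYSQL_USER:-USER}"        # placeholder local MySQL user
MYSQL_PASSWD="${MYSQL_PASSWD:-PASSWD}"  # placeholder local MySQL password

# Print the schema URL, then the data URL, for a database and table.
ucsc_table_urls() {
    local db="$1" table="$2"
    echo "${BASE_URL}/${db}/database/${table}.sql"
    echo "${BASE_URL}/${db}/database/${table}.txt.gz"
}

# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Download both files, create the table, then import the data.
load_ucsc_table() {
    local db="$1" table="$2" url
    for url in $(ucsc_table_urls "$db" "$table"); do
        run wget -q "$url" || return 1
    done
    # "source" is a mysql client command; it reads the schema file.
    run mysql -u "$MYSQL_USER" -p"$MYSQL_PASSWD" "$db" -e "source ${table}.sql" || return 1
    run gunzip -f "${table}.txt.gz" || return 1
    run mysqlimport -u "$MYSQL_USER" -p"$MYSQL_PASSWD" --local "$db" "${table}.txt"
}
```

With these functions defined, setting DRY_RUN=1 and calling load_ucsc_table hg19 ensGene prints the five commands without running them, which is a cheap way to check the URLs first. Looping over several table names is then a one-line for loop.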

Conservative (with a small “c”) research

This is really interesting. I’m reading it at work so I can’t tell you if it’s behind the paywall, but I sincerely hope not; it deserves to be read widely:

Edwards, A.M. et al. (2011)
Too many roads not taken.
Nature 470: 163–165
doi:10.1038/470163a

Most protein research focuses on those known before the human genome was mapped. Work on the slew discovered since, urge Aled M. Edwards and his colleagues.

The article includes some nicely-done bibliometric analysis. I’ve lifted a few quotes that illustrate some of the key points.

  • More than 75% of protein research still focuses on the 10% of proteins that were known before the genome was mapped
  • Around 65% of the 20,000 kinase papers published in 2009 focused on the 50 proteins that were the ‘hottest’ in the early 1990s
  • Similarly, 75% of the research activity on nuclear hormone receptors in 2009 focused on the 6 (of 48) receptors that were most studied in the mid 1990s
  • A common assumption is that previous research efforts have preferentially identified the most important proteins – the evidence doesn’t support this
  • Why the reluctance to work on the unknown? [...] scientists are wont to “fondle their problems”
  • Funding and peer-review systems are risk-averse
  • The availability of chemical probes for a given receptor dictates the level of research interest in it; the development of these tools is not driven by the importance of the protein

I love the phrase “fondle their problems.”

I’ve long felt that academic research has increasingly little to do with “advancing knowledge” and is more concerned with churning out “more of the same” to consolidate individual careers. However, that’s just me being opinionated and anecdotal. What do you think?

23 and me – yes, me – part 2

Sample journey and arrival

Spitting across the Pacific

My tube of spit arrived at the lab on May 19. Six days door-to-door via Guangzhou, Anchorage and Memphis to LA.

23andMe raw data menu

On arrival, a confirmatory email: “The spit sample you recently submitted to 23andMe for the person listed above has been received by the laboratory and is now pending analysis; the process usually takes 6-8 weeks. You will receive another email notification from us as soon as the data for this sample are ready to be accessed through your 23andMe account.”

In the meantime, there’s plenty to explore at the 23andMe website. Anyone can create a demo account, which allows you to explore anonymous sample data to get a feel for what you’ll see when your own sample is processed. Naturally, I’m most excited by the options to browse and download raw data. You can also participate in around 20 health and genetics surveys which are a good way to kill time, although not many of them provide instant personal gratification.

Next update – some time in July.

23 and me – yes, me – part 1

Until recently, I was not even aware that there is a DNA day. Nor can I tell you exactly when and where I noticed that 23andMe, the personal genomics company, had launched a sale to celebrate the day; I imagine it flashed by on Twitter or FriendFeed. I can tell you that, like many others, I decided that I could finally justify the expense and signed up (with around 15 minutes to spare, thanks to the 17-hour Sydney/California time difference). I’m now waiting for sample arrival and processing.

I thought it might be interesting to blog the experience and, provided that I don’t discover anything disturbing (finding out that I’m actually a woman, for example), I’ll share some of my data here. Related posts will be tagged with “23andme”. Here is part 1, which covers sign-up, delivery, sample collection and return.