BLATting the internet: the most frequent gene?

I enjoyed this story from the OpenHelix blog today, describing a Microsoft Research project to mine DNA sequences from web pages and map them to UCSC genome builds.

Laura DeMare asks: what was the most-hit gene?

Continue reading

How to: create a partial UCSC genome MySQL database

File under: simple, but a useful reminder

UCSC Genome Bioinformatics is one of the go-to locations for genomic data. They are also kind enough to provide access to their MySQL database server:

mysql --user=genome -A

However, users are given fair warning to “avoid excessive or heavy queries that may impact the server performance.” It’s not clear what constitutes excessive or heavy but if you’re in any doubt, it’s easy to create your own databases locally. It’s also easy to create only the tables that you require, as and when you need them.
As an example, here’s how you could create only the ensGene table for the latest hg19 database. Here, USER and PASSWD represent a local MySQL user and password with full privileges:

# create database
mysql -u USER -pPASSWD -e 'create database hg19'
# obtain table schema
# create table
mysql -u USER -pPASSWD hg19 < ensGene.sql
# obtain and import table data
gunzip ensGene.txt.gz
mysqlimport -u USER -pPASS --local hg19 ensGene.txt

It’s very easy to automate this kind of process using shell scripts. All you need to know is the base URL for the data, and that there are two files with the same prefix per table: one for the schema (*.sql) and one with the data (*.txt.gz).