I enjoyed this story from the OpenHelix blog today, describing a Microsoft Research project to mine DNA sequences from web pages and map them to UCSC genome builds.
Laura DeMare asks: what was the most-hit gene?
File under: simple, but a useful reminder
UCSC Genome Bioinformatics is one of the go-to locations for genomic data. They are also kind enough to provide access to their MySQL database server:
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A
However, users are given fair warning to “avoid excessive or heavy queries that may impact the server performance.” It’s not clear what constitutes excessive or heavy but if you’re in any doubt, it’s easy to create your own databases locally. It’s also easy to create only the tables that you require, as and when you need them.
As an example, here’s how you could create only the ensGene table for the latest hg19 database. Here, USER and PASSWD represent a local MySQL user and password with full privileges:
# create database mysql -u USER -pPASSWD -e 'create database hg19' # obtain table schema wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/ensGene.sql # create table mysql -u USER -pPASSWD hg19 < ensGene.sql # obtain and import table data wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/ensGene.txt.gz gunzip ensGene.txt.gz mysqlimport -u USER -pPASS --local hg19 ensGene.txt
It’s very easy to automate this kind of process using shell scripts. All you need to know is the base URL for the data, http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/ and that there are two files with the same prefix per table: one for the schema (*.sql) and one with the data (*.txt.gz).