“Take a look at the TP53 mutation database”, my colleague suggested. “OK then, I will”, I replied.
I present what follows as “a typical day in the life of a bioinformatician”.
I’ve had a half-formed, but not very interesting blog post in my head for some months now. It’s about a conversation I had with a PhD student, around 10 years ago, after she went to a bioinformatics talk titled “Excel is not a database” and how she laughed as I’d been telling her that “for years already”. That’s basically the post so as I say, not that interesting, except as an illustration that we’ve been talking about this stuff for a long time (and little has changed).
Anyway, we have something better. I was exploring PubMed Commons, which is becoming a very good resource. The top-featured comment looks very interesting (see image, right).
Intrigued, I went to investigate the Database of Cross-contaminated or Misidentified Cell Lines, hovered over the download link and saw that it’s – wait for it – a PDF. I’ll say that again. The “database” is a PDF.
The sad thing is that this looks like very useful, interesting information which I’m sure would be used widely if presented in an appropriate (open) format and better-publicised. Please, biological science, stop embarrassing yourself. If you don’t know how to do data properly, talk to someone who does.
This post is an apology and an attempt to make amends for contributing to the decay of online bioinformatics resources. It’s also, I think, a nice example of why reproducible research can be difficult.
Come back in time with me 10 years, to 2004.
While we’re on the topic of mistaking Archaea for Bacteria, here’s an issue with the NCBI FTP site that has long annoyed me and one workaround. Warning: I threw this together minutes ago and it’s not fully tested.
Update July 7 2014: NCBI have changed things, so the code in this post no longer works.
File under: simple, but a useful reminder
UCSC Genome Bioinformatics is one of the go-to locations for genomic data. They are also kind enough to provide access to their MySQL database server:
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A
However, users are given fair warning to “avoid excessive or heavy queries that may impact the server performance.” It’s not clear what constitutes excessive or heavy but if you’re in any doubt, it’s easy to create your own databases locally. It’s also easy to create only the tables that you require, as and when you need them.
As an example, here’s how you could create only the ensGene table for the latest hg19 database. Here, USER and PASSWD represent a local MySQL user and password with full privileges:
# create database
mysql -u USER -pPASSWD -e 'create database hg19'
# obtain table schema
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/ensGene.sql
# create table
mysql -u USER -pPASSWD hg19 < ensGene.sql
# obtain and import table data
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/ensGene.txt.gz
gunzip ensGene.txt.gz
mysqlimport -u USER -pPASSWD --local hg19 ensGene.txt
It’s very easy to automate this kind of process using shell scripts. All you need to know is the base URL for the data, http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/ and that there are two files with the same prefix per table: one for the schema (*.sql) and one with the data (*.txt.gz).
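As a rough illustration of that automation, here is a minimal sketch rather than a tested pipeline. The function names `table_urls` and `load_table` are made up for this example, and USER and PASSWD are the same placeholders as before. It assumes `wget`, `gunzip` and the MySQL client tools are on your path and that the target database already exists.

```shell
#!/bin/sh
# Sketch only: the function names are hypothetical, USER/PASSWD are placeholders.
BASE_URL="http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database"

# Print the schema URL and the data URL for one table, one per line.
table_urls() {
    printf '%s/%s.sql\n' "$BASE_URL" "$1"
    printf '%s/%s.txt.gz\n' "$BASE_URL" "$1"
}

# Download, create and populate one table in an existing local database.
# Usage: load_table DATABASE TABLE
load_table() {
    db="$1"
    table="$2"
    # fetch both files for this table
    for url in $(table_urls "$table"); do
        wget -q "$url" || return 1
    done
    # create the table from its schema
    mysql -u USER -pPASSWD "$db" < "$table.sql" || return 1
    # unpack the data; mysqlimport takes the table name from the file name
    gunzip -f "$table.txt.gz" || return 1
    mysqlimport -u USER -pPASSWD --local "$db" "$table.txt"
}
```

After sourcing the file, `load_table hg19 ensGene` should repeat the manual steps shown earlier, and a loop over several table names would fetch only the tables you need.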
The Nature stable of journals. A byword for quality, integrity, impact. Witness this recent offering from Nature Biotechnology:
Bale, S. et al. (2011)
MutaDATABASE: a centralized and standardized DNA variation database.
Nature Biotechnology 29, 117–118
Unfortunately, although it describes an open, public database, the article itself costs $32 to read without a subscription (update: it’s freely available as of one day after this post). Not to be deterred, I went to investigate MutaDATABASE itself.
The alarm bells began to ring right there on the index page (see screenshot, right).
Could that be right? I tried several browsers, in case of a rendering problem. Same result – no contents.
Clicking on some of the links in the sidebar, I became more concerned. Here’s an example URL:
I recognise that form of URL – it comes from Joomla, a content management system. I’ve had servers compromised only twice in my career – both times, due to Joomla-based websites. Their security may have improved since, I guess – but this smacks of people looking to build a website quickly without investigating the alternatives.
Then, there are the spelling/grammatical errors, the “coming soons”, the “under constructions”, the news page not updated in almost 5 months. And as Tim Yates pointed out to me:
Who knows, MutaDATABASE may turn out to be terrific. Right now though, it’s rather hard to tell. The database and web server issues of Nucleic Acids Research require that the tools described be functional for review and publication. Apparently, Nature Biotechnology does not.
I was reading an old post that describes GEOmetadb, a downloadable database containing metadata from the GEO database. We had a brief discussion in the comments about the growth in GSE records (user-submitted) versus GDS records (curated datasets) over time. Below, some quick and dirty R code to examine the issue, using the Bioconductor GEOmetadb package and ggplot2. Left, the resulting image – click for larger version.
Is the curation effort keeping up with user submissions? A little difficult to say, since GEOmetadb curation seems to have its own issues: (1) why do GDS records stop in 2008? (2) why do GDS (curated) records begin earlier than GSE (submitted) records?
library(GEOmetadb)
library(ggplot2)

# update database if required using getSQLiteFile()
# connect to database; assumed to be in user $HOME
con <- dbConnect(SQLite(), "~/GEOmetadb.sqlite")

# fetch "last updated" dates for GDS and GSE
gds <- dbGetQuery(con, "select update_date from gds")
gse <- dbGetQuery(con, "select last_update_date from gse")

# cumulative sums by date; no factor variables
gds.count <- as.data.frame(cumsum(table(gds)), stringsAsFactors = FALSE)
gse.count <- as.data.frame(cumsum(table(gse)), stringsAsFactors = FALSE)

# make GDS and GSE data frames comparable
colnames(gds.count) <- "count"
colnames(gse.count) <- "count"

# row names (dates) to real dates
gds.count$date <- as.POSIXct(rownames(gds.count))
gse.count$date <- as.POSIXct(rownames(gse.count))

# add type for plotting
gds.count$type <- "gds"
gse.count$type <- "gse"

# combine GDS and GSE data frames
gds.gse <- rbind(gds.count, gse.count)

# and plot records over time by type
png(filename = "geometadb.png", width = 800, height = 600)
print(ggplot(gds.gse, aes(date, count)) + geom_line(aes(color = type)))
dev.off()
I’ve been experimenting with MongoDB’s map-reduce, called from Ruby, as a replacement for Ruby’s Enumerable methods (map/collect, inject). It’s faster. Much faster.
Next, the details but first – the disclaimers:
In short nothing is optimised, my code is probably awful and I’m making it up as I go along. Here we go then!
During all the recent discussion around Neandertals and modern humans, it’s often pointed out that Homo sapiens is the sole extant representative of the genus Homo. I began to wonder “how unusual is this?” in a FriendFeed comment thread. What resources exist that could help us to answer this question?
Genera that contain only one species are termed monotypic. Wikipedia even has a category page for this topic but their lists are limited, since Wikipedia is not a comprehensive taxonomy resource.
Taxonomy is not my specialty but once in a while, I enjoy challenging myself with unfamiliar resources and data types. I figured initially that we could get some way towards an answer using BioSQL and the NCBI taxonomy database. As it turned out I was completely wrong, but it was an interesting educational exercise. I turned instead to a “real” taxonomy resource, the Integrated Taxonomic Information System, or ITIS.
I no longer work on protein kinases but when I did, PhosphoGRID is the kind of database that I would have wanted to see. It features:
All it lacks is a RESTful API, but nothing is perfect :-)
Published in the little-known but often-useful journal Database:
PhosphoGRID: a database of experimentally verified in vivo protein phosphorylation sites from the budding yeast Saccharomyces cerevisiae.