Back in July, I was complaining about the latest abuse of the word “database” by biologists: the “PDF as database.”
This led to some very productive discussion using PubMed Commons and I’m happy to report that misidentified and contaminated cell lines are now included in the NCBI BioSample database.
As the news release notes, rather alarmingly:
This problem is so common it is thought that thousands of misleading and potentially erroneous papers have been published using cell lines that are incorrectly identified
So it would be useful if there were a direct link between the BioSample record for a cell line and PubMed records in which it was used…
“Take a look at the TP53 mutation database”, my colleague suggested. “OK then, I will”, I replied.
I present what follows as “a typical day in the life of a bioinformatician”.
I’ve had a half-formed but not very interesting blog post in my head for some months now. It’s about a conversation I had with a PhD student around 10 years ago, after she went to a bioinformatics talk titled “Excel is not a database”, and how she laughed because I’d been telling her that “for years already”. That’s basically the whole post, so as I say, not that interesting, except as an illustration that we’ve been talking about this stuff for a long time (and little has changed).
HEp-2 or not HEp-2?
Anyway, we have something better. I was exploring PubMed Commons, which is becoming a very good resource. The top-featured comment looks very interesting (see image, right).
Intrigued, I went to investigate the Database of Cross-contaminated or Misidentified Cell Lines, hovered over the download link and saw that it’s – wait for it – a PDF. I’ll say that again. The “database” is a PDF.
The sad thing is that this looks like very useful, interesting information which I’m sure would be used widely if presented in an appropriate (open) format and better-publicised. Please, biological science, stop embarrassing yourself. If you don’t know how to do data properly, talk to someone who does.
This post is an apology and an attempt to make amends for contributing to the decay of online bioinformatics resources. It’s also, I think, a nice example of why reproducible research can be difficult.
Come back in time with me 10 years, to 2004.
While we’re on the topic of mistaking Archaea for Bacteria, here’s an issue with the NCBI FTP site that has long annoyed me, along with one workaround. Warning: I threw this together minutes ago and it’s not fully tested.
Update, July 7 2014: NCBI have changed things, so the code in this post no longer works
File under: simple, but a useful reminder
UCSC Genome Bioinformatics is one of the go-to locations for genomic data. They are also kind enough to provide access to their MySQL database server:
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A
However, users are given fair warning to “avoid excessive or heavy queries that may impact the server performance.” It’s not clear what constitutes excessive or heavy but if you’re in any doubt, it’s easy to create your own databases locally. It’s also easy to create only the tables that you require, as and when you need them.
As an example, here’s how you could create only the ensGene table for the latest hg19 database. Here, USER and PASSWD represent a local MySQL user and password with full privileges:
# create database
mysql -u USER -pPASSWD -e 'create database hg19'
# obtain table schema
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/ensGene.sql
# create table
mysql -u USER -pPASSWD hg19 < ensGene.sql
# obtain and import table data
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/ensGene.txt.gz
gunzip ensGene.txt.gz
mysqlimport -u USER -pPASSWD --local hg19 ensGene.txt
It’s very easy to automate this kind of process using shell scripts. All you need to know is the base URL for the data, http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/ and that there are two files with the same prefix per table: one for the schema (*.sql) and one with the data (*.txt.gz).
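As a minimal sketch of that automation (USER and PASSWD are the same placeholders as above, and the function only *prints* the commands rather than executing them, so nothing runs by accident):

```shell
# print the commands needed to mirror one UCSC table locally
build_ucsc_table_cmds() {
  local db=$1 table=$2
  # one schema file (*.sql) and one data file (*.txt.gz) per table
  local base="http://hgdownload.cse.ucsc.edu/goldenPath/${db}/database"
  echo "wget ${base}/${table}.sql ${base}/${table}.txt.gz"
  echo "gunzip ${table}.txt.gz"
  echo "mysql -u USER -pPASSWD ${db} < ${table}.sql"
  echo "mysqlimport -u USER -pPASSWD --local ${db} ${table}.txt"
}

build_ucsc_table_cmds hg19 ensGene
```

Once you’re happy with the output, pipe it to `sh` (or drop the echoes) to fetch and load any table on demand.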
The Nature stable of journals. A byword for quality, integrity, impact. Witness this recent offering from Nature Biotechnology:
Bale, S. et al. (2011)
MutaDATABASE: a centralized and standardized DNA variation database.
Nature Biotechnology 29, 117–118
Unfortunately, although it describes an open, public database, the article itself costs $32 to read without subscription (update: it’s freely available as of one day after this post). Not to be deterred, I went to investigate MutaDATABASE itself.
The alarm bells began to ring right there on the index page (see screenshot, right).
Could that be right? I tried several browsers, in case of a rendering problem. Same result – no contents.
There seems to be something missing
Clicking on some of the links in the sidebar, I became more concerned. Here’s an example URL:
I recognise that form of URL – it comes from Joomla, a content management system. I’ve had servers compromised only twice in my career – both times, due to Joomla-based websites. Their security may have improved since, I guess – but this smacks of people looking to build a website quickly without investigating the alternatives.
It will be great. Promise.
Then, there are the spelling/grammatical errors, the “coming soons”, the “under constructions”, the news page not updated in almost 5 months. And as Tim Yates pointed out to me:
@neilfws The mutaDATABASE logo leads me to believe you are right about it being a joke.. is that someone dropping their sequences in a bin?
Who knows, MutaDATABASE may turn out to be terrific. Right now, though, it’s rather hard to tell. The database and web server issues of Nucleic Acids Research require that the tools described be functional for review and publication. Apparently, Nature Biotechnology does not.
GSE and GDS records in GEOmetadb by date
I was reading an old post that describes GEOmetadb, a downloadable database containing metadata from the GEO database. We had a brief discussion in the comments about the growth in GSE records (user-submitted) versus GDS records (curated datasets) over time. Below, some quick and dirty R code to examine the issue, using the Bioconductor GEOmetadb package and ggplot2. Left, the resulting image – click for larger version.
Is the curation effort keeping up with user submissions? A little difficult to say, since GEOmetadb curation seems to have its own issues: (1) why do GDS records stop in 2008? (2) why do GDS (curated) records begin earlier than GSE (submitted) records?
# load required packages (GEOmetadb supplies the SQLite interface)
library(GEOmetadb)
library(ggplot2)
# update database if required using getSQLiteFile()
# connect to database; assumed to be in user $HOME
con <- dbConnect(SQLite(), "~/GEOmetadb.sqlite")
# fetch "last updated" dates for GDS and GSE
gds <- dbGetQuery(con, "select update_date from gds")
gse <- dbGetQuery(con, "select last_update_date from gse")
# cumulative sums by date; no factor variables
gds.count <- as.data.frame(cumsum(table(gds)), stringsAsFactors = FALSE)
gse.count <- as.data.frame(cumsum(table(gse)), stringsAsFactors = FALSE)
# make GDS and GSE data frames comparable
colnames(gds.count) <- "count"
colnames(gse.count) <- "count"
# row names (dates) to real dates
gds.count$date <- as.POSIXct(rownames(gds.count))
gse.count$date <- as.POSIXct(rownames(gse.count))
# add type for plotting
gds.count$type <- "gds"
gse.count$type <- "gse"
# combine GDS and GSE data frames
gds.gse <- rbind(gds.count, gse.count)
# and plot records over time by type
png(filename = "geometadb.png", width = 800, height = 600)
print(ggplot(gds.gse, aes(date, count)) + geom_line(aes(color = type)))
dev.off()
# tidy up
dbDisconnect(con)
I’ve been experimenting with MongoDB’s map-reduce, called from Ruby, as a replacement for Ruby’s Enumerable methods (map/collect, inject). It’s faster. Much faster.
Next, the details but first – the disclaimers:
- My database is not optimised (using, e.g. indices)
- My non-map/reduce code is not optimised; I’m sure there are better ways to perform database queries and use Enumerable than those in this post
- My map/reduce code is not optimised – these are my first attempts
In short, nothing is optimised, my code is probably awful and I’m making it up as I go along. Here we go then!
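The details follow in the full post, but the shape of the experiment is easy to sketch: the counting that Ruby’s Enumerable methods would do client-side gets pushed into the database as a map function and a reduce function. A minimal, hypothetical version of the JavaScript handed to the mongo shell (the collection `features` and field `gene` are my inventions, not from the post):

```shell
# write the map-reduce functions to a file the mongo shell can run,
# e.g. mongo mydb mapreduce.js (database and collection names are hypothetical)
cat > mapreduce.js <<'EOF'
// map: emit (key, 1) for each document, keyed on the field of interest
var map = function() { emit(this.gene, 1); };
// reduce: sum the emitted counts for each key
var reduce = function(key, values) { return Array.sum(values); };
// inline: 1 returns results directly instead of writing a result collection
printjson(db.features.mapReduce(map, reduce, { out: { inline: 1 } }));
EOF
```

The server does the work in one pass, which is where the speed-up over a client-side `inject` loop comes from.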
During all the recent discussion around Neandertals and modern humans, it’s often pointed out that Homo sapiens is the sole extant representative of the genus Homo. I began to wonder “how unusual is this?” in a FriendFeed comment thread. What resources exist that could help us to answer this question?
Genera that contain only one species are termed monotypic. Wikipedia even has a category page for this topic but their lists are limited, since Wikipedia is not a comprehensive taxonomy resource.
Taxonomy is not my specialty but once in a while, I enjoy challenging myself with unfamiliar resources and data types. I figured initially that we could get some way towards an answer using BioSQL and the NCBI taxonomy database. As it turned out, I was completely wrong, but it was an interesting educational exercise. I turned instead to a “real” taxonomy resource, the Integrated Taxonomic Information System, or ITIS.
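The NCBI taxonomy idea is at least easy to sketch. The taxdump release links each tax_id to its parent and its rank in nodes.dmp, so genera with exactly one species child are the monotypic candidates. Here is a toy demonstration with made-up, simplified three-field lines (the real nodes.dmp has more fields and trailing delimiters, and, as noted above, this approach ultimately fails because NCBI only covers taxa with sequence data):

```shell
# toy, simplified nodes.dmp: tax_id | parent_tax_id | rank (tab-pipe-tab delimited)
printf '9606\t|\t9605\t|\tspecies\n'  > nodes.dmp   # Homo sapiens    -> Homo
printf '9598\t|\t9596\t|\tspecies\n' >> nodes.dmp   # Pan troglodytes -> Pan
printf '9597\t|\t9596\t|\tspecies\n' >> nodes.dmp   # Pan paniscus    -> Pan

# count species per parent genus; print parents with exactly one species
awk -F '\t[|]\t' '$3 == "species" { n[$2]++ }
                  END { for (g in n) if (n[g] == 1) print g }' nodes.dmp
# -> 9605 (the genus Homo)
```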