“Take a look at the TP53 mutation database”, my colleague suggested. “OK then, I will”, I replied.
I present what follows as “a typical day in the life of a bioinformatician”.
This post is an apology and an attempt to make amends for contributing to the decay of online bioinformatics resources. It’s also, I think, a nice example of why reproducible research can be difficult.
Come back in time with me 10 years, to 2004.
Over the years, I’ve written a lot of small “utility scripts”. You know the kind of thing. Little code snippets that facilitate research, rather than generate research results. For example: just what are the fields that you can use to qualify Entrez database searches?
Typically, they end up languishing in long-forgotten Dropbox directories. Sometimes, the output gets shared as a public link. No longer! As of today, “little code snippets that do (hopefully) useful things” have a new home at GitHub.
Also as of today: there’s not much there right now, just the aforementioned Entrez database code and output. I’m not out to change the world here, just to do a little better.
Just a brief technical note.
I figured that for a given compound in PubChem, it would be interesting to know whether that compound had been used in a high-throughput experiment, which you might find in GEO. Very easy using the E-utilities, as implemented in the R package rentrez:
library(rentrez)
links <- entrez_link(dbfrom = "pccompound", db = "gds", id = "62857")
length(links$pccompound_gds)
# [1] 741
Browsing the rentrez documentation, I note that db can take the value “all”. Sounds useful!
links <- entrez_link(dbfrom = "pccompound", db = "all", id = "62857")
length(links$pccompound_gds)
# [1] 0
That’s odd. In fact, this query does not even link pccompound to gds:
length(names(links))
# [1] 39
which(names(links) == "pccompound_gds")
# integer(0)
It’s not a rentrez issue, since the same result occurs using the E-utilities URL.
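For anyone who wants to check this outside of rentrez, here is a minimal sketch of how the two equivalent ELink URLs can be assembled in base R. The helper name elink_url is hypothetical (it is not part of rentrez), and the sketch assumes the parameter values need no URL encoding, which holds for the values used here. It only builds the URLs and does not contact NCBI.

```r
# Build an E-utilities ELink URL from its three query parameters.
# elink_url() is a hypothetical helper for illustration, not a rentrez function.
# Assumes the values contain no characters that need URL encoding.
elink_url <- function(dbfrom, db, id) {
  base <- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"
  params <- c(dbfrom = dbfrom, db = db, id = id)
  # Join each name/value pair with "=", then join the pairs with "&"
  query <- paste(names(params), params, sep = "=", collapse = "&")
  paste0(base, "?", query)
}

# The two queries compared above: a specific target database versus "all"
url_gds <- elink_url("pccompound", "gds", "62857")
url_all <- elink_url("pccompound", "all", "62857")

print(url_gds)
print(url_all)
```

Pasting either URL into a browser returns the raw XML, which makes it easy to see whether a pccompound_gds link set is present in the response.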
Next week I’ll be in Melbourne for one of my favourite meetings, the annual Computational and Simulation Sciences and eResearch Conference.
The main reason for my visit is the Bioinformatics FOAM workshop. Day 1 (March 27) is not advertised since it is an internal CSIRO day, but I’ll be presenting a talk titled “SQL, noSQL or no database at all? Are databases still a core skill?”. Day 2 (March 28) is open to all and I’ll be talking about “Learning from complete strangers: social networking for bioinformaticians”.
Hope to see some of you there.
I’m pleased to announce an open-access publication with my name on it:
Mitchell, S.M., Ross, J.P., Drew, H.R., Ho, T., Brown, G.S., Saunders, N.F.W., Duesing, K.R., Buckley, M.J., Dunne, R., Beetson, I., Rand, K.N., McEvoy, A., Thomas, M.L., Baker, R.T., Wattchow, D.A., Young, G.P., Lockett, T.J., Pedersen, S.K., LaPointe, L.C. and Molloy, P.L. (2014). A panel of genes methylated with high frequency in colorectal cancer. BMC Cancer 14:54.
So, I read the title:
and skimmed the abstract:
The scientific literature contains millions of microbial gene identifiers within the full text and tables, but these annotations rarely get incorporated into public sequence databases.
and thought, well OK, but wouldn’t it be better to incorporate annotations in the first place – when submitting to the public databases – rather than by this indirect method?
The point, of course, is to incorporate new findings from the literature into existing records, rather than to use the tool as a primary method of annotation. I do believe that public databases could do more to enforce data quality standards at deposition time, but that’s an entirely separate issue.
Big thanks to Michael Hoffman for a spirited Twitter discussion that put me straight.
I enjoyed this story from the OpenHelix blog today, describing a Microsoft Research project to mine DNA sequences from web pages and map them to UCSC genome builds.
Laura DeMare asks: what was the most-hit gene?
If you looked at that and thought “Hey, that’s a heat map!”, you are correct. That is a heat map. Let’s be quite clear about that. It’s a heat map.
So, how do the authors justify publishing a method for drawing heat maps and then calling them “quilt plots”?
Laboratory work, of the “wet” kind, not working out for you? Or perhaps you just need new challenges. Think you have some aptitude with data analysis, computers, mathematics, statistics? Maybe a switch to computational biology is what you need.
That’s the topic of the Nature Careers feature “Computing: Out of the hood”. With thoughts and advice from (on Twitter) @caseybergman, @sarahmhird, @kcranstn, @PavelTomancak, @ctitusbrown and myself.
I enjoyed talking with Roberta and she did a good job of capturing our thoughts for the article. One of these days, I might even write here about my own journey in more detail.