Create your own gene IDs! No wait. Don’t.

Here’s a new way to abuse biological information: take a list of gene IDs and use them to create a completely fictitious but very convincing set of microarray probeset IDs.

This one begins with a question at BioStars, concerning the conversion of Affymetrix probeset IDs to gene names. Being a “convert ID X to ID Y” question, the obvious answer is “try BioMart” and indeed the microarray platform ([MoGene-1_0-st] Affymetrix Mouse Gene 1.0 ST) is available in the Ensembl database.

However, things get weird when we examine some example probeset IDs: 73649_at, 17921_at, 18174_at. One of the answers to the question notes that these do not map to mouse.
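
A quick way to eyeball IDs like these is to split each one into its numeric stem and its suffix. The following is only a minimal sketch: split_probeset_id is a hypothetical helper name, and the idea that the stems might be gene IDs is an assumption taken from the opening paragraph, not something the code checks.

```ruby
# Example probeset IDs quoted in the post
ids = ["73649_at", "17921_at", "18174_at"]

# split_probeset_id is a hypothetical helper (not from any library).
# It returns the numeric stem of an ID of the form "<digits>_at",
# or nil when the ID does not have that shape.
def split_probeset_id(id)
  m = /\A(\d+)_at\z/.match(id)
  m ? m[1] : nil
end

stems = ids.map { |id| split_probeset_id(id) }
```

The stems could then be looked up as gene IDs in whichever database you trust; that lookup is deliberately left out here.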

The data are from GEO series GSE56257. The microarray platform is GPL17777. Description: “This is identical to GPL6246 but a custom cdf environment was used to extract data. The cdf can be found at the link below.”

Uh-oh. Alarm bells.

A minor update to my “apply functions” post

One of my more popular posts is A brief introduction to “apply” in R. Come August, it will be four years old. Technology moves on, old blog posts do not.

So: thanks to BioStar user zx8754 for pointing me to this Stack Overflow post, in which someone complains that the code in the post does not work as described. The “by” example is now fixed.

Side note: I often find “contact the author” is the most direct approach to solving this kind of problem ;) I’m always happy to be contacted.

How to: remember that you once knew how to parse KEGG

Recently, someone asked me if I could generate a list of genes associated with a particular pathway. Sure, I said, and hacked together some rather nasty code in R which, given a KEGG pathway identifier, used a combination of the KEGG REST API, DBGET and biomaRt to return HGNC symbols.

Coincidentally, someone asked the same question at Biostar. Pierre recommended the TogoWS REST service, which provides an API to multiple biological data sources. An article describing TogoWS was published in 2010.

An excellent suggestion – and one which, I later discovered, I had bookmarked. Twice. As long ago as 2008. This “rediscovery of things I once knew” happens to me with increasing frequency now, which makes me wonder whether (1) we really are drowning in information, (2) my online curation tools/methods require improvement or (3) my mind is not what it was. Perhaps some combination of all three.

Anyway – using Ruby (1.8.7), getting a list of HGNC symbols for a KEGG pathway, e.g. MAPK signaling, is as simple as:

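require 'rubygems'
require 'open-uri'
require 'json/pure'   # plain require 'json' also works on Ruby 1.9+

# fetch the genes for KEGG pathway hsa04010 (MAPK signaling) as JSON
# (on Ruby 3.0+, open no longer accepts URLs; call URI.open instead)
j = JSON.parse(open("http://togows.dbcls.jp/entry/pathway/hsa04010/genes.json").read)
# keep the text before the first semicolon of each value: the HGNC symbol
g = j.first.values.map {|v| /^(.*?);/.match(v)[1] }
# first 5 genes
g[0..4]
# ["MAP3K14", "FGF17", "FGF6", "DUSP9", "MAP3K6"]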

This code parses the JSON returned from TogoWS into an array with one element; the element is a hash with key/value pairs of the form:

"9020"=>"MAP3K14; mitogen-activated protein kinase kinase kinase 14 [KO:K04466] [EC:2.7.11.25]"

Values for all keys that I’ve seen to date begin with the HGNC symbol followed by a semicolon, making extraction quite straightforward with a simple regular expression.
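
Since “all keys that I’ve seen to date” is not a guarantee, a slightly more defensive version of the extraction might be worth having: the one-liner above raises NoMethodError as soon as a value lacks a semicolon. The following is a minimal sketch, not a tested replacement. extract_symbols is a hypothetical helper name, and the second hash entry is made up purely to exercise the guard.

```ruby
# Sample data in the shape shown above; in practice this would be j.first.
# The "0000" entry is invented for illustration and is not a real KEGG record.
genes = {
  "9020" => "MAP3K14; mitogen-activated protein kinase kinase kinase 14 [KO:K04466] [EC:2.7.11.25]",
  "0000" => "hypothetical entry with no semicolon"
}

# extract_symbols is a hypothetical helper (not part of TogoWS or any library).
# It returns two arrays: the extracted symbols, and the keys whose values
# did not match the "SYMBOL; description" pattern.
def extract_symbols(genes)
  symbols = []
  skipped = []
  genes.each do |id, description|
    m = /^(.*?);/.match(description)
    if m
      symbols << m[1]
    else
      skipped << id
    end
  end
  [symbols, skipped]
end

symbols, skipped = extract_symbols(genes)
```

Keeping the skipped keys rather than silently dropping them makes it easy to see whether the “symbol first, then semicolon” assumption still holds for a given pathway.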

Boring, monotonous day-to-day tasks? That’s synonymous with bioinformatics.

In response to this question, I can only point out that J.C.R. Licklider figured it out over 50 years ago:

Despite the fact that there is a voluminous literature on thinking and problem solving, including intensive case-history studies of the process of invention, I could find nothing comparable to a time-and-motion-study analysis of the mental work of a person engaged in a scientific or technical enterprise. In the spring and summer of 1957, therefore, I tried to keep track of what one moderately technical person actually did during the hours he regarded as devoted to work. Although I was aware of the inadequacy of the sampling, I served as my own subject.

It soon became apparent that the main thing I did was to keep records, and the project would have become an infinite regress if the keeping of records had been carried through in the detail envisaged in the initial plan. It was not. Nevertheless, I obtained a picture of my activities that gave me pause. Perhaps my spectrum is not typical–I hope it is not, but I fear it is.

About 85 per cent of my “thinking” time was spent getting into a position to think, to make a decision, to learn something I needed to know. Much more time went into finding or obtaining information than into digesting it. Hours went into the plotting of graphs, and other hours into instructing an assistant how to plot. When the graphs were finished, the relations were obvious at once, but the plotting had to be done in order to make them so. At one point, it was necessary to compare six experimental determinations of a function relating speech-intelligibility to speech-to-noise ratio. No two experimenters had used the same definition or measure of speech-to-noise ratio. Several hours of calculating were required to get the data into comparable form. When they were in comparable form, it took only a few seconds to determine what I needed to know.

Throughout the period I examined, in short, my “thinking” time was devoted mainly to activities that were essentially clerical or mechanical: searching, calculating, plotting, transforming, determining the logical or dynamic consequences of a set of assumptions or hypotheses, preparing the way for a decision or an insight. Moreover, my choices of what to attempt and what not to attempt were determined to an embarrassingly great extent by considerations of clerical feasibility, not intellectual capability.

Popular topics at the BioStar Q&A site

Which topics are the most popular at the BioStar bioinformatics Q&A site?

One source of data is the tags used for questions. Tags are somewhat arbitrary of course, but fortunately BioStar has quite an active community, so “bad” tags are usually edited to improve them. Hint: if your question is “How to find SNPs”, then tagging it with “how, to, find, snps” won’t win you any admirers.

OK: we’re going to grab the tags then use a bunch of R packages (XML, wordcloud and ggplot2) to take a quick look.


New publication: analysis of large protein assemblies in macrophage cytoplasm

First for 2011:

Proteomic and electron microscopy survey of large assemblies in macrophage cytoplasm.
Maco, B., Ross, I.L., Landsberg, M., Mouradov, D., Saunders, N.F.W., Hankamer, B. and Kobe, B. (2011)
Molecular & Cellular Proteomics, in press, doi:10.1074/mcp.M111.008763

This is an in-press article which is freely available just now (although strangely, the supplemental data are not). I’m pleased to note that we also made the raw data available in Proteome Commons. In fact, it was a condition of publication.

Lots of hard work went into this one. My contribution was quite minor: some bioinformatic analysis and hacking away at PyMsXML to make it work with newer versions of vendor formats. I’d like to thank Brad Chapman, who provided invaluable advice on PyMsXML via BioStar.

Real bioinformaticians write code

A lot of questions at BioStar begin along these lines:

Where can I find…?
I am looking for a resource…?
Is there some database…?

I tweeted some concerns about this:

Many #biostar questions begin “I am looking for a resource..”. The answer is often that you need to code a solution using the data you have.

Chris tweeted back:

@neilfws Lit. or Google search is first step, asking around is the next logical step. (Re-)inventing wheels is last. Worth asking, IMHO.

We had a little chat and I realised that 140 characters or less was not getting my point across (not for the first time). What I was trying to say was something like this.