Here’s a new way to abuse biological information: take a list of gene IDs and use them to create a completely fictitious, but very convincing, set of microarray probeset IDs.
This one begins with a question at BioStars, concerning the conversion of Affymetrix probeset IDs to gene names. Being a “convert ID X to ID Y” question, the obvious answer is “try BioMart” and indeed the microarray platform ([MoGene-1_0-st] Affymetrix Mouse Gene 1.0 ST) is available in the Ensembl database.
However, things get weird when we examine some example probeset IDs: 73649_at, 17921_at, 18174_at. One of the answers to the question notes that these do not map to mouse.
The data are from GEO series GSE56257. The microarray platform is GPL17777. Description: “This is identical to GPL6246 but a custom cdf environment was used to extract data. The cdf can be found at the link below.”
Uh-oh. Alarm bells.
Scrolling down to the data table we see this:
ID              SPOT_ID      Description
100008567_at    100008567    predicted gene 14964
100009600_at    100009600    zinc finger, GATA-like protein 1
100009609_at    100009609    vomeronasal 2, receptor 65

The numerical prefix of ID matches the SPOT_ID. However, the SPOT_ID is hyperlinked – to the Entrez Gene database. So the authors have taken Entrez Gene IDs and used them to create what look like authentic Affymetrix probeset IDs – but are not. No wonder the author of this question is struggling to map them to genes.
One answer then is to use the Entrez Gene IDs in the BioMart query (see image, right, for results).
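Recovering the Entrez Gene IDs from these invented identifiers is just a matter of stripping the suffix. Here is a minimal Python sketch, assuming every custom ID has the form "digits followed by _at", as in the GPL17777 table. The function name and the regular expression are my own, for illustration only; they are not part of any Affymetrix, GEO or BioMart tool.

```python
import re

# Assumed pattern: a run of digits (the Entrez Gene ID) followed by "_at".
# This is inferred from the example IDs in the post, not from any specification.
CUSTOM_ID = re.compile(r"^(\d+)_at$")


def entrez_id_from_custom_probeset(probeset_id):
    """Return the Entrez Gene ID embedded in a custom-CDF ID, or None if it does not fit."""
    match = CUSTOM_ID.match(probeset_id.strip())
    return match.group(1) if match else None


# Example IDs taken from the BioStars question and the GEO platform table.
ids = ["73649_at", "17921_at", "18174_at", "100008567_at"]
entrez_ids = [entrez_id_from_custom_probeset(i) for i in ids]
print(entrez_ids)  # ['73649', '17921', '18174', '100008567']
```

The resulting list can be pasted into BioMart as an Entrez Gene ID filter. Note that the pattern cannot tell an invented ID from a genuine probeset ID that happens to look the same, so it should only be applied to a platform that is documented as using a custom CDF like this one.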
Today’s lesson then: don’t just “invent” IDs. Especially IDs that resemble real IDs, but which are not.