Create your own gene IDs! No wait. Don’t.

Here’s a new way to abuse biological information: take a list of gene IDs and use them to create a completely fictitious but very convincing set of microarray probeset IDs.

This one begins with a question at BioStars, concerning the conversion of Affymetrix probeset IDs to gene names. Being a “convert ID X to ID Y” question, the obvious answer is “try BioMart” and indeed the microarray platform ([MoGene-1_0-st] Affymetrix Mouse Gene 1.0 ST) is available in the Ensembl database.

However, things get weird when we examine some example probeset IDs: 73649_at, 17921_at, 18174_at. One of the answers to the question notes that these do not map to mouse.

The data are from GEO series GSE56257. The microarray platform is GPL17777. Description: “This is identical to GPL6246 but a custom cdf environment was used to extract data. The cdf can be found at the link below.”

Uh-oh. Alarm bells.

Scrolling down to the data table we see this:

ID              SPOT_ID         Description
100008567_at    100008567       predicted gene 14964
100009600_at    100009600       zinc finger, GATA-like protein 1
100009609_at    100009609       vomeronasal 2, receptor 65

Entrez Gene ID to WikiGene Name conversion


The numerical prefix of ID matches the SPOT_ID. However, the SPOT_ID is hyperlinked – to the Entrez Gene database. So the authors have taken Entrez Gene IDs and used them to create what look like authentic Affymetrix probeset IDs – but are not. No wonder the author of this question is struggling to map them to genes.
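To make the relationship concrete, here is a minimal sketch of how one might recover the Entrez Gene IDs from these made-up probeset IDs. It assumes every custom ID is simply digits followed by `_at`, as in the examples above; the function name and the regular expression are mine, not anything supplied by the authors or by GEO.

```python
import re

# Assumed shape of the custom IDs in GPL17777: digits followed by "_at".
CUSTOM_ID = re.compile(r"^(\d+)_at$")


def entrez_id_from_custom_probeset(probeset_id):
    """Return the numeric prefix as a string, or None if the ID has another shape."""
    match = CUSTOM_ID.match(probeset_id.strip())
    return match.group(1) if match else None


# Example IDs taken from the question and from the GPL17777 data table.
ids = ["100008567_at", "100009600_at", "100009609_at", "73649_at"]
recovered = [entrez_id_from_custom_probeset(i) for i in ids]
print(recovered)
```

Note that a shape check like this cannot tell a custom ID from a genuine Affymetrix probeset ID with the same form, which is exactly why these IDs are so misleading.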

One answer, then, is to use the Entrez Gene IDs (the SPOT_ID column) in the BioMart query (see image, right, for results).
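For anyone who prefers a script to the web interface, the same query can be written in BioMart's XML query format. The sketch below only builds the query string; it does not contact any server. The dataset name `mmusculus_gene_ensembl`, the filter name `entrezgene_id` and the attribute names are assumptions that may differ between Ensembl releases, so check them against the BioMart instance you are using. The function name is hypothetical.

```python
from xml.sax.saxutils import quoteattr


def biomart_query_xml(entrez_ids,
                      dataset="mmusculus_gene_ensembl",      # assumed dataset name
                      id_filter="entrezgene_id",             # assumed filter name
                      attributes=("entrezgene_id", "external_gene_name")):  # assumed
    """Build a BioMart XML query that filters on a list of Entrez Gene IDs."""
    # BioMart accepts several filter values as one comma-separated string.
    value = ",".join(entrez_ids)
    attribute_xml = "".join(
        f"<Attribute name={quoteattr(name)}/>" for name in attributes
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        "<!DOCTYPE Query>"
        '<Query virtualSchemaName="default" formatter="TSV" header="1" '
        'uniqueRows="1" datasetConfigVersion="0.6">'
        f'<Dataset name={quoteattr(dataset)} interface="default">'
        f"<Filter name={quoteattr(id_filter)} value={quoteattr(value)}/>"
        f"{attribute_xml}"
        "</Dataset></Query>"
    )


# Entrez Gene IDs taken from the SPOT_ID column shown earlier.
query = biomart_query_xml(["100008567", "100009600", "100009609"])
print(query)
```

The resulting string would then be submitted as the `query` parameter to a BioMart martservice endpoint; that step is left out here because it depends on which server and release you target.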

Today’s lesson, then: don’t just “invent” IDs, especially ones that resemble real IDs but are not.