June 23, 2004. BMC Bioinformatics publishes “Mistaken Identifiers: Gene name errors can be introduced inadvertently when using Excel in bioinformatics”. We roll our eyes. Do people really do that? Is it really worthy of publication? However, we admit that if it happens then it’s good that people know about it.
October 17, 2012. A colleague on our internal Yammer network writes:
Sad but true. I keep finding newbie bioinformatics errors in the Cancer Genome Atlas project data. This time a text download of 450K methylation from the Cancer Genome Atlas project reveals that Excel has had its evil way with the data at some point. Gene names such as MAR1, DEC1, OCT4 and SEPT9 are now reformatted as dates.
For example:barcode probe name beta value gene symbol chromosome position TCGA-06-0125-02A-11D-2004-05 cg13918206 0.92035091902012 1-Dec 9 118159781
I click through the CGA data portal in search of more datasets and choose, more or less at random, another file containing data from the Illumina 450K platform. It’s called jhu-usc.edu__HumanMethylation450__TCGA-G4-6628-01A-11D-1837-05__methylation_analysis.txt. Let’s get that into R:
tcga <- read.table("jhu-usc.edu__HumanMethylation450__TCGA-G4-6628-01A-11D-1837-05__methylation_analysis.txt", sep = "\t", header = T) dim(tcga) #  485577 6 head(tcga$gene.symbol) #  1-Dec 1-Dec 1-Dec 1-Dec 1-Dec 1-Dec # 25107 Levels: 10-Mar 11-Mar 11-Sep 13-Sep 14-Sep 1-Dec 1-Mar 1-Sep ... ZZZ3
Seems we have a problem. Let’s count up gene names:
genes <- as.data.frame(table(tcga$gene.symbol)) head(genes, 20) Var1 Freq # 1 119652 # 2 10-Mar 5 # 3 11-Mar 21 # 4 11-Sep 32 # 5 13-Sep 18 # 6 14-Sep 4 # 7 1-Dec 8 # 8 1-Mar 30 # 9 1-Sep 10 # 10 3-Mar 33 # 11 4-Mar 25 # 12 5-Mar 4 # 13 5-Sep 12 # 14 6-Mar 21 # 15 7-Mar 13 # 16 8-Sep 2 # 17 9-Mar 6 # 18 9-Sep 16 # 19 A1BG 6 # 20 A1CF 5
Yes, we have a problem.
“Newbie bioinformaticians” is one thing. Large institutes, awarded millions of dollars to contribute to “big science” projects is another.
Despair at the quality of public data, fears about reproducibility in science. Must be Monday.
6 thoughts on “Gene name errors and Excel: lessons not learned”
Omg I have been trying to educate colleagues about this error for years and most people think it is boring. It is sad that it still happens and certainly in these big projects.
Pingback: Software Carpentry » Excel Isn’t Intrinsically Evil
I will admit I consider myself a “newbie bioinformatician”, and though I’ve figured this particular problem out for myself, I think it’s important to at least state it publicly to prevent other newbies from falling in the same old trap. Of course, avoiding Excel and all of its project-ruining ways is another option.
Don’t assume this problem only affects newbie bioinformaticians or Excel users – this error likely originated from the platform annotation for this chip from GEO, and I’ve found it to be widespread even in official annotations. If you’re using gene symbols from any source, it would be a good idea to run them through the HGNChelper R package to identify and correct unofficial gene symbols.
(trying again with the sourcecode markup)
The Data Coordinating Center (DCC) of TCGA has long been aware of this issue. As we have encountered excel genes, we ask the centers that submit the data to resubmit corrected files. The corrected files become the latest version of the data, and DCC applications only provide the latest versions. However, some older versions of data can be accessed directly via wget or other means, and some of these obsolete files do have the problem–as noted in this blog.
A scan last week of 125000 latest level 3 data was clean for excel genes. A scan of older versions (175000 files) did reveal a few that contain excel genes.
My own opinion (not that of TCGA or the DCC, but based on my experience there) is that in a large program like this, involving many different institutions with different levels of bioinformatics experience, true, but also many different agendas and priorities, you are going to have an non-zero level of issues like this at any given time. Since our job is to keep the formats clean and bright, I appreciate this critique.
Anyone can raise format issues or questions about TCGA data by writing TCGA-DCC-BINF-L@list.nih.gov, and I encourage you to do so.
I posted how to avoid this problem during an Excel load of text gene symbols at http://badassdatascience.com/2013/01/03/excel-mangles-ncbi-gene-symbols/ for anybody interested.
Comments are closed.