When life gives you coloured cells, make categories

Let’s start by making one thing clear. Using coloured cells in Excel to encode different categories of data is wrong. Next time colleagues explain excitedly how “green equals normal and red = tumour”, you must explain that (1) they have sinned and (2) what they meant to do was add a column containing the words “normal” and “tumour”.

I almost hesitate to write this post but…we have to deal with the world as it is, not as we would like it to be. So in the interests of just getting the job done: here’s one way to deal with coloured cells in Excel, should someone send them your way.

I have created a simple Excel workbook. It contains one sheet, in which you will find one column headed “x” followed by the numbers 1-10. The cells A2:A11 are filled with colour and alternate: red-green-red-green-red-green-red-green-red-green.

We’ll read it into R using the xlsx package and extract the worksheet.

library(xlsx)
wb     <- loadWorkbook("test.xlsx")
sheet1 <- getSheets(wb)[[1]]

Now we can get the rows and the cells.

# get all rows
rows  <- getRows(sheet1)
cells <- getCells(rows)
# quick look at the values
sapply(cells, getCellValue)
#  1.1  2.1  3.1  4.1  5.1  6.1  7.1  8.1  9.1 10.1 11.1 
#  "x"  "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9" "10" 

Colour information is contained in the cell styles.

styles <- sapply(cells, getCellStyle)

Somewhat confusingly, although the Excel GUI specifies “fill background”, the property is accessed using getFillForegroundXSSFColor().

Here’s a little function which, given a cell, returns the RGB code for the fill colour. The header cell has no fill colour, so getRgb() returns an error that we can catch.

cellColor <- function(style) {
    fg  <- style$getFillForegroundXSSFColor()
    rgb <- tryCatch(fg$getRgb(), error = function(e) NULL)
    rgb <- paste(rgb, collapse = "")
    return(rgb)
}

Applying that to our styles gives this:

sapply(styles, cellColor)
     1.1      2.1      3.1      4.1      5.1      6.1      7.1      8.1      9.1     10.1 
      "" "ff0000" "00ff00" "ff0000" "00ff00" "ff0000" "00ff00" "ff0000" "00ff00" "ff0000" 
    11.1 
"00ff00" 

Knowing that, for example, green = 00ff00 = normal and red = ff0000 = tumour, we can now generate category labels something like this:

pheno <- list(normal = "00ff00", tumour = "ff0000")
m     <- match(sapply(styles, cellColor), pheno)
labs  <-names(pheno)[m]
labs
 [1] NA       "tumour" "normal" "tumour" "normal" "tumour" "normal" "tumour" "normal"
[10] "tumour" "normal"

and get on with our lives. Or at least, our analyses.

8 thoughts on “When life gives you coloured cells, make categories

  1. Jegar

    Thank you for this! This is an issue in so many disciplines- it is so hard to get across that having good human-readable documentation doesn’t make it easily machine-readable. It is even harder to communicate that machine readable is more important that human-readable, especially when you have thousands of entries!

  2. Ricky

    do you have a way of reading in the data if the cells are conditionally formatted rather than the colour being hardcoded?

    1. nsaunders Post author

      Not sure what you mean by “conditionally formatted” – an example?

      But in general, if it’s a property of the spreadsheet XML, then it should be accessible to R.

      1. Carl Witthoft

        “Conditional formatting” means there’s a rule applied to the cell. This rule can be based on the cell’s value, the value of other cells, or just about any Excel expression which evaluates to TRUE/FALSE. When the condition is met, the specified formats, including cell color, are set.

        1. nsaunders Post author

          Good explanation, thanks (you can tell I don’t use Excel). I don’t know if that information would be accessible to R; will have to investigate.

        2. Bruce Bolt

          As the conditional formatting is based on the cell values, you could easily reproduce these using R (or your favourite language) anyway. I guess programatically accessing the rules behind the conditional formatting coud be useful though.

        3. nsaunders Post author

          Yes, a quick Google search suggests using R/xlsx to create conditional formatting. I didn’t spot anything in the cell or style properties when I looked.

  3. Pingback: Excel Roundup 20140811 « Contextures Blog

Comments are closed.