Basic R: rows that contain the maximum value of a variable

File under “I keep forgetting how to do this basic, frequently-required task, so I’m writing it down here.”

Let’s create a data frame which contains five variables, vars, named A – E, each of which appears twice, along with some measurements:

df.orig <- data.frame(vars = rep(LETTERS[1:5], 2), obs1 = c(1:10), obs2 = c(11:20))
df.orig
#    vars obs1 obs2
# 1     A    1   11
# 2     B    2   12
# 3     C    3   13
# 4     D    4   14
# 5     E    5   15
# 6     A    6   16
# 7     B    7   17
# 8     C    8   18
# 9     D    9   19
# 10    E   10   20

Now, let’s say we want only the rows that contain the maximum values of obs1 for A – E. In bioinformatics, for example, we might be interested in selecting the microarray probeset with the highest sample variance from multiple probesets per gene. The answer is obvious in this trivial example (6 – 10), but one procedure looks like this:

# use aggregate to create new data frame with the maxima
df.agg <- aggregate(obs1 ~ vars, df.orig, max)
# then simply merge with the original
df.max <- merge(df.agg, df.orig)
df.max
#   vars obs1 obs2
# 1    A    6   16
# 2    B    7   17
# 3    C    8   18
# 4    D    9   19
# 5    E   10   20

This also works using min() and, I guess, using any function that returns a single value per variable mapping to a value in the original data frame.

With thanks to this mailing list thread.

9 thoughts on “Basic R: rows that contain the maximum value of a variable

  1. You can also use plyr in one step, yeh? And there’s even a progress bar.

    library(plyr)
    ddply(df.orig,.(var),summarise,max=max(obs1))

    However, I’m having a slight amount of trouble getting the obs2 variable to show up in a sensible way. I can do

    ddply(df.orig,.(vars),summarise,max=max(obs2),obs1=obs1[which.max(obs2)])

    but it seems kludgy. I figured this would work, but doesn’t. Maybe a bug?

    ddply(df.orig,.(vars),summarise,max=max(obs2),.drop=FALSE)

  2. This might be an alternative:

    library(plyr)
    df.max2 <- ddply(.data = df.orig, .variables = .(vars), subset, obs1 == obs1[which.max(obs1)])
    all.equal(df.max, df.max2)

  3. even shorter:
    library(plyr)
    df.max3 <- ddply(df.orig, .(vars), subset, obs1==max(obs1))
    all.equal(df.max2, df.max3)

  4. If there are ties, and you only want one unique max returned per var, you may run into a problem. I had to deal with a similar problem, but in my case I wanted to pick only one true max per var. To work around this issue I use this code (still all in base R).

    df.orig <- data.frame(vars = rep(LETTERS[1:5], 2), obs1 = c(1:10), obs2 = c(11:20))
    df.orig <- rbind(df.orig, data.frame(vars = 'A', obs1 = '6', obs2 = '15')) ## create some ties
    df.orig <- rbind(df.orig, data.frame(vars = 'A', obs1 = '6', obs2 = '16')) ## more ties

    df.orig <- df.orig[order(df.orig$vars, df.orig$obs1, df.orig$obs2),]
    row.names(df.orig) <- seq(1,nrow(df.orig))
    x1 <- match(df.orig$vars, df.orig$vars)
    index <- as.numeric(tapply(row.names(df.orig), x1, FUN=tail, n=1))
    df.max <- df.orig[index,]

  5. Even faster is to use data.table:
    library(data.table)
    data.table(df.orig)->dt
    dt[,.SD[which.max(obs1)],by=id][]->dt.max
    data.frame(dt.max)->df.max # or merge this into the above command

    plyr is also good, but I recently had to do the above for about 40000 subgroups and plyr is much too slow for that. data.table is very fast.

  6. Pingback: Picking one observation per ‘subject’ based on max (or min) | failuretoconverge

Comments are closed.