Basic R: rows that contain the maximum value of a variable

February 13, 2013 / nsaunders

File under “I keep forgetting how to do this basic, frequently-required task, so I’m writing it down here.”

Let’s create a data frame which contains five variables, vars, named A – E, each of which appears twice, along with some measurements:

df.orig <- data.frame(vars = rep(LETTERS[1:5], 2), obs1 = c(1:10), obs2 = c(11:20))
df.orig
#    vars obs1 obs2
# 1     A    1   11
# 2     B    2   12
# 3     C    3   13
# 4     D    4   14
# 5     E    5   15
# 6     A    6   16
# 7     B    7   17
# 8     C    8   18
# 9     D    9   19
# 10    E   10   20

Now, let’s say we want only the rows that contain the maximum values of obs1 for A – E. In bioinformatics, for example, we might be interested in selecting the microarray probeset with the highest sample variance from multiple probesets per gene. The answer is obvious in this trivial example (6 – 10), but one procedure looks like this:

# use aggregate to create new data frame with the maxima
df.agg <- aggregate(obs1 ~ vars, df.orig, max)
# then simply merge with the original
df.max <- merge(df.agg, df.orig)
df.max
#   vars obs1 obs2
# 1    A    6   16
# 2    B    7   17
# 3    C    8   18
# 4    D    9   19
# 5    E   10   20

This also works using min() and, I guess, using any function that returns a single value per variable mapping to a value in the original data frame.

With thanks to this mailing list thread.

9 thoughts on “Basic R: rows that contain the maximum value of a variable”

David Manheim

February 13, 2013 at 14:59

I needed something like this a couple weeks ago, and this is much cleaner than what Stack Overflow came up with for me. It’s a nice solution; it doesn’t give the row numbers in the original data, which the linked discussion does.

http://stackoverflow.com/questions/14506919/how-do-i-find-the-minimum-row-number-for-each-factor-in-r/14507268

(In case anyone else has the problem I had, and sees this post.)
benjwoodcroft

February 13, 2013 at 17:30

You can also use plyr in one step, yeh? And there’s even a progress bar.

library(plyr)
ddply(df.orig,.(var),summarise,max=max(obs1))

However, I’m having a slight amount of trouble getting the obs2 variable to show up in a sensible way. I can do

ddply(df.orig,.(vars),summarise,max=max(obs2),obs1=obs1[which.max(obs2)])

but it seems kludgy. I figured this would work, but doesn’t. Maybe a bug?

ddply(df.orig,.(vars),summarise,max=max(obs2),.drop=FALSE)
Henrik

February 13, 2013 at 17:55

This might be an alternative:

library(plyr)
df.max2 <- ddply(.data = df.orig, .variables = .(vars), subset, obs1 == obs1[which.max(obs1)])
all.equal(df.max, df.max2)
nsaunders

February 13, 2013 at 18:49

Let the “how about this way instead?” comments commence :) No seriously, keep them coming.
Matthieu Dubois

February 13, 2013 at 21:12

even shorter:
library(plyr)
df.max3 <- ddply(df.orig, .(vars), subset, obs1==max(obs1))
all.equal(df.max2, df.max3)
connerpharmd

February 14, 2013 at 03:04

If there are ties, and you only want one unique max returned per var, you may run into a problem. I had to deal with a similar problem, but in my case I wanted to pick only one true max per var. To work around this issue I use this code (still all in base R).

df.orig <- data.frame(vars = rep(LETTERS[1:5], 2), obs1 = c(1:10), obs2 = c(11:20))
df.orig <- rbind(df.orig, data.frame(vars = 'A', obs1 = '6', obs2 = '15')) ## create some ties
df.orig <- rbind(df.orig, data.frame(vars = 'A', obs1 = '6', obs2 = '16')) ## more ties

df.orig <- df.orig[order(df.orig$vars, df.orig$obs1, df.orig$obs2),]
row.names(df.orig) <- seq(1,nrow(df.orig))
x1 <- match(df.orig$vars, df.orig$vars)
index <- as.numeric(tapply(row.names(df.orig), x1, FUN=tail, n=1))
df.max <- df.orig[index,]
Jeremy

February 14, 2013 at 12:49

Even faster is to use data.table:
library(data.table)
data.table(df.orig)->dt
dt[,.SD[which.max(obs1)],by=id][]->dt.max
data.frame(dt.max)->df.max # or merge this into the above command

plyr is also good, but I recently had to do the above for about 40000 subgroups and plyr is much too slow for that. data.table is very fast.
Arthur Aguirre

February 15, 2013 at 00:39

Cool thanks for a quick examples..
Pingback: Picking one observation per ‘subject’ based on max (or min) | failuretoconverge