File under “I keep forgetting how to do this basic, frequently-required task, so I’m writing it down here.”

Let’s create a data frame which contains five variables, *vars*, named A – E, each of which appears twice, along with some measurements:

df.orig <- data.frame(vars = rep(LETTERS[1:5], 2), obs1 = c(1:10), obs2 = c(11:20))
df.orig
# vars obs1 obs2
# 1 A 1 11
# 2 B 2 12
# 3 C 3 13
# 4 D 4 14
# 5 E 5 15
# 6 A 6 16
# 7 B 7 17
# 8 C 8 18
# 9 D 9 19
# 10 E 10 20

Now, let’s say we want only the rows that contain the maximum values of *obs1* for A – E. In bioinformatics, for example, we might be interested in selecting the microarray probeset with the highest sample variance from multiple probesets per gene. The answer is obvious in this trivial example (6 – 10), but one procedure looks like this:

# use aggregate to create new data frame with the maxima
df.agg <- aggregate(obs1 ~ vars, df.orig, max)
# then simply merge with the original
df.max <- merge(df.agg, df.orig)
df.max
# vars obs1 obs2
# 1 A 6 16
# 2 B 7 17
# 3 C 8 18
# 4 D 9 19
# 5 E 10 20

This also works using *min()* and, I guess, using any function that returns a single value per variable mapping to a value in the original data frame.

With thanks to this mailing list thread.

### Like this:

Like Loading...

*Related*

I needed something like this a couple weeks ago, and this is much cleaner than what Stack Overflow came up with for me. It’s a nice solution; it doesn’t give the row numbers in the original data, which the linked discussion does.

http://stackoverflow.com/questions/14506919/how-do-i-find-the-minimum-row-number-for-each-factor-in-r/14507268

(In case anyone else has the problem I had, and sees this post.)

You can also use plyr in one step, yeh? And there’s even a progress bar.

library(plyr)

ddply(df.orig,.(var),summarise,max=max(obs1))

However, I’m having a slight amount of trouble getting the obs2 variable to show up in a sensible way. I can do

ddply(df.orig,.(vars),summarise,max=max(obs2),obs1=obs1[which.max(obs2)])

but it seems kludgy. I figured this would work, but doesn’t. Maybe a bug?

ddply(df.orig,.(vars),summarise,max=max(obs2),.drop=FALSE)

This might be an alternative:

library(plyr)

df.max2 <- ddply(.data = df.orig, .variables = .(vars), subset, obs1 == obs1[which.max(obs1)])

all.equal(df.max, df.max2)

Let the “how about this way instead?” comments commence :) No seriously, keep them coming.

even shorter:

library(plyr)

df.max3 <- ddply(df.orig, .(vars), subset, obs1==max(obs1))

all.equal(df.max2, df.max3)

If there are ties, and you only want one unique max returned per var, you may run into a problem. I had to deal with a similar problem, but in my case I wanted to pick only one true max per var. To work around this issue I use this code (still all in base R).

df.orig <- data.frame(vars = rep(LETTERS[1:5], 2), obs1 = c(1:10), obs2 = c(11:20))

df.orig <- rbind(df.orig, data.frame(vars = 'A', obs1 = '6', obs2 = '15')) ## create some ties

df.orig <- rbind(df.orig, data.frame(vars = 'A', obs1 = '6', obs2 = '16')) ## more ties

df.orig <- df.orig[order(df.orig$vars, df.orig$obs1, df.orig$obs2),]

row.names(df.orig) <- seq(1,nrow(df.orig))

x1 <- match(df.orig$vars, df.orig$vars)

index <- as.numeric(tapply(row.names(df.orig), x1, FUN=tail, n=1))

df.max <- df.orig[index,]

Even faster is to use data.table:

library(data.table)

data.table(df.orig)->dt

dt[,.SD[which.max(obs1)],by=id][]->dt.max

data.frame(dt.max)->df.max # or merge this into the above command

plyr is also good, but I recently had to do the above for about 40000 subgroups and plyr is much too slow for that. data.table is very fast.

Cool thanks for a quick examples..

Pingback: Picking one observation per ‘subject’ based on max (or min) | failuretoconverge