File under “I keep forgetting how to do this basic, frequently-required task, so I’m writing it down here.”
Let’s create a data frame which contains five variables, vars, named A – E, each of which appears twice, along with some measurements:
df.orig <- data.frame(vars = rep(LETTERS[1:5], 2), obs1 = c(1:10), obs2 = c(11:20)) df.orig # vars obs1 obs2 # 1 A 1 11 # 2 B 2 12 # 3 C 3 13 # 4 D 4 14 # 5 E 5 15 # 6 A 6 16 # 7 B 7 17 # 8 C 8 18 # 9 D 9 19 # 10 E 10 20
Now, let’s say we want only the rows that contain the maximum values of obs1 for A – E. In bioinformatics, for example, we might be interested in selecting the microarray probeset with the highest sample variance from multiple probesets per gene. The answer is obvious in this trivial example (6 – 10), but one procedure looks like this:
# use aggregate to create new data frame with the maxima df.agg <- aggregate(obs1 ~ vars, df.orig, max) # then simply merge with the original df.max <- merge(df.agg, df.orig) df.max # vars obs1 obs2 # 1 A 6 16 # 2 B 7 17 # 3 C 8 18 # 4 D 9 19 # 5 E 10 20
This also works using min() and, I guess, using any function that returns a single value per variable mapping to a value in the original data frame.
With thanks to this mailing list thread.
I needed something like this a couple weeks ago, and this is much cleaner than what Stack Overflow came up with for me. It’s a nice solution; it doesn’t give the row numbers in the original data, which the linked discussion does.
http://stackoverflow.com/questions/14506919/how-do-i-find-the-minimum-row-number-for-each-factor-in-r/14507268
(In case anyone else has the problem I had, and sees this post.)
You can also use plyr in one step, yeh? And there’s even a progress bar.
library(plyr)
ddply(df.orig,.(var),summarise,max=max(obs1))
However, I’m having a slight amount of trouble getting the obs2 variable to show up in a sensible way. I can do
ddply(df.orig,.(vars),summarise,max=max(obs2),obs1=obs1[which.max(obs2)])
but it seems kludgy. I figured this would work, but doesn’t. Maybe a bug?
ddply(df.orig,.(vars),summarise,max=max(obs2),.drop=FALSE)
This might be an alternative:
library(plyr)
df.max2 <- ddply(.data = df.orig, .variables = .(vars), subset, obs1 == obs1[which.max(obs1)])
all.equal(df.max, df.max2)
Let the “how about this way instead?” comments commence :) No seriously, keep them coming.
even shorter:
library(plyr)
df.max3 <- ddply(df.orig, .(vars), subset, obs1==max(obs1))
all.equal(df.max2, df.max3)
If there are ties, and you only want one unique max returned per var, you may run into a problem. I had to deal with a similar problem, but in my case I wanted to pick only one true max per var. To work around this issue I use this code (still all in base R).
df.orig <- data.frame(vars = rep(LETTERS[1:5], 2), obs1 = c(1:10), obs2 = c(11:20))
df.orig <- rbind(df.orig, data.frame(vars = 'A', obs1 = '6', obs2 = '15')) ## create some ties
df.orig <- rbind(df.orig, data.frame(vars = 'A', obs1 = '6', obs2 = '16')) ## more ties
df.orig <- df.orig[order(df.orig$vars, df.orig$obs1, df.orig$obs2),]
row.names(df.orig) <- seq(1,nrow(df.orig))
x1 <- match(df.orig$vars, df.orig$vars)
index <- as.numeric(tapply(row.names(df.orig), x1, FUN=tail, n=1))
df.max <- df.orig[index,]
Even faster is to use data.table:
library(data.table)
data.table(df.orig)->dt
dt[,.SD[which.max(obs1)],by=id][]->dt.max
data.frame(dt.max)->df.max # or merge this into the above command
plyr is also good, but I recently had to do the above for about 40000 subgroups and plyr is much too slow for that. data.table is very fast.
Cool thanks for a quick examples..
Pingback: Picking one observation per ‘subject’ based on max (or min) | failuretoconverge