R -- How can I calculate group means for a list of data frames, using a different subset condition to calculate each mean?
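I have a list of three data frames, and I want to produce another list of three data frames whose rows consist of each value of the grouping variable (g1) and the by-g1 means of six variables. The twist is that I want to calculate the means of the three continuous variables only where the value of the corresponding dummy variable equals 1.
Reproducible example: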
a <- data.frame(c("fj","fj","fj","a","fj","a","g","g","g","g"),c(1,1,1,1,0,0,0,1,0,0),c(0,0,1,0,1,0,0,1,0,1),c(0,0,0,1,0,0,1,1,0,0),floor(runif(10, min = 10, max = 200)),floor(runif(10, min = 10, max = 200)),floor(runif(10, min = 10, max = 200)))
b <- data.frame(c("fj","a","fj","a","fj","fj","fj","g","g","g"),floor(runif(10, min = 0, max = 2)),floor(runif(10, min = 0, max = 2)),floor(runif(10, min = 0, max = 2)),floor(runif(10, min = 10, max = 200)),floor(runif(10, min = 10, max = 200)),floor(runif(10, min = 10, max = 200)))
c <- data.frame(c("fj","fj","fj","a","fj","a","g","g","g","g"),floor(runif(10, min = 0, max = 2)),floor(runif(10, min = 0, max = 2)),floor(runif(10, min = 0, max = 2)),floor(runif(10, min = 10, max = 200)),floor(runif(10, min = 10, max = 200)),floor(runif(10, min = 10, max = 200)))
u <- list(a,b,c)
u <- lapply(u, setNames, nm = c('g1','dummy1','dummy2','dummy3','contin1','contin2','contin3'))
u[[1]]
> u
[[1]]
g1 dummy1 dummy2 dummy3 contin1 contin2 contin3
1 fj 1 0 0 199 18 61
2 fj 1 0 0 91 158 28
3 fj 1 1 0 147 67 190
4 a 1 0 1 181 105 22
5 fj 0 1 0 14 16 156
6 a 0 0 0 178 14 98
7 g 0 0 1 116 97 30
8 g 1 1 1 48 31 144
9 g 0 0 0 60 21 112
10 g 0 1 0 95 145 199
I want to calculate the mean of contin1 only where dummy1 = 1, the mean of contin2 only where dummy2 = 1, and the mean of contin3 only where dummy3 = 1.
My desired output for the first list element:
> rates
[[1]]
x[, 1] V1 V2 V3 x[, 1] x[, 6] x[, 1] x[, 7] x[, 1] x[, 8]
1 a 0.50 0.0 0.5 a 181 a NA a 22
2 fj 0.75 0.5 0.0 fj 145.67 fj 41.5 fj NA
3 g 0.25 0.5 0.5 g 48 g 88 g 87
What I've tried:
rates <- lapply(u, function(x) {
cbind(aggregate(cbind(x[,2],x[,3],x[,4]) ~ x[,1], FUN = mean, na.action = NULL),
aggregate(x[,6] ~ x[,1], FUN = mean, na.action = NULL, subset = (x[,2] == 1)),
aggregate(x[,7] ~ x[,1], FUN = mean, na.action = NULL, subset = (x[,3] == 1)),
aggregate(x[,8] ~ x[,1], FUN = mean, na.action = NULL, subset = (x[,4] == 1)))
})
Error in data.frame(..., check.names = FALSE) :
arguments imply differing number of rows: 3, 2
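I understand that this error comes from cbind, because cbind fails whenever you try to cbind objects with different numbers of rows. (Column x[, 6] has three rows, while x[, 7] and x[, 8] have two.) I was hoping aggregate had some way to keep one row for each value of the grouping variable, which would give me the same number of rows everywhere so that cbind would work. Perhaps that is not possible according to the R documentation?: "Rows with missing values in any of the by variables will be omitted from the result."
I have read the documentation for aggregate carefully. The following two posts address similar questions, but neither uses different subsets of the data to calculate each mean:
R: Calculate means for subset of a group and Means from a list of data frames in R
Any suggestions would be much appreciated.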
If you have dplyr installed, the following code seems to solve your problem.
library(dplyr)
set.seed(1234)
a <- data.frame(c("fj","fj","fj","a","fj","a","g","g","g","g"),c(1,1,1,1,0,0,0,1,0,0),c(0,0,1,0,1,0,0,1,0,1),c(0,0,0,1,0,0,1,1,0,0),floor(runif(10, min = 10, max = 200)),floor(runif(10, min = 10, max = 200)),floor(runif(10, min = 10, max = 200)))
b <- data.frame(c("fj","a","fj","a","fj","fj","fj","g","g","g"),floor(runif(10, min = 0, max = 2)),floor(runif(10, min = 0, max = 2)),floor(runif(10, min = 0, max = 2)),floor(runif(10, min = 10, max = 200)),floor(runif(10, min = 10, max = 200)),floor(runif(10, min = 10, max = 200)))
c <- data.frame(c("fj","fj","fj","a","fj","a","g","g","g","g"),floor(runif(10, min = 0, max = 2)),floor(runif(10, min = 0, max = 2)),floor(runif(10, min = 0, max = 2)),floor(runif(10, min = 10, max = 200)),floor(runif(10, min = 10, max = 200)),floor(runif(10, min = 10, max = 200)))
u <- list(a,b,c)
u <- lapply(u, setNames, nm = c('g1','dummy1','dummy2','dummy3','contin1','contin2','contin3'))
rates <- lapply(u, function(x)
x %>%
mutate( contin1_ = ifelse(dummy1==1, contin1, NA) ) %>%
mutate( contin2_ = ifelse(dummy2==1, contin2, NA) ) %>%
mutate( contin3_ = ifelse(dummy3==1, contin3, NA) ) %>%
group_by(g1) %>%
summarize(
V1 = mean(dummy1, na.rm=TRUE),
V2 = mean(dummy2, na.rm=TRUE),
V3 = mean(dummy3, na.rm=TRUE),
mean1 = mean(contin1_, na.rm=TRUE),
mean2 = mean(contin2_, na.rm=TRUE),
mean3 = mean(contin3_, na.rm=TRUE)
)
)
print(rates[[1]])
This gives me:
Source: local data frame [3 x 7]
g1 V1 V2 V3 mean1 mean2 mean3
1 a 0.50 0.0 0.5 128.00000 NaN 17
2 fj 0.75 0.5 0.0 94.66667 64 NaN
3 g 0.25 0.5 0.5 54.00000 57 146
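The numbers I get look about right, with missing values (shown as NaN) in all the right places. Unfortunately, your example is not fully reproducible because you did not specify a seed for generating the random variables, so my runif gives me different values from yours.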
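Since the question's data were generated without a seed, one way to check the masking idea against the desired output is to type in the printed u[[1]] by hand. The sketch below does that in base R rather than dplyr (my substitution, only to avoid a package dependency). The names df, masked and out are illustrative and do not appear in the original posts.

```r
# u[[1]] exactly as printed in the question (typed in by hand, no random numbers)
df <- data.frame(
  g1      = c("fj","fj","fj","a","fj","a","g","g","g","g"),
  dummy1  = c(1,1,1,1,0,0,0,1,0,0),
  dummy2  = c(0,0,1,0,1,0,0,1,0,1),
  dummy3  = c(0,0,0,1,0,0,1,1,0,0),
  contin1 = c(199,91,147,181,14,178,116,48,60,95),
  contin2 = c(18,158,67,105,16,14,97,31,21,145),
  contin3 = c(61,28,190,22,156,98,30,144,112,199),
  stringsAsFactors = FALSE
)

# Blank out each continuous value whose matching dummy is not 1
masked <- df
for (i in 1:3) {
  dummy <- df[[paste0("dummy", i)]]
  masked[[paste0("contin", i)]][dummy != 1] <- NA
}

# The data-frame method of aggregate() passes NAs in the value columns
# through to FUN, so na.rm = TRUE drops the masked values group by group
out <- aggregate(masked[-1], by = list(g1 = masked$g1),
                 FUN = mean, na.rm = TRUE)
print(out)
```

For this data frame the conditional means come out as 181, 145.67, 48 for contin1; NaN, 41.5, 88 for contin2; and 22, NaN, 87 for contin3. That matches the desired output in the question, except that NaN appears where the question shows NA.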
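Another option is to reshape the data from 'wide' to 'long' format and convert back to 'wide' after getting the 'mean' values. For multiple value columns, melt and dcast from the development version of data.table, i.e. v1.9.5, can now be used. It can be installed from here. (Using the same dataset from @akhmed's post.)
We can melt the datasets in the list ('u') by specifying the indices of the columns ('dummy' and 'contin') in measure.vars as a list. Get the mean of the 'dummy' and 'contin' columns grouped by 'g1' and 'variable' (created by melt), then dcast from long to wide, specifying value.var as 'dummyMean' and 'continMean'.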
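library(data.table)  # provides setDT(), melt() and dcast()
res <- lapply(u, function(x) {
  # wide -> long: stack columns 2:4 into 'dummy' and 5:7 into 'contin'
  x1 <- melt(setDT(x), measure.vars=list(2:4,5:7),
             value.name=c('dummy', 'contin'))
  # group means; 'contin' is averaged only where dummy == 1
  x2 <- x1[, list(dummyMean = mean(dummy, na.rm=TRUE),
                  continMean = mean(contin[dummy==1], na.rm=TRUE)),
           by=list(g1, variable)]
  # long -> wide: one column per variable and statistic
  dcast(x2, g1~variable, value.var=c('dummyMean', 'continMean'))})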
res[[1]]
# g1 1_dummyMean 2_dummyMean 3_dummyMean 1_continMean 2_continMean
#1: a 0.50 0.0 0.5 128.00000 NaN
#2: fj 0.75 0.5 0.0 94.66667 64
#3: g 0.25 0.5 0.5 54.00000 57
# 3_continMean
#1: 17
#2: NaN
#3: 146
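For readers without data.table, the same wide-to-long-to-wide idea can be sketched in base R. This is only an illustration under my own assumptions: it uses the u[[1]] printed in the question (typed in by hand), and the names df, long, dummyMean and continMean are mine. A two-way tapply() stands in for dcast here, since it returns a g1-by-variable matrix directly.

```r
# u[[1]] exactly as printed in the question
df <- data.frame(
  g1      = c("fj","fj","fj","a","fj","a","g","g","g","g"),
  dummy1  = c(1,1,1,1,0,0,0,1,0,0),
  dummy2  = c(0,0,1,0,1,0,0,1,0,1),
  dummy3  = c(0,0,0,1,0,0,1,1,0,0),
  contin1 = c(199,91,147,181,14,178,116,48,60,95),
  contin2 = c(18,158,67,105,16,14,97,31,21,145),
  contin3 = c(61,28,190,22,156,98,30,144,112,199),
  stringsAsFactors = FALSE
)

# Wide -> long: stack the three dummy/contin pairs on top of each other.
# g1 and variable are factors so that empty cells are kept after subsetting.
long <- data.frame(
  g1       = factor(rep(df$g1, times = 3)),
  variable = factor(rep(1:3, each = nrow(df))),
  dummy    = unlist(df[paste0("dummy", 1:3)], use.names = FALSE),
  contin   = unlist(df[paste0("contin", 1:3)], use.names = FALSE)
)

# Long -> wide: group by g1 and variable at once
dummyMean <- tapply(long$dummy,
                    list(g1 = long$g1, variable = long$variable), mean)

# Conditional means: keep only the rows where the paired dummy is 1
keep <- long$dummy == 1
continMean <- tapply(long$contin[keep],
                     list(g1 = long$g1[keep], variable = long$variable[keep]),
                     mean)

print(dummyMean)
print(continMean)
```

Cells with no qualifying rows (for example group "a" under variable 2) come back as NA rather than being dropped, so both matrices keep one row per group.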
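Or a base R option using Map. Create functions 'fdummy' and 'fcontin' to subset the 'dummy' and 'contin' columns. Loop through 'u' (lapply(...)). Use Map to get the corresponding 'dummy' and 'contin' columns and, grouped by the 'g1' column, get the mean of 'dummy' and the mean of the 'contin' column for 'dummy==1' using tapply, then cbind the results.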
fdummy <- function(x) x[grep('dummy', names(x))]
fcontin <- function(x) x[grep('contin', names(x))]
res2 <- lapply(u, function(x) {
do.call(cbind.data.frame,
Map(function(x,y,z) cbind(tapply(x,z, FUN=mean),
tapply(y[x==1],z[x==1], FUN=mean)),
fdummy(x), fcontin(x), x['g1']))})
lapply(res2, setNames, c(rbind(paste0('dummyMean', 1:3),
paste0('continMean',1:3))))[[1]]
# dummyMean1 continMean1 dummyMean2 continMean2 dummyMean3 continMean3
#a 0.50 128.00000 0.0 NA 0.5 17
#fj 0.75 94.66667 0.5 64 0.0 NA
#g 0.25 54.00000 0.5 57 0.5 146
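One caveat with any tapply-on-a-subset approach: it only returns one entry per group when the grouping column is a factor. Since R 4.0.0, data.frame() no longer converts strings to factors by default, so on a current R installation 'g1' may be a character vector, and a group with no qualifying rows would then be dropped instead of showing up as NA. A minimal sketch of the difference, using the u[[1]] printed in the question (typed in by hand; the names df, keep, by_chr and by_fct are mine):

```r
# u[[1]] as printed in the question; only the columns needed here
df <- data.frame(
  g1      = c("fj","fj","fj","a","fj","a","g","g","g","g"),
  dummy2  = c(0,0,1,0,1,0,0,1,0,1),
  contin2 = c(18,158,67,105,16,14,97,31,21,145),
  stringsAsFactors = FALSE   # make g1 a character vector on every R version
)

keep <- df$dummy2 == 1       # no row of group "a" has dummy2 == 1

# Character grouping: the empty group "a" silently disappears
by_chr <- tapply(df$contin2[keep], df$g1[keep], FUN = mean)

# Factor grouping: the level "a" survives subsetting and is reported as NA
g <- factor(df$g1)
by_fct <- tapply(df$contin2[keep], g[keep], FUN = mean)

print(by_chr)
print(by_fct)
```

Converting 'g1' with factor() before grouping keeps every result at one entry per group, which is what cbind needs in order to line the columns up.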