R query - Is it possible to use "sapply" and the "weighted.mean" function together?
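I have been using code to run means for specific values of a variable (demographic breaks), but I now have data with a weight variable and need to calculate weighted means. I already have code that calculates sample means and would like to know whether the function can be changed or adapted to calculate weighted means. Here is some code to generate sample data: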
df <- data.frame(gender=c(2,2,1,1,2,2,1,1,1,1,1,1,2,2,2,2,1,2,2,1),
agegroup=c(2,2,7,5,5,5,2,7,2,2,4,4,4,3,4,5,3,3,6,6),
attitude_1=c(4,3,4,4,4,4,4,4,5,2,5,5,5,4,3,2,3,4,2,4),
attitude_2=c(4,4,1,3,4,2,4,5,5,5,5,4,5,4,3,3,4,4,4,4),
attitude_3=c(2,2,1,1,3,2,5,1,4,2,2,2,3,3,4,1,4,1,3,1),
income=c(40794,74579,62809,47280,72056,57908,70784,96742,66629,117530,79547,54110,39569,111217,109146,56421,106206,28385,85830,71110),
weight=c(1.77,1.89,2.29,6.14,2.07,5.03,0.73,1.60,1.95,2.56,5.41,2.02,6.87,3.23,3.01,4.68,3.42,2.75,2.31,4.04))
So far I have been using this code to get the sample means:
assign("Gender_Profile_1",
data.frame(sapply(subset(df, gender==1), FUN = function(x) mean(x, na.rm = TRUE))))
> Gender_Profile_1
sapply.subset.df..gender....1...FUN...function.x..mean.x..na.rm...TRUE..
gender 1.000
agegroup 4.200
attitude_1 4.000
attitude_2 4.000
attitude_3 2.300
income 77274.700
weight 3.016
As you can see, it creates Gender_Profile_1 with the means of all variables.
To calculate weighted means, I tried changing the "FUN=" part to this:
assign("Gender_Profile_1",
data.frame(sapply(subset(df, gender==1), FUN = function(x) weighted.mean(x, w=weight,na.rm = TRUE))))
I get the following error message:
Error in weighted.mean.default(x, w = weight, na.rm = TRUE) :
'x' and 'w' must have the same length
I have tried various permutations of df$weight and df$x, but nothing seems to work.
Any help or ideas would be great. Many thanks.
Base R
If you want to stick with base R, you can do the following:
# define func to return all weighted means
all_wmeans <- function(data_subset) {
# which cols to summarise? all but gender and weight
summ_cols <- setdiff(names(data_subset), c('gender', 'weight'))
# for each col, calc weighted mean with weights from the 'weight' column
result <- lapply(data_subset[, summ_cols],
weighted.mean, w=data_subset$weight)
# squeeze the resulting list back into a data.frame and return
return(data.frame(result))
}
# now, split the df on gender, and apply the func to each chunk
lapply(split(df, df$gender), all_wmeans)
The result is a list of two data frames, one for each value of gender:
$`1`
agegroup attitude_1 attitude_2 attitude_3 income
1 4.397546 4.027851 3.950597 1.962202 74985.25
$`2`
agegroup attitude_1 attitude_2 attitude_3 income
1 4.092234 3.642666 3.676287 2.388872 64075.23
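If a single table is more convenient than a list, the per-gender results can be stacked. The following is a minimal sketch that reuses the all_wmeans function from this answer; the names by_gender and combined are introduced here for illustration only.

```r
# Sample data from the question
df <- data.frame(gender=c(2,2,1,1,2,2,1,1,1,1,1,1,2,2,2,2,1,2,2,1),
                 agegroup=c(2,2,7,5,5,5,2,7,2,2,4,4,4,3,4,5,3,3,6,6),
                 attitude_1=c(4,3,4,4,4,4,4,4,5,2,5,5,5,4,3,2,3,4,2,4),
                 attitude_2=c(4,4,1,3,4,2,4,5,5,5,5,4,5,4,3,3,4,4,4,4),
                 attitude_3=c(2,2,1,1,3,2,5,1,4,2,2,2,3,3,4,1,4,1,3,1),
                 income=c(40794,74579,62809,47280,72056,57908,70784,96742,66629,117530,
                          79547,54110,39569,111217,109146,56421,106206,28385,85830,71110),
                 weight=c(1.77,1.89,2.29,6.14,2.07,5.03,0.73,1.60,1.95,2.56,
                          5.41,2.02,6.87,3.23,3.01,4.68,3.42,2.75,2.31,4.04))

# Same function as in the answer above
all_wmeans <- function(data_subset) {
  summ_cols <- setdiff(names(data_subset), c("gender", "weight"))
  data.frame(lapply(data_subset[summ_cols], weighted.mean, w = data_subset$weight))
}

# One one-row data frame per gender value
by_gender <- lapply(split(df, df$gender), all_wmeans)

# Stack the rows and keep the group label as an explicit column
combined <- cbind(gender = names(by_gender), do.call(rbind, by_gender))
print(combined)
```

The gender column is built from the list names, so it is a character vector ("1", "2") rather than numeric.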
data.table
If you don't mind using packages, dplyr and data.table are great packages that make this kind of thing much simpler. Here is the data.table version:
# load library and create a data.table object
library(data.table)
my_dt <- data.table(df)
# now it's a one liner:
my_dt[, lapply(.SD, weighted.mean, w=.SD$weight), by=gender]
Which returns:
gender agegroup attitude_1 attitude_2 attitude_3 income weight
1: 2 4.092234 3.642666 3.676287 2.388872 64075.23 4.099426
2: 1 4.397546 4.027851 3.950597 1.962202 74985.25 3.904483
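The data.table code also groups the rows by gender and uses lapply to apply the function and the extra argument to the Subset of Data (which is what .SD stands for). Conceptually it is the same as the base R code, just more compact and faster. Note that .SD still contains the weight column, so the output also includes the weight-weighted mean of weight itself.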
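The answer mentions dplyr without showing it. For comparison, here is a rough dplyr equivalent. It is a sketch rather than part of the original answer, and it assumes dplyr 1.0 or later is installed (for across()); the name wm_by_gender is made up for this example.

```r
library(dplyr)

# Sample data from the question
df <- data.frame(gender=c(2,2,1,1,2,2,1,1,1,1,1,1,2,2,2,2,1,2,2,1),
                 agegroup=c(2,2,7,5,5,5,2,7,2,2,4,4,4,3,4,5,3,3,6,6),
                 attitude_1=c(4,3,4,4,4,4,4,4,5,2,5,5,5,4,3,2,3,4,2,4),
                 attitude_2=c(4,4,1,3,4,2,4,5,5,5,5,4,5,4,3,3,4,4,4,4),
                 attitude_3=c(2,2,1,1,3,2,5,1,4,2,2,2,3,3,4,1,4,1,3,1),
                 income=c(40794,74579,62809,47280,72056,57908,70784,96742,66629,117530,
                          79547,54110,39569,111217,109146,56421,106206,28385,85830,71110),
                 weight=c(1.77,1.89,2.29,6.14,2.07,5.03,0.73,1.60,1.95,2.56,
                          5.41,2.02,6.87,3.23,3.01,4.68,3.42,2.75,2.31,4.04))

# Group by gender, then take the weighted mean of every column except weight.
# Inside the lambda, `weight` refers to the weights of the current group only.
wm_by_gender <- df %>%
  group_by(gender) %>%
  summarise(across(-weight, ~ weighted.mean(.x, w = weight)), .groups = "drop")

print(wm_by_gender)
```

Because the grouping column is excluded from across() automatically and weight is excluded explicitly, the result has one row per gender and only the five summarised variables.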
You can do it all at once like this:
sapply(1:2, function(y)
sapply(subset(df, df$gender == y), function(x)
weighted.mean(x, df$weight[df$gender == y])))
#> [,1] [,2]
#> gender 1.000000 2.000000
#> agegroup 4.397546 4.092234
#> attitude_1 4.027851 3.642666
#> attitude_2 3.950597 3.676287
#> attitude_3 1.962202 2.388872
#> income 74985.247679 64075.232966
#> weight 3.904483 4.099426
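I think the main problem with your code is that you refer to the weight column inside the sapply loop, but that column has not been subset the way df has. The full weight column has 20 values while each column of the gender == 1 subset has only 10, hence the length error. You can subset the weight column before calling sapply and then use the subset weights in the loop.
Using the code you posted: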
weight <- subset(df, gender==1)[,"weight"]
#Exactly the same code you posted
assign("Gender_Profile_2",
data.frame(sapply(subset(df, gender==1), FUN = function(x) weighted.mean(x, w=weight,na.rm = TRUE))))
Here is another solution using apply, which may be easier to implement:
#Apply the desired function by columns
apply(subset(df, gender==1), 2, FUN = function(x) mean(x, na.rm = TRUE))
#Get the weights of the rows that have gender == 1
weight <- subset(df, gender==1)[,7]
#Apply the weighted mean function
apply(subset(df[,-7], gender==1), 2, FUN = function(x) weighted.mean(x, w=weight,na.rm = TRUE))
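The length mismatch described above can be checked directly, and the subsetting of data and weights can be kept together in one place so that no global weight variable is needed. The helper below is a hypothetical sketch (weighted_profile and its arguments are names invented for this example, not part of any package).

```r
# Sample data from the question
df <- data.frame(gender=c(2,2,1,1,2,2,1,1,1,1,1,1,2,2,2,2,1,2,2,1),
                 agegroup=c(2,2,7,5,5,5,2,7,2,2,4,4,4,3,4,5,3,3,6,6),
                 attitude_1=c(4,3,4,4,4,4,4,4,5,2,5,5,5,4,3,2,3,4,2,4),
                 attitude_2=c(4,4,1,3,4,2,4,5,5,5,5,4,5,4,3,3,4,4,4,4),
                 attitude_3=c(2,2,1,1,3,2,5,1,4,2,2,2,3,3,4,1,4,1,3,1),
                 income=c(40794,74579,62809,47280,72056,57908,70784,96742,66629,117530,
                          79547,54110,39569,111217,109146,56421,106206,28385,85830,71110),
                 weight=c(1.77,1.89,2.29,6.14,2.07,5.03,0.73,1.60,1.95,2.56,
                          5.41,2.02,6.87,3.23,3.01,4.68,3.42,2.75,2.31,4.04))

# The cause of the error: 20 weights versus 10 rows in the subset
length(df$weight)              # 20
nrow(subset(df, gender == 1))  # 10

# Hypothetical helper: subset rows once, then use the weights of that same subset
weighted_profile <- function(data, group_value,
                             group_col = "gender", weight_col = "weight") {
  sub <- data[data[[group_col]] == group_value, , drop = FALSE]
  value_cols <- setdiff(names(sub), c(group_col, weight_col))
  sapply(sub[value_cols], weighted.mean, w = sub[[weight_col]], na.rm = TRUE)
}

Gender_Profile_1 <- data.frame(weighted_mean = weighted_profile(df, 1))
Gender_Profile_2 <- data.frame(weighted_mean = weighted_profile(df, 2))
print(Gender_Profile_1)
```

Note that na.rm = TRUE in weighted.mean only drops missing values in x (together with their weights); a missing value in the weights themselves still yields NA.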