Using ddply with weighted.mean in a for loop with dynamic variables
My dataset looks like this:
structure(list(GEOLEV2 = structure(c("768001001", "768001001",
"768001002", "768001002", "768001006", "768001006", "768001002",
"768001002", "768001002", "768001002", "768002016", "768002016"
), format.stata = "%9s"), DHSYEAR = structure(c(1988, 1988, 1988,
1988, 1998, 1998, 1998, 1998, 2013, 2013, 2013, 2013), format.stata = "%9.0g"),
v005 = structure(c(1e+06, 1e+06, 1e+06, 1e+06, 1815025, 1815025,
1517492, 1517492, 1350366, 1350366, 617033, 617033), format.stata = "%9.0g"),
age = structure(c(37, 22, 18, 46, 15, 29, 18, 42, 19, 15,
35, 16), format.stata = "%9.0g"), highest_year_edu = structure(c(2,
6, NA, NA, 5, NA, 2, 3, 2, NA, 5, 3), format.stata = "%9.0g")), row.names = c(NA,
-12L), class = c("tbl_df", "tbl", "data.frame"), label = "Written by R")
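I want to collapse it by df1$GEOLEV2 and df1$DHSYEAR, using weighted.mean as the collapse function. Each variable should keep its original name.
I chose the function ddply, and it worked when I tried it on a single variable: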
ddply(df1, ~ df1$GEOLEV2+ df1$DHSYEAR, summarise, age = weighted.mean(age, v005, na.rm = TRUE))
However, when I build a loop, the function returns an error. My attempt is:
df1_collapsed <- ddply(df1, ~ df1$GEOLEV2+ df1$DHSYEAR, summarise, age = weighted.mean(age, v005, na.rm = TRUE))
for (i in names(df1[4,5)) {
variable <- ddply(df1, ~ df1$GEOLEV2+ df1$DHSYEAR, summarise, i = weighted.mean(i, v005, na.rm = TRUE))
df1_collapsed <- left_join(df1_collapsed, variable, by = c("df1$GEOLEV2", "df1$DHSYEAR"))
}
The error is:
Error in weighted.mean.default(i, v005, na.rm = TRUE) :
'x' and 'w' must have the same length
How can I build the for loop so that the variable name is embedded in it?
In general, you don't need a loop in R to group and summarise (what you would call collapse in Stata). You can use dplyr for this kind of operation:
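library(dplyr)

df1 %>%
  group_by(GEOLEV2, DHSYEAR) %>%
  summarise(
    # Apply the same weighted mean (weights in v005) to every column from age to highest_year_edu
    across(age:highest_year_edu, ~ weighted.mean(.x, v005, na.rm = TRUE))
  )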
# A tibble: 6 x 4
# Groups:   GEOLEV2 [4]
#   GEOLEV2   DHSYEAR   age highest_year_edu
#   <chr>       <dbl> <dbl>            <dbl>
# 1 768001001    1988  29.5              4
# 2 768001002    1988  32              NaN
# 3 768001002    1998  30                2.5
# 4 768001002    2013  17                2
# 5 768001006    1998  22                5
# 6 768002016    2013  25.5              4
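As for why the loop fails: inside the loop, i holds a column name as a character string, so weighted.mean(i, v005, ...) receives a length-1 string as x and a whole group of weights as w, hence the length error. Writing i = ... also names the result column literally "i". If you do want to keep a loop, here is a minimal sketch of one way to look up and name the column from the string, using dplyr (1.0 or later) instead of ddply. The small df1 below is a cut-down stand-in for your data, not the full dataset.

```r
library(dplyr)

# Cut-down stand-in for the question's data (same columns, fewer rows).
df1 <- tibble(
  GEOLEV2 = c("768001001", "768001001", "768001002", "768001002"),
  DHSYEAR = c(1988, 1988, 1998, 1998),
  v005 = c(1e6, 1e6, 1517492, 1517492),
  age = c(37, 22, 18, 42),
  highest_year_edu = c(2, 6, 2, 3)
)

# Start from the distinct group keys, then join on one summarised column per variable.
df1_collapsed <- distinct(df1, GEOLEV2, DHSYEAR)

for (i in c("age", "highest_year_edu")) {
  variable <- df1 %>%
    group_by(GEOLEV2, DHSYEAR) %>%
    summarise(
      # .data[[i]] fetches the column whose name is stored in i;
      # "{i}" := ... names the output column after that same string.
      "{i}" := weighted.mean(.data[[i]], v005, na.rm = TRUE),
      .groups = "drop"
    )
  df1_collapsed <- left_join(df1_collapsed, variable, by = c("GEOLEV2", "DHSYEAR"))
}

df1_collapsed
```

This should give the same numbers as the across() version for the same groups. The across() call is still the shorter option, since it avoids both the loop and the repeated joins.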