Applying numcolwise to specified columns

Question

I am working with a large cross-country panel dataset. Here is a sample of my data:

df <- structure(list(country = c("Argentina", "Argentina", "Argentina", 
"Argentina", "Argentina", "Argentina", "Argentina", "Argentina", 
"Argentina", "Argentina", "Argentina", "Argentina", "Argentina", 
"Argentina", "Argentina", "Brazil", "Brazil", "Brazil", "Brazil", 
"Brazil", "Brazil", "Brazil", "Brazil", "Brazil", "Brazil", "Brazil", 
"Brazil", "Brazil", "Brazil", "Brazil"), year = c(1991, 1992, 
1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 
2004, 2005, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 
2000, 2001, 2002, 2003, 2004, 2005), lnunderval = c(-0.942018220566855, 
-0.885848248127534, -0.766349222095516, -0.690487190951407, -0.521023028925771, 
-0.288557433912095, -0.351488637772915, -0.393048184068511, -0.444123691025518, 
-0.512425182981147, -0.541182815398097, 0.379018666505875, 0.291852440172936, 
0.291407056285245, 0.221426753100227, -0.120418577004832, 0.00467960055625634, 
-0.0190735963658737, -0.239570582118898, -0.316748349307701, 
-0.205418347557874, -0.301707274202926, -0.346946676711871, -0.0528811487098006, 
-0.178001370772517, -0.0404491572081528, 0.0898307782259906, 
0.0835291098039626, 0.0349739055576117, -0.187321483795299), 
    manu_GDP = c(24.3864490932335, 21.8591315586603, 18.2399115325496, 
    17.8190917106899, 17.2467521148076, 17.5357232920479, 18.227905749866, 
    17.8379584760908, 16.9615250614589, 16.4942719439838, 16.0932258763829, 
    20.347773913878, 22.4867505875749, 18.9370136214371, 18.340415936715, 
    21.8391379495813, 23.3085986320751, 26.0497364463813, 23.7212337008806, 
    14.5422791544751, 13.0671912367218, 13.0186253732125, 12.1551371940101, 
    12.3085333305115, 13.134659593552, 13.0895379354001, 12.3569626673735, 
    14.4507645630532, 15.0995301563871, 14.7382811342998), income = c("Upper middle income", 
    "Upper middle income", "Upper middle income", "Upper middle income", 
    "Upper middle income", "Upper middle income", "Upper middle income", 
    "Upper middle income", "Upper middle income", "Upper middle income", 
    "Upper middle income", "Upper middle income", "Upper middle income", 
    "Upper middle income", "Upper middle income", "Upper middle income", 
    "Upper middle income", "Upper middle income", "Upper middle income", 
    "Upper middle income", "Upper middle income", "Upper middle income", 
    "Upper middle income", "Upper middle income", "Upper middle income", 
    "Upper middle income", "Upper middle income", "Upper middle income", 
    "Upper middle income", "Upper middle income"), period = structure(c(1L, 
    1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 1L, 
    1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L), .Label = c("(1990,1995]", 
    "(1995,2000]", "(2000,2005]"), class = "factor")), row.names = c(NA, 
-30L), class = c("tbl_df", "tbl", "data.frame"))

I used the cut and ddply functions (ddply is from plyr) to create five-year non-overlapping averages of my variables, as follows.

df$period <- cut(df$year, seq(1990, 2005, 5)) #this periodizes data
df <- ddply(df, .(country, period), numcolwise(mean))

The problem with this code is that the non-numeric column named income is dropped. I tried the following, but it did not work.

df <- ddply(df, .(country, period), numcolwise(mean,.(lnunderval, manu_GDP))) 
Error in mean.default(X[[i]], ...) : 'trim' must be numeric of length one

I want the final dataset to keep the non-numeric columns that are not averaged. Is there a way to apply numcolwise to a specified set of columns?

I would like the final output to look like this:

structure(list(country = c("Argentina", "Argentina", "Argentina", 
"Brazil", "Brazil", "Brazil"), period = structure(c(1L, 2L, 3L, 
1L, 2L, 3L), .Label = c("(1990,1995]", "(1995,2000]", "(2000,2005]"
), class = "factor"), year = c(1993, 1998, 2003, 1993, 1998, 
2003), lnunderval = c(-0.761145182133417, -0.397928625952037, 
0.128504420133237, -0.13822630084821, -0.216990963590998, -0.00388736948317731
), manu_GDP = c(19.9102672019882, 17.4114769046895, 19.2410359871976, 
21.8921971766787, 12.7368293456016, 13.9470152913027), income = c("Upper middle income", 
"Upper middle income", "Upper middle income", "Upper middle income", 
"Upper middle income", "Upper middle income")), class = "data.frame", row.names = c(NA, 
-6L))

Answer 1

We can use dplyr, which is more flexible: with across, summarise can apply different functions to different blocks of columns.

library(dplyr)
df %>%
  group_by(country, period) %>%
  summarise(year = last(year), income = list(unique(income[!is.na(income)])), 
    across(c(lnunderval, manu_GDP), mean), .groups = 'drop')
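Note that the code above keeps the last year of each period and stores income as a list column, whereas the desired output in the question shows the mean year and a plain character income. The following is a minimal sketch of two ways to get closer to that shape, not a definitive solution. It uses a small hypothetical data frame (toy, with a made-up numeric column x) instead of the question's data, and it assumes income is constant within each country. The plyr variant relies on the fact that numcolwise forwards its extra arguments to the summary function, which is why the quoted column list in the question ended up as mean's trim argument; colwise, by contrast, takes the column selection as its second argument.

```r
library(dplyr)

# Hypothetical toy panel: two countries, two periods, two years per period.
toy <- data.frame(
  country = rep(c("A", "B"), each = 4),
  year    = rep(c(1994, 1995, 1999, 2000), times = 2),
  x       = c(1, 3, 5, 7, 2, 4, 6, 8),
  income  = rep(c("High income", "Low income"), each = 4),
  stringsAsFactors = FALSE
)
toy$period <- cut(toy$year, seq(1990, 2000, 5))

# plyr: colwise() accepts the columns to process as a quoted list.
# Adding income to the grouping variables carries it through unchanged
# (assumption: income does not vary within a country/period group).
# plyr is called via plyr:: to avoid masking dplyr functions.
by_plyr <- plyr::ddply(
  toy,
  plyr::.(country, period, income),
  plyr::colwise(mean, plyr::.(year, x))
)

# dplyr: average the chosen numeric columns, keep one income value per group.
by_dplyr <- toy %>%
  group_by(country, period) %>%
  summarise(
    income = first(income),
    across(c(year, x), mean),
    .groups = "drop"
  )

print(by_plyr)
print(by_dplyr)
```

If income can differ within a group, grouping by it (the plyr variant) splits that group into several rows, while first(income) silently keeps only one value, so it is worth checking that assumption on the real data before choosing either form.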

r

plyr

dplyr