Using summarise and across in the dplyr package while distinguishing between numeric and non-numeric columns

Question

I want to use dplyr to perform some operations on a dataset like the following:

data <- data.frame(day = c(rep(1, 15), rep(2, 15)), nweek = rep(rep(1:5, 3),2), 
                   firm = rep(sapply(letters[1:3], function(x) rep(x, 5)), 2), 
                   quant = rnorm(30), price = runif(30) )

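Each observation is at the day, week, and firm level (a week has only 2 days).

I want to summarise the data (grouped by firm) by (1) taking the weekly mean of the numeric variables (i.e., quant and price), and (2) taking the first entry of the non-numeric variables. In this example the only non-numeric variable is firm, but my real dataset has several (Date and character), and they can change within a week (nweek), so I want the entry from the first day of the week for every non-numeric variable.

I tried using summarise and across but got an error: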

> data %>% group_by(firm, nweek) %>% dplyr::summarise(across(which(sapply(data, is.numeric)), ~ mean(.x, na.rm = TRUE)),
+                           across(which(sapply(data, !(is.numeric))), ~ head(.x, 1))
+ )
Error: Problem with `summarise()` input `..2`.
x invalid argument type
ℹ Input `..2` is `across(which(sapply(data, !(is.numeric))), ~head(.x, 1))`.
Run `rlang::last_error()` to see where the error occurred.

Any help?

Answer 1

I don't know what your expected output should look like, but something like this may achieve what you are after:

data %>%
  group_by(firm, nweek) %>% 
  summarise(
    across(where(is.numeric), ~ mean(.x, na.rm = TRUE)),
    across(!where(is.numeric), ~ head(.x, 1))
)

As a side note, rather than which(sapply(...)), take a look at the where helper, covered in the post on conditional selection of variables inside across.
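The error in the question comes from `!(is.numeric)`: it applies logical negation to the function object itself rather than to its result, and R defines no negation for functions, hence "invalid argument type". A minimal base R sketch of the difference is below; the names `err_msg`, `toy`, `not_numeric`, and `flags` are illustrative and not part of the original post.

```r
# Negating a function object (rather than its return value) raises an error.
err_msg <- tryCatch(
  { !(is.numeric); NA_character_ },
  error = function(e) conditionMessage(e)
)
print(err_msg)

# Negate() wraps a predicate and returns a new function that inverts its result.
not_numeric <- Negate(is.numeric)

# Hypothetical two-column data frame: one numeric, one non-numeric column.
toy <- data.frame(a = 1, b = "x")
flags <- sapply(toy, not_numeric)
print(flags)
```

Inside across, the tidyselect form `!where(is.numeric)` used in the answer does the same job without a helper function.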

Output

# A tibble: 15 x 5
# Groups:   firm [3]
   firm  nweek   day   quant price
   <chr> <int> <dbl>   <dbl> <dbl>
 1 a         1   1.5 -0.336  0.903
 2 a         2   1.5  0.0837 0.579
 3 a         3   1.5  0.0541 0.425
 4 a         4   1.5  1.21   0.555
 5 a         5   1.5  0.462  0.806
 6 b         1   1.5  0.0493 0.346
 7 b         2   1.5  0.635  0.596
 8 b         3   1.5  0.406  0.583
 9 b         4   1.5 -0.707  0.205
10 b         5   1.5  0.157  0.816
11 c         1   1.5  0.728  0.271
12 c         2   1.5  0.117  0.775
13 c         3   1.5 -1.05   0.234
14 c         4   1.5 -1.35   0.290
15 c         5   1.5  0.771  0.310
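Note that day is numeric, so it is averaged along with quant and price (hence 1.5 in every row). The example data has no non-numeric column other than the grouping variable, so the sketch below adds a hypothetical `date` column that changes within the week, to check that the first day's entry is the one kept. The data frame `toy`, the `date` values, and the result name `weekly` are assumptions made for illustration; sorting with arrange first is a precaution in case rows are not already ordered by day.

```r
library(dplyr)

set.seed(1)

# Hypothetical data: 3 firms, 1 week, 2 days; `date` differs between the two days.
toy <- data.frame(
  day   = rep(1:2, each = 3),
  nweek = 1L,
  firm  = rep(c("a", "b", "c"), 2),
  date  = as.Date("2021-01-04") + rep(0:1, each = 3),
  quant = rnorm(6)
)

weekly <- toy %>%
  arrange(firm, nweek, day) %>%          # make sure day 1 comes first in each group
  group_by(firm, nweek) %>%
  summarise(
    across(where(is.numeric), ~ mean(.x, na.rm = TRUE)),  # weekly means
    across(!where(is.numeric), first),                    # first day's entry
    .groups = "drop"
  )

print(weekly)
```

Grouping columns are left out of across automatically, so firm and nweek are not touched by either call.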


Tags: r, numeric, dplyr, across