Grouping and summarising when the columns returned are not known in advance

Question

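I have a data frame, say x, that is passed to a function which returns a subset depending on the value of the column x$id.

This subset, y, includes a column y$room that contains a different mix of values depending on the x$id value.

The subset is then spread with tidyr, so the values of y$room become columns.
The resulting extended data frame, say y_ext, must then be grouped by the column y_ext$visit, and summary statistics should be computed for the remaining columns by a special function.

The obvious problem is that these columns are not known in advance, so they cannot be referred to by name inside the function.

The alternative of using column indices instead of names does not seem to work with dplyr when group_by is involved.

Do you have any idea how this problem could be tackled?

The data frame has several thousand rows, so I will only give you a glimpse: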

    > tail(y)
          id visit        room value
    11940 14     2 living room    19
    11941 14     2 living room    16
    11942 14     2 living room    15
    11943 14     2 living room    22
    11944 14     2 living room    25
    11945 14     2 living room    20

    > unique(x$id)
    [1]  14  20  41  44  46  54  64  74 104 106
    > unique(x$visit)
    [1] 0 1 2
    > unique(x$room)
    [1] "bedroom"      "living room"  "family  room" "study room"   "den"
    [6] "tv room"      "office"       "hall"         "kitchen"      "dining room"
    > summary(x$value)
         Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
        2.000    2.750    7.875   17.410   16.000 1775.000

For a given id, tidyr's spread() returns only a subset of the room values in x. For example, for id = 54:

    > y <- out
    > y$row <- 1:nrow(y)
    > y_ext <- spread(y, room, value)
    > head(y_ext)
      id visit row bedroom family  room living room
    1 14     0   1    6.00           NA          NA
    2 14     0   2    6.00           NA          NA
    3 14     0   3    2.75           NA          NA
    4 14     0   4    2.75           NA          NA
    5 14     0   5    2.75           NA          NA
    6 14     0   6    2.75           NA          NA

Now I have to write a function that groups the results by visit and summarises the columns returned for each group, in the following form:

         visit    bedroom    family room   living room
      1   0         NA            2.79         3.25
      2   1         NA             NA          4.53
      3   2         4.19           3.77        NA

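As I mentioned above, I do not know in advance which columns will be returned for a given id, which complicates the problem. Of course, a shortcut would be to check which columns are returned for each id and then build an if structure that routes each id to the appropriate code, but I am afraid that is not very elegant.

I hope this helps give you a better picture.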

Answer 1

Well, this one was interesting to me, so I made some sample data of my own:

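library(dplyr)   # %>%, group_by(), summarise(), tibble()
library(tidyr)   # spread()

nSamples <- 50

allRooms <-
  c("Living", "Dining", "Bedroom", "Master", "Family", "Garage", "Office")

# Fix the seed so the example is repeatable. (R 3.6.0 changed the default
# algorithm behind sample(), so newer versions may give different values
# from the output shown below.)
set.seed(8675309)

# tibble() replaces the deprecated data_frame()
df <-
  tibble(
    id = sample(1:5, nSamples, TRUE)
    , visit = sample(1:3, nSamples, TRUE)
    , room = sample(allRooms, nSamples, TRUE)
    , value = round(rnorm(nSamples, 20, 5))
  )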

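As I see it, there are three approaches, listed here in increasing order of reasonableness. The first option is to follow your basic layout. Here I split df by id, spread as described, and then use summarise_all to take the sum, without ever naming the rooms explicitly.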

df %>%
  split(.$id) %>%
  lapply(function(x){
    x %>%
      select(-id) %>%
      mutate(row = 1:n()) %>%
      spread(room, value) %>%
      select(-row) %>%
      group_by(visit) %>%
      summarise_all(sum, na.rm = TRUE)
  })

This returns the following (note that each id gets its own set of columns):

$`1`
# A tibble: 3 × 6
  visit Bedroom Dining Garage Master Office
  <int>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1     1       0     27     27      0      0
2     2      22     19      0     20     23
3     3       0      0      0     27      0

$`2`
# A tibble: 3 × 6
  visit Bedroom Dining Family Living Office
  <int>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1     1      15      0      0      0     17
2     2       0     14     42     30      0
3     3      15     13     18      0     20

$`3`
# A tibble: 3 × 6
  visit Bedroom Dining Living Master Office
  <int>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1     1      24      0     36      0     28
2     2       0      0     15     30      0
3     3       0     25     21      0     15

$`4`
# A tibble: 3 × 7
  visit Bedroom Dining Garage Living Master Office
  <int>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1     1       0      0     23     20      0     24
2     2       0     28     22      0      0      0
3     3      24      0     36      0     16      0

$`5`
# A tibble: 3 × 8
  visit Bedroom Dining Family Garage Living Master Office
  <int>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1     1      23      0      0     21      0     16      0
2     2      44     14     41      0     26      0     18
3     3      21     19      0      0     25     19      0
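
As a hedged aside for newer package versions: summarise_all() has since been superseded by across() (this assumes dplyr 1.0.0 or later). The sketch below wraps the same steps in a hypothetical helper, summarise_rooms(), which takes the summary function as an argument so that a "special function" can be swapped in for sum. The toy data frame is made up for illustration and is not the data shown above.

```r
library(dplyr)
library(tidyr)

# Made-up toy data (an assumption for illustration only)
toy <- tibble(
  id    = c(1, 1, 1, 2, 2),
  visit = c(1, 1, 2, 1, 1),
  room  = c("Bedroom", "Bedroom", "Office", "Living", "Living"),
  value = c(10, 5, 7, 3, 4)
)

# Hypothetical helper: spread one id's rows, then summarise every
# non-grouping column with `fn`, whatever those columns happen to be.
summarise_rooms <- function(x, fn = sum) {
  x %>%
    select(-id) %>%
    mutate(row = row_number()) %>%   # unique key so spread() can work
    spread(room, value) %>%
    select(-row) %>%
    group_by(visit) %>%
    summarise(across(everything(), function(col) fn(col, na.rm = TRUE)))
}

# One summary table per id; the room columns differ between ids
per_id <- lapply(split(toy, toy$id), summarise_rooms)
```

Because across(everything()) only ever sees the non-grouping columns, the room names never need to appear in the code. Passing another function, for example lapply(split(toy, toy$id), summarise_rooms, fn = max), should work as long as that function accepts na.rm.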

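However, because you have to add a row index just to get spread to work (without it there are no unique entries), spread is not actually helping. If you summarise first, you can get the same thing more easily, like this: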

df %>%
  split(.$id) %>%
  lapply(function(x){
    x %>%
      select(-id) %>%
      group_by(visit, room) %>%
      summarise(Sum = sum(value)) %>%
      spread(room, Sum, 0)
  })

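Note that, because of the trailing 0 passed as the fill argument, this gives 0 for rooms that were not visited. If you would rather get NA, you can leave the default.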
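
For readers on newer tidyr, here is a minimal sketch of the same summarise-first idea using pivot_wider(), which has superseded spread(). It assumes tidyr 1.1.0 or later (for a scalar values_fill) and dplyr 1.0.0 or later (for .groups); the toy data is made up for illustration.

```r
library(dplyr)
library(tidyr)

# Made-up toy data (an assumption for illustration only)
toy <- tibble(
  id    = c(1, 1, 1, 2, 2),
  visit = c(1, 1, 2, 1, 1),
  room  = c("Bedroom", "Bedroom", "Office", "Living", "Living"),
  value = c(10, 5, 7, 3, 4)
)

wide_by_id <- toy %>%
  split(.$id) %>%
  lapply(function(x) {
    x %>%
      select(-id) %>%
      group_by(visit, room) %>%
      summarise(Sum = sum(value), .groups = "drop") %>%
      # values_fill = 0 plays the role of spread()'s fill argument;
      # drop it to get NA for rooms with no rows in a given visit
      pivot_wider(names_from = room, values_from = Sum, values_fill = 0)
  })
```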

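Finally, it is not clear why you want to do this separately for each id in the first place. It probably makes more sense to do it all in one big group_by and deal with the missing values afterwards as needed. That gives the same summaries with much less code: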

df %>%
  group_by(id, visit, room) %>%
  summarise(sum = sum(value)) %>%
  spread(room, sum)

gives

      id visit Bedroom Dining Family Garage Living Master Office
*  <int> <int>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1      1     1      NA     27     NA     27     NA     NA     NA
2      1     2      22     19     NA     NA     NA     20     23
3      1     3      NA     NA     NA     NA     NA     27     NA
4      2     1      15     NA     NA     NA     NA     NA     17
5      2     2      NA     14     42     NA     30     NA     NA
6      2     3      15     13     18     NA     NA     NA     20
7      3     1      24     NA     NA     NA     36     NA     28
8      3     2      NA     NA     NA     NA     15     30     NA
9      3     3      NA     25     NA     NA     21     NA     15
10     4     1      NA     NA     NA     23     20     NA     24
11     4     2      NA     28     NA     22     NA     NA     NA
12     4     3      24     NA     NA     36     NA     16     NA
13     5     1      23     NA     NA     21     NA     16     NA
14     5     2      44     14     41     NA     26     NA     18
15     5     3      21     19     NA     NA     25     19     NA

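If you want to narrow this down to a single id, use filter afterwards and then drop the columns whose entries are all NA. (Note that you could save the output once and then pass it through the last two lines once for each id of interest, e.g., when printing.)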

df %>%
  group_by(id, visit, room) %>%
  summarise(sum = sum(value)) %>%
  spread(room, sum) %>%
  filter(id == 1) %>%
  select_if(function(col) mean(is.na(col)) != 1)

gives

     id visit Bedroom Dining Garage Master Office
  <int> <int>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1     1     1      NA     27     27     NA     NA
2     1     2      22     19     NA     20     23
3     1     3      NA     NA     NA     27     NA
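
The "save the output once, then run each id through the last two lines" idea could be sketched roughly as follows. The helper name rooms_for_id() and the toy data are hypothetical, and select(where(...)) assumes dplyr 1.0.0 or later; on older versions the select_if() call above does the same job.

```r
library(dplyr)
library(tidyr)

# Made-up toy data (an assumption for illustration only)
toy <- tibble(
  id    = c(1, 1, 1, 2, 2),
  visit = c(1, 1, 2, 1, 1),
  room  = c("Bedroom", "Bedroom", "Office", "Living", "Living"),
  value = c(10, 5, 7, 3, 4)
)

# Compute the wide summary once for all ids
summary_wide <- toy %>%
  group_by(id, visit, room) %>%
  summarise(sum = sum(value), .groups = "drop") %>%
  spread(room, sum)

# Hypothetical helper: keep one id, then drop columns that are entirely NA
rooms_for_id <- function(wide, target_id) {
  wide %>%
    filter(id == target_id) %>%
    select(where(function(col) !all(is.na(col))))
}

# One trimmed table per id of interest
by_id <- lapply(unique(summary_wide$id),
                function(i) rooms_for_id(summary_wide, i))
```

Dropping the grouping with .groups = "drop" before spreading keeps the later select() from silently re-adding grouping columns, which is a design choice of this sketch rather than something the code above requires.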

