How to compute on columns pairwise, from two groups with dplyr

Question

I have a dataset shaped like this.

group   a1   a2   ...   a9   b1   b2 ... b7
1       1    0    ...   1    0    1  ... 1
1       1    1    ...   1    0    0  ... 1
1       0    0    ...   0    1    0  ... 1
1       1    1    ...   0    1    1  ... 0
2       1    0    ...   1    0    1  ... 1
2       1    1    ...   1    0    0  ... 1
2       0    0    ...   0    1    0  ... 1
2       1    1    ...   0    1    1  ... 0
...

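What I would like to do is apply a two-argument summary function to every pair of columns (one a column with one b column), while preserving the grouping of the data.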

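So, for example,

f = function(a, b) { mean(a) + mean(b) + mean(a & b) }

would return something like the following. (I have not actually computed the function's values; "x" is just a placeholder showing where the statistic goes, and of course it would differ for each group-a-b combination.)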

group a_col  b_col  stat
1     a1     b1     x
1     a1     b2     x
1     a1     b3     x
...
1     a9     b7     x
2     a1     b1     x
...
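
To make the "x" placeholder concrete, here is a minimal sketch of what f returns for a single pair of columns within one group. The vectors a and b are illustrative stand-ins for one a column and one b column; their values match the first four rows of a1 and b1 in the sample data further down.

```r
# Two-argument summary function from the question
f <- function(a, b) { mean(a) + mean(b) + mean(a & b) }

# Illustrative 0/1 vectors for one a column and one b column in one group
a <- c(0L, 1L, 1L, 1L)
b <- c(1L, 1L, 1L, 0L)

# mean(a) = 0.75, mean(b) = 0.75, mean(a & b) = 0.5
stat <- f(a, b)
stat
```

The desired result table holds one such value per group-a-b combination.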

A commenter asked for some sample data. Here it is:

structure(list(group = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 
3L, 3L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 6L, 
7L, 7L, 7L, 7L, 7L, 7L, 8L, 8L, 8L, 9L, 10L, 10L), a1 = c(0L, 
1L, 1L, 1L, 1L, 0L, 1L, 0L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 0L, 
1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 
1L, 0L, 0L, 0L), a2 = c(0L, 0L, 0L, 1L, 0L, 0L, 1L, 1L, 1L, 1L, 
0L, 0L, 1L, 1L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 1L, 
0L, 0L, 1L, 1L, 1L, 0L, 0L, 1L, 0L, 0L, 1L), a3 = c(1L, 1L, 1L, 
1L, 1L, 1L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 1L, 0L, 1L, 0L, 0L, 0L, 
1L, 0L, 0L, 1L, 1L, 0L, 0L, 1L, 1L, 0L, 1L, 0L, 0L, 1L, 1L, 1L, 
0L, 0L), a4 = c(0L, 0L, 1L, 0L, 0L, 1L, 1L, 0L, 1L, 1L, 0L, 1L, 
1L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 
0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 0L), a5 = c(1L, 0L, 0L, 0L, 0L, 
0L, 1L, 1L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 0L, 0L, 
0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L, 1L, 1L, 0L, 0L, 0L, 1L, 0L, 0L
), b1 = c(1L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 0L, 
0L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 0L, 
0L, 1L, 0L, 1L, 1L, 0L, 0L, 0L), b2 = c(0L, 0L, 1L, 0L, 0L, 0L, 
1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 0L, 1L, 0L, 1L, 1L, 1L, 1L, 
1L, 0L, 0L, 0L, 1L, 0L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 0L, 0L), 
    b3 = c(0L, 1L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 0L, 1L, 0L, 0L, 
    1L, 1L, 0L, 1L, 1L, 0L, 1L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 1L, 
    1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)), class = "data.frame", row.names = c(NA, 
-37L))

Answer 1

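A solution using the tidyverse. We can gather twice, once for the columns whose names start with "a" and once for those starting with "b", and then compute on the stacked values. Assume your data is called dat; dat2 is the final output.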

library(tidyverse)

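dat2 <- dat %>%
  # Stack the a columns into name/value pairs
  gather(column_a, value_a, starts_with("a")) %>%
  # Stack the b columns too, so each row pairs one a value with one b value
  gather(column_b, value_b, starts_with("b")) %>%
  # Compute one summary per group and column pair
  group_by(group, column_a, column_b) %>%
  summarise(stat = mean(value_a) + mean(value_b) + mean(value_a + value_b)) %>%
  ungroup()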
dat2
# # A tibble: 150 x 4
#    group column_a column_b  stat
#    <int> <chr>    <chr>    <dbl>
#  1     1 a1       b1         3  
#  2     1 a1       b2         2  
#  3     1 a1       b3         2  
#  4     1 a2       b1         2  
#  5     1 a2       b2         1  
#  6     1 a2       b3         1  
#  7     1 a3       b1         3.5
#  8     1 a3       b2         2.5
#  9     1 a3       b3         2.5
# 10     1 a4       b1         2  
# # ... with 140 more rows
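
The summarise() call in this answer uses mean(value_a + value_b) as its last term, while the f in the question uses mean(a & b), so the stat values shown above are not the ones f would give. As a hedged sketch, not the answerer's code, the same reshaping can call f directly. It uses tidyr's pivot_longer(), since gather() is marked as superseded in current tidyr. The names dat_small, res, a_col and b_col are illustrative; dat_small is just the first eight rows (groups 1 and 2) of the sample data, restricted to a1, a2, b1 and b2 to keep the example short.

```r
library(dplyr)
library(tidyr)

# Two-argument summary function from the question
f <- function(a, b) { mean(a) + mean(b) + mean(a & b) }

# Subset of the question's sample data: groups 1 and 2, four columns only
dat_small <- data.frame(
  group = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
  a1 = c(0L, 1L, 1L, 1L, 1L, 0L, 1L, 0L),
  a2 = c(0L, 0L, 0L, 1L, 0L, 0L, 1L, 1L),
  b1 = c(1L, 1L, 1L, 0L, 0L, 1L, 1L, 1L),
  b2 = c(0L, 0L, 1L, 0L, 0L, 0L, 1L, 1L)
)

res <- dat_small %>%
  # Stack the a columns, then the b columns, so that every row holds
  # one (a value, b value) pair taken from the same original row
  pivot_longer(starts_with("a"), names_to = "a_col", values_to = "value_a") %>%
  pivot_longer(starts_with("b"), names_to = "b_col", values_to = "value_b") %>%
  group_by(group, a_col, b_col) %>%
  # Apply the two-argument function once per group and column pair
  summarise(stat = f(value_a, value_b), .groups = "drop")
res
```

Because both reshaping steps keep values from the same original row together, f receives aligned vectors, which matters for the a & b term. On the full data the same pipeline should work with dat in place of dat_small.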

Tags: r, dplyr, purrr, pairwise