Aggregate sum of column values for all pairwise groupings of other columns in a dataframe in R

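I have been trying to aggregate the sum of one column in a data frame across all pairwise groupings of the other columns. My actual dataset is very large, but below is a dummy set that illustrates the problem. I would like to do this without repeating a lot of code to get each pairwise sum separately.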

library(tidyverse)
library(broom)

data <- data.frame(team= c('A','B','C','A','B', 'A'),
       height= c('tall', 'short', 'tall','short','tall','tall'),
       size= c('big','small','big','big','small','small'),
       cost= c(5,5,4,4,5,4))

lapply(1:ncol(data), function(i) aggregate(data$cost~., data[c(1, i)], sum)) 

# This gives the results below: only the first column (team) is grouped
# against each of the others, and cost is summed within each group:

[[1]]
  team team.1 data$cost
1     A       A        13
2     B       B        10
3     C       C         4

[[2]]
  team height data$cost
1     A  short         4
2     B  short         5
3     A   tall         9
4     B   tall         5
5     C   tall         4

[[3]]
  team  size data$cost
1     A   big         9
2     C   big         4
3     A small         4
4     B small        10

[[4]]
  team data$cost
1     A        13
2     B        10
3     C         4

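What I want to avoid is having to manually change the column number in the aggregate call (the 1 in data[c(1, i)]) to get the next set of pairwise groupings. Again, the actual data frame is much larger, so this would be tedious.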

I tried the following code, attempting to create a list of lists that I could then unnest:

all_comparisons <- lapply(1:ncol(data), function(i) aggregate(data$cost~., 
                                                       data[c(c(1:i), i)], sum))

huge_list_all_comparisons <- all_comparisons %>% bind_rows(all_comparisons) %>%    # make larger sample data
  mutate_if(is.list, simplify_all) %>%    # flatten each list element internally 
  unnest()  

> huge_list_all_comparisons
# A tibble: 40 × 8
   team team.1 `data$cost` height height.1 size  size.1 cost.1
   <chr> <chr>         <dbl> <chr>  <chr>    <chr> <chr>   <dbl>
 1 A     A                13 NA     NA       NA    NA         NA
 2 B     B                10 NA     NA       NA    NA         NA
 3 C     C                 4 NA     NA       NA    NA         NA
 4 A     NA                4 short  short    NA    NA         NA
 5 B     NA                5 short  short    NA    NA         NA
 6 A     NA                9 tall   tall     NA    NA         NA
 7 B     NA                5 tall   tall     NA    NA         NA
 8 C     NA                4 tall   tall     NA    NA         NA
 9 A     NA                4 short  NA       big   big        NA
10 A     NA                5 tall   NA       big   big        NA
# … with 30 more rows

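This returns the sum of cost for every possible grouping, not just the pairwise ones (on the actual dataset this would be prohibitive, producing more than a million rows of comparisons).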

I would appreciate any help with code that performs this pairwise grouped aggregation across the data frame.

You can use combn() to get the possible combinations of column indices, and then lapply() over them.

library(tidyverse)

data |> 
  seq_along() |> 
  combn(2, simplify = FALSE) |> 
  lapply(\(i) aggregate(data$cost~., data[c(i[1], i[2])], sum)) 
#> [[1]]
#>   team height data$cost
#> 1    A  short         4
#> 2    B  short         5
#> 3    A   tall         9
#> 4    B   tall         5
#> 5    C   tall         4
#> 
#> [[2]]
#>   team  size data$cost
#> 1    A   big         9
#> 2    C   big         4
#> 3    A small         4
#> 4    B small        10
#> 
#> [[3]]
#>   team data$cost
#> 1    A        13
#> 2    B        10
#> 3    C         4
#> 
#> [[4]]
#>   height  size data$cost
#> 1  short   big         4
#> 2   tall   big         9
#> 3  short small         5
#> 4   tall small         9
#> 
#> [[5]]
#>   height data$cost
#> 1  short         9
#> 2   tall        18
#> 
#> [[6]]
#>    size data$cost
#> 1   big        13
#> 2 small        14

Created on 2022-03-30 by the reprex package (v2.0.1)
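
In the output above, elements [[3]], [[5]] and [[6]] come from pairs that include the cost column itself, so they are single-column groupings rather than true pairs. If only pairs of grouping columns are wanted, one possible variation is to build the combinations from the column names with cost left out. This is only a sketch: it assumes every column other than cost is a grouping column, and the names group_cols, pairs and pairwise_sums are made up for illustration.

```r
# Same dummy data as in the question
data <- data.frame(
  team   = c("A", "B", "C", "A", "B", "A"),
  height = c("tall", "short", "tall", "short", "tall", "tall"),
  size   = c("big", "small", "big", "big", "small", "small"),
  cost   = c(5, 5, 4, 4, 5, 4)
)

# Assumption: every column other than `cost` is a grouping column
group_cols <- setdiff(names(data), "cost")

# All unordered pairs of grouping column names
pairs <- combn(group_cols, 2, simplify = FALSE)

# One aggregate() call per pair; reformulate() builds a formula
# such as cost ~ team + height from the two column names
pairwise_sums <- lapply(pairs, function(p) {
  aggregate(reformulate(p, response = "cost"), data = data, FUN = sum)
})

# Label each result with the pair it came from, e.g. "team_height"
names(pairwise_sums) <- vapply(pairs, paste, character(1), collapse = "_")
```

With the names attached, a single result can be pulled out as pairwise_sums$team_height, and the summed column is called cost rather than data$cost because the formula refers to the column by name.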