Aggregate sum of column values for all pairwise groupings of other columns in a dataframe in R
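I have been trying to aggregate the sum of one column in a data frame over all pairwise groupings of the other columns. My actual dataset is very large, but below is a dummy set that illustrates the problem. I would like to do this without repeating a lot of code to get each pairwise sum individually.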
library(tidyverse)
library(broom)
data <- data.frame(team= c('A','B','C','A','B', 'A'),
height= c('tall', 'short', 'tall','short','tall','tall'),
size= c('big','small','big','big','small','small'),
cost= c(5,5,4,4,5,4))
lapply(1:ncol(data), function(i) aggregate(data$cost~., data[c(1, i)], sum))
#This gives the results below grouping just first column (team) against
#the others and getting a sum :
[[1]]
team team.1 data$cost
1 A A 13
2 B B 10
3 C C 4
[[2]]
team height data$cost
1 A short 4
2 B short 5
3 A tall 9
4 B tall 5
5 C tall 4
[[3]]
team size data$cost
1 A big 9
2 C big 4
3 A small 4
4 B small 10
[[4]]
team data$cost
1 A 13
2 B 10
3 C 4
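What I want to avoid is having to manually replace the column number in the aggregate() call, indicated by data[c(1, i)], to get the next set of pairwise groupings. Again, the actual data frame is much larger, so this would be tedious.
I tried the following code, attempting to create a list of lists that I could unnest: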
all_comparisons <- lapply(1:ncol(data), function(i) aggregate(data$cost~.,
data[c(c(1:i), i)], sum))
huge_list_all_comparisons <- all_comparisons %>% bind_rows(all_comparisons) %>% # make larger sample data
mutate_if(is.list, simplify_all) %>% # flatten each list element internally
unnest()
>huge_list_all_comparisons
A tibble: 40 × 8
team team.1 `data$cost` height height.1 size size.1 cost.1
<chr> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl>
1 A A 13 NA NA NA NA NA
2 B B 10 NA NA NA NA NA
3 C C 4 NA NA NA NA NA
4 A NA 4 short short NA NA NA
5 B NA 5 short short NA NA NA
6 A NA 9 tall tall NA NA NA
7 B NA 5 tall tall NA NA NA
8 C NA 4 tall tall NA NA NA
9 A NA 4 short NA big big NA
10 A NA 5 tall NA big big NA
# … with 30 more rows
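This returns the sum of cost for every possible grouping, not just the pairwise ones (on the real dataset this would be prohibitive, resulting in over a million rows of comparisons).
I would appreciate any help with code I can use to perform this pairwise grouped aggregation across the data frame.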
You can use combn() to get the possible combinations of column indices, and then lapply() over them to run the aggregation for each pair.
library(tidyverse)
data |>
  seq_along() |>
  combn(2, simplify = FALSE) |>
  lapply(\(i) aggregate(data$cost~., data[c(i[1], i[2])], sum))
#> [[1]]
#> team height data$cost
#> 1 A short 4
#> 2 B short 5
#> 3 A tall 9
#> 4 B tall 5
#> 5 C tall 4
#>
#> [[2]]
#> team size data$cost
#> 1 A big 9
#> 2 C big 4
#> 3 A small 4
#> 4 B small 10
#>
#> [[3]]
#> team data$cost
#> 1 A 13
#> 2 B 10
#> 3 C 4
#>
#> [[4]]
#> height size data$cost
#> 1 short big 4
#> 2 tall big 9
#> 3 short small 5
#> 4 tall small 9
#>
#> [[5]]
#> height data$cost
#> 1 short 9
#> 2 tall 18
#>
#> [[6]]
#> size data$cost
#> 1 big 13
#> 2 small 14
Created on 2022-03-30 by the reprex package (v2.0.1)
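Because combn() above runs over every column index, the cost column is also paired with each grouping column, which is why elements [[3]], [[5]] and [[6]] are single-column sums rather than true pairs. If only the pairwise groupings are wanted, one possible variation is to build the pairs from the grouping column names alone. This is a minimal sketch using base R only; the names group_cols, pairs and pairwise_sums are illustrative and not part of the original answer.

```r
# Dummy data from the question
data <- data.frame(
  team   = c("A", "B", "C", "A", "B", "A"),
  height = c("tall", "short", "tall", "short", "tall", "tall"),
  size   = c("big", "small", "big", "big", "small", "small"),
  cost   = c(5, 5, 4, 4, 5, 4)
)

# Grouping columns are every column except the one being summed
group_cols <- setdiff(names(data), "cost")

# All unordered pairs of grouping column names, returned as a list
pairs <- combn(group_cols, 2, simplify = FALSE)

# For each pair, build a formula such as cost ~ team + height and aggregate
pairwise_sums <- lapply(pairs, function(p) {
  aggregate(reformulate(p, response = "cost"), data = data, FUN = sum)
})

# Name each result after its pair so it can be looked up directly
names(pairwise_sums) <- vapply(pairs, paste, character(1), collapse = "_")
```

With the dummy data this yields three data frames, accessible as pairwise_sums$team_height, pairwise_sums$team_size and pairwise_sums$height_size, each with the summed column named cost instead of data$cost.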