How to aggregate a data frame based on the max value of the group in R
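I have a large dataset with many groups that looks like this.
Within each group, I want to use the fruit with the highest count as the central fruit,
and aggregate the other fruits around it.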
library(tidyverse)
df <- tibble(col1 = c("apple","apple","pple", "banana", "banana","bananna"),
col2 = c("pple","app","app", "bananna", "banan", "banan"),
counts_col1 = c(100,100,2,200,200,2),
counts_col2 = c(2,50,50,2,20,20),
id=c(1,1,1,2,2,2))
df
#> # A tibble: 6 × 5
#> col1 col2 counts_col1 counts_col2 id
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 apple pple 100 2 1
#> 2 apple app 100 50 1
#> 3 pple app 2 50 1
#> 4 banana bananna 200 2 2
#> 5 banana banan 200 20 2
#> 6 bananna banan 2 20 2
Created on 2022-03-16 by the reprex package (v2.0.1)
I would like my data frame to look like this:
id central_fruit fruits counts sum_counts
1 apple apple,pple,app 100,50,2 152
2 banana banana,bananna,banan 200,20,2 222
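The output does not have to be in exactly this format; this is just an example. The fruits and counts columns could be list columns or plain character strings.
Any help or guidance is appreciated.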
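We can do this by first reshaping to 'long' format (pivot_longer), grouping by 'id' and 'grp', and creating a frequency count (add_count). Then, grouped by 'id', summarise the 'central_fruit' as the fruit with the max frequency; similarly, paste (toString) the unique fruits and the unique counts, and take the sum of the unique counts.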
library(dplyr)
library(stringr)
library(tidyr)
df %>%
  # prefix col1/col2 with "fruit_" so every column name reads <value>_<grp>
  rename_with(~ str_c("fruit_", .x), starts_with('col')) %>%
  pivot_longer(cols = -id, names_to = c(".value", "grp"),
               names_pattern = "(.*)_(col\\d+)") %>%
  group_by(id, grp) %>%
  add_count(fruit) %>%  # n = frequency of each fruit within 'id' and 'grp'
  group_by(id) %>%
  summarise(central_fruit = fruit[which.max(n)],
            fruits = toString(unique(fruit)),
            sum_counts = sum(unique(counts)),
            counts = toString(sort(unique(counts), decreasing = TRUE)),
            .groups = 'drop' ) %>%
  relocate(counts, .before = 'sum_counts')
-output
# A tibble: 2 × 5
id central_fruit fruits counts sum_counts
<dbl> <chr> <chr> <chr> <dbl>
1 1 apple apple, pple, app 100, 50, 2 152
2 2 banana banana, bananna, banan 200, 20, 2 222
Note: it may be better to wrap the values of 'counts' in a list rather than pasting them into a string. That is, instead of counts = toString(sort(unique(counts), decreasing = TRUE)), it would be
counts = list(sort(unique(counts), decreasing = TRUE))
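To illustrate why a list-column can be more convenient than a pasted string, here is a minimal sketch. The small `long` tibble is a hypothetical stand-in for the reshaped data (it is not part of the original answer), and `res` is just an illustrative name.

```r
library(dplyr)
library(tidyr)

# Hypothetical long-format stand-in for the reshaped data (illustrative values)
long <- tibble(
  id     = c(1, 1, 1, 2, 2, 2),
  counts = c(100, 2, 50, 200, 2, 20)
)

# Keep the sorted unique counts as a list-column instead of a pasted string
res <- long %>%
  group_by(id) %>%
  summarise(counts = list(sort(unique(counts), decreasing = TRUE)),
            .groups = "drop")

# Each cell is still a numeric vector, so numeric operations keep working
res %>% mutate(sum_counts = sapply(counts, sum))

# unnest() expands the list-column back to one row per value
res %>% unnest(counts)
```

With the `toString()` version, the values would first have to be split and converted back to numbers before any further computation.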
Using data.table, you could do:
Reprex
- Code
library(tidyverse) # to read your tibble
library(data.table)
setDT(df)[, .(central_fruit = col1[which.max(counts_col1)],
fruits = .(unique(c(col1, col2))),
counts = .(sort(unique(c(counts_col1, counts_col2)), decreasing = TRUE)),
sum_counts = unlist(lapply(.(unique(c(counts_col1, counts_col2))), sum))),
by = id]
- Output
#> id central_fruit fruits counts sum_counts
#> <num> <char> <list> <list> <num>
#> 1: 1 apple apple,pple,app 100, 50, 2 152
#> 2: 2 banana banana,bananna,banan 200, 20, 2 222
Created on 2022-03-16 by the reprex package (v2.0.1)
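Neither answer above ranks the fruits by their count across both columns at once: the first uses the frequency of each fruit name, and the second looks only at counts_col1. If "the fruit with the highest count" is meant literally, one possible sketch is to stack both fruit/count pairs and take the fruit with the largest count per id. This assumes each fruit has a single count within an id (true for the sample data); `long` and `res` are illustrative names, not part of either answer.

```r
library(dplyr)

df <- tibble(col1 = c("apple", "apple", "pple", "banana", "banana", "bananna"),
             col2 = c("pple", "app", "app", "bananna", "banan", "banan"),
             counts_col1 = c(100, 100, 2, 200, 200, 2),
             counts_col2 = c(2, 50, 50, 2, 20, 20),
             id = c(1, 1, 1, 2, 2, 2))

# Stack the (fruit, counts) pairs from both columns and drop repeated pairs
long <- bind_rows(
  df %>% select(id, fruit = col1, counts = counts_col1),
  df %>% select(id, fruit = col2, counts = counts_col2)
) %>%
  distinct()

res <- long %>%
  group_by(id) %>%
  # sort within each id so the fruit with the largest count comes first
  arrange(desc(counts), .by_group = TRUE) %>%
  summarise(central_fruit = first(fruit),
            fruits = toString(fruit),
            sum_counts = sum(counts),
            counts = toString(counts),
            .groups = "drop") %>%
  relocate(counts, .before = sum_counts)

res
```

Because the rows are sorted by count before summarising, the fruits and counts strings line up position by position.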