How to aggregate a data frame based on the max value of the group in R

Question

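I have a large data set with many groups that looks like the one below. Within each group, I want to use the fruit with the highest count as the central fruit and aggregate the other fruits around it.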

library(tidyverse)

df <- tibble(col1 = c("apple","apple","pple", "banana", "banana","bananna"),
             col2 = c("pple","app","app", "bananna", "banan", "banan"), 
             counts_col1 = c(100,100,2,200,200,2),
             counts_col2 = c(2,50,50,2,20,20),
             id=c(1,1,1,2,2,2))

df
#> # A tibble: 6 × 5
#>   col1    col2    counts_col1 counts_col2    id
#>   <chr>   <chr>         <dbl>       <dbl> <dbl>
#> 1 apple   pple            100           2     1
#> 2 apple   app             100          50     1
#> 3 pple    app               2          50     1
#> 4 banana  bananna         200           2     2
#> 5 banana  banan           200          20     2
#> 6 bananna banan             2          20     2

Created on 2022-03-16 by the reprex package (v2.0.1)

I would like my data frame to look like this:

id  central_fruit   fruits                 counts     sum_counts
 1     apple        apple,pple,app         100,50,2        152
 2    banana        banana,bananna,banan   200,20,2        222

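The output does not have to be in exactly this format; this is just an example. The columns can be list columns of characters or plain character strings. Any help or guidance is appreciated.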

Answer 1

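We can do this by first reshaping to 'long' format (pivot_longer), grouping by 'id' and 'grp', and creating a frequency count (add_count). Then, grouped by 'id', we summarise: 'central_fruit' is the fruit with the max frequency, 'fruits' is the unique fruits pasted together (toString), 'sum_counts' is the sum of the unique counts, and 'counts' is the unique counts pasted together in decreasing order.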

library(dplyr)
library(stringr)
library(tidyr)
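df %>%
   # prefix col1/col2 with "fruit_" so all four columns follow <value>_<grp>
   rename_with(~ str_c("fruit_", .x), starts_with('col')) %>% 
   # reshape to long: one 'fruit' and one 'counts' column, 'grp' = col1/col2
   pivot_longer(cols = -id, names_to = c(".value", "grp"), 
     names_pattern = "(.*)_(col\\d+)") %>% 
   group_by(id, grp) %>%
   # n = how often each fruit occurs within its id/grp
   add_count(fruit) %>%
   group_by(id) %>% 
   summarise(central_fruit = fruit[which.max(n)], 
      fruits = toString(unique(fruit)), 
      # sum_counts has to be computed before 'counts' is overwritten below
      sum_counts = sum(unique(counts)),
      counts = toString(sort(unique(counts), decreasing = TRUE)),
        .groups = 'drop' ) %>%
     relocate(counts, .before = 'sum_counts')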

-output

# A tibble: 2 × 5
     id central_fruit fruits                 counts     sum_counts
  <dbl> <chr>         <chr>                  <chr>           <dbl>
1     1 apple         apple, pple, app       100, 50, 2        152
2     2 banana        banana, bananna, banan 200, 20, 2        222
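Note that the code above picks 'central_fruit' by how often a fruit occurs, not by its count value, and that the pasted 'fruits' and 'counts' strings are not in the same order. The following is a minimal sketch of a variation, assuming that "central fruit" means the fruit with the largest count and that each fruit has a single count per id; the names `long` and `res` are only illustrative. It de-duplicates on the (id, fruit, counts) combination instead of on the bare count values, so two different fruits that happen to share a count would both be kept.

```r
library(dplyr)
library(stringr)
library(tidyr)
library(tibble)

df <- tibble(col1 = c("apple", "apple", "pple", "banana", "banana", "bananna"),
             col2 = c("pple", "app", "app", "bananna", "banan", "banan"),
             counts_col1 = c(100, 100, 2, 200, 200, 2),
             counts_col2 = c(2, 50, 50, 2, 20, 20),
             id = c(1, 1, 1, 2, 2, 2))

# Long format: one row per fruit occurrence, with its count
long <- df %>%
  rename_with(~ str_c("fruit_", .x), starts_with("col")) %>%
  pivot_longer(cols = -id, names_to = c(".value", "grp"),
               names_pattern = "(.*)_(col\\d+)")

res <- long %>%
  distinct(id, fruit, counts) %>%   # one row per fruit within each id
  arrange(id, desc(counts)) %>%     # largest count first within each id
  group_by(id) %>%
  summarise(central_fruit = first(fruit),
            fruits = toString(fruit),
            sum_counts = sum(counts),
            counts = toString(counts),
            .groups = "drop") %>%
  relocate(counts, .before = "sum_counts")

res
```

With this ordering, the n-th entry of 'fruits' corresponds to the n-th entry of 'counts'.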

NOTE: It may be better to wrap the values of 'counts' in a list instead of pasting them together, i.e. instead of counts = toString(sort(unique(counts), decreasing = TRUE)), it would be counts = list(sort(unique(counts), decreasing = TRUE))
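As a rough sketch of the list-column variant, the pipeline could look like the following. Wrapping 'fruits' in a list as well is an assumption on my part (the note only mentions 'counts'), and `res` is just an illustrative name.

```r
library(dplyr)
library(stringr)
library(tidyr)
library(tibble)

df <- tibble(col1 = c("apple", "apple", "pple", "banana", "banana", "bananna"),
             col2 = c("pple", "app", "app", "bananna", "banan", "banan"),
             counts_col1 = c(100, 100, 2, 200, 200, 2),
             counts_col2 = c(2, 50, 50, 2, 20, 20),
             id = c(1, 1, 1, 2, 2, 2))

res <- df %>%
  rename_with(~ str_c("fruit_", .x), starts_with("col")) %>%
  pivot_longer(cols = -id, names_to = c(".value", "grp"),
               names_pattern = "(.*)_(col\\d+)") %>%
  group_by(id, grp) %>%
  add_count(fruit) %>%
  group_by(id) %>%
  summarise(central_fruit = fruit[which.max(n)],
            fruits = list(unique(fruit)),   # list column of character vectors
            sum_counts = sum(unique(counts)),
            counts = list(sort(unique(counts), decreasing = TRUE)),  # list column of numeric vectors
            .groups = "drop") %>%
  relocate(counts, .before = "sum_counts")

# The values stay numeric, so they can be used directly:
res$counts[[1]]          # counts for id 1 as a numeric vector
sapply(res$counts, max)  # largest count per id
```

Keeping the counts numeric avoids having to split and re-parse a pasted string later.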

Answer 2

Using data.table, you can do the following:

Reprex

Code

library(tidyverse) # to read your tibble
library(data.table)

setDT(df)[, .(central_fruit = col1[which.max(counts_col1)],
              fruits = .(unique(c(col1, col2))),
              counts = .(sort(unique(c(counts_col1, counts_col2)), decreasing = TRUE)),
              sum_counts = unlist(lapply(.(unique(c(counts_col1, counts_col2))), sum))), 
          by = id]

Output

#>       id central_fruit               fruits      counts sum_counts
#>    <num>        <char>               <list>      <list>      <num>
#> 1:     1         apple       apple,pple,app 100, 50,  2        152
#> 2:     2        banana banana,bananna,banan 200, 20,  2        222

Created on 2022-03-16 by the reprex package (v2.0.1)
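For comparison, the reshape-then-summarise idea from Answer 1 can also be written in data.table with melt(). This is only a sketch under a few assumptions: the table is built directly with data.table() rather than converted from a tibble, the results are pasted into strings instead of list columns, and the names `dt`, `long` and `out` are illustrative.

```r
library(data.table)

dt <- data.table(col1 = c("apple", "apple", "pple", "banana", "banana", "bananna"),
                 col2 = c("pple", "app", "app", "bananna", "banan", "banan"),
                 counts_col1 = c(100, 100, 2, 200, 200, 2),
                 counts_col2 = c(2, 50, 50, 2, 20, 20),
                 id = c(1, 1, 1, 2, 2, 2))

# Melt both column families at once: "^col" -> fruit, "^counts_col" -> counts
long <- melt(dt, id.vars = "id",
             measure.vars = patterns("^col", "^counts_col"),
             value.name = c("fruit", "counts"))

out <- long[, .(central_fruit = fruit[which.max(counts)],
                fruits = toString(unique(fruit)),
                counts = toString(sort(unique(counts), decreasing = TRUE)),
                sum_counts = sum(unique(counts))),
            by = id]

out
```

Working on the long table means the fruit and its count sit in the same row, so which.max(counts) considers both original count columns rather than only counts_col1.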
