Tidyverse: filtering n largest groups in grouped dataframe
I want to filter the n largest groups by count, and then do some calculations on the filtered dataframe.

Here is some data:

    Brand <- c("A","B","C","A","A","B","A","A","B","C")
    Category <- c(1,2,1,1,2,1,2,1,2,1)
    Clicks <- c(10,11,12,13,14,15,14,13,12,11)
    df <- data.frame(Brand, Category, Clicks)

|Brand | Category| Clicks|
|:-----|--------:|------:|
|A | 1| 10|
|B | 2| 11|
|C | 1| 12|
|A | 1| 13|
|A | 2| 14|
|B | 1| 15|
|A | 2| 14|
|A | 1| 13|
|B | 2| 12|
|C | 1| 11|
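
This is my expected output. I want to keep the two largest brands by row count, and then find the mean clicks for each brand/category combination:
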
|Brand | Category| mean_clicks|
|:-----|--------:|-----------:|
|A | 1| 12.0|
|A | 2| 14.0|
|B | 1| 15.0|
|B | 2| 11.5|

I thought this could be achieved with code like the following (but it can't):

    df %>%
      group_by(Brand, Category) %>%
      top_n(2, Brand) %>% # Largest 2 brands by count
      summarise(mean_clicks = mean(Clicks))

Edit: ideally the answer should work on database tables as well as local tables.
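
A likely reason the attempt in the question falls short, sketched under the assumption that `top_n()` ranks rows within each group: inside a `(Brand, Category)` group every row has the same `Brand`, so all rows tie and nothing is dropped. The name `attempt` is only illustrative and not part of the question.

```r
library(dplyr)

Brand <- c("A","B","C","A","A","B","A","A","B","C")
Category <- c(1,2,1,1,2,1,2,1,2,1)
Clicks <- c(10,11,12,13,14,15,14,13,12,11)
df <- data.frame(Brand, Category, Clicks, stringsAsFactors = FALSE)

# Same pipeline as in the question; .groups = "drop" only silences the
# grouping message in dplyr >= 1.0.0
attempt <- df %>%
  group_by(Brand, Category) %>%
  top_n(2, Brand) %>%   # ranks Brand *within* each group, so every row ties
  summarise(mean_clicks = mean(Clicks), .groups = "drop")

# Brand C has the fewest rows but is still present in the result
unique(attempt$Brand)
```

So the filtering on brand size has to happen on per-brand counts before grouping by `Brand` and `Category`.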
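
Edit

Based on the updated question, we can first add a count column, keep only the rows whose count is among the top n group counts, then group_by Brand and Category and find the mean of each group.

    library(dplyr)

    df %>%
      add_count(Brand, sort = TRUE) %>%        # n = number of rows per Brand
      filter(n %in% head(unique(n), 2)) %>%    # keep the 2 largest counts
      group_by(Brand, Category) %>%
      summarise(mean_clicks = mean(Clicks))

    #   Brand Category mean_clicks
    #   <fct>    <dbl>       <dbl>
    # 1 A            1        12
    # 2 A            2        14
    # 3 B            1        15
    # 4 B            2        11.5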
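
One thing worth noting about the filter above: it compares count values, so every brand whose count equals one of the two largest counts is kept. A small sketch of that behaviour, using a hypothetical extra brand `D` (not in the question's data) that ties with `B`; the names `df_tied` and `kept` are illustrative:

```r
library(dplyr)

df <- data.frame(
  Brand    = c("A","B","C","A","A","B","A","A","B","C"),
  Category = c(1,2,1,1,2,1,2,1,2,1),
  Clicks   = c(10,11,12,13,14,15,14,13,12,11),
  stringsAsFactors = FALSE
)

# Hypothetical rows: brand D gets 3 rows, the same count as brand B
df_tied <- rbind(
  df,
  data.frame(Brand = "D", Category = c(1, 1, 2), Clicks = c(1, 2, 3),
             stringsAsFactors = FALSE)
)

kept <- df_tied %>%
  add_count(Brand, sort = TRUE) %>%
  filter(n %in% head(unique(n), 2)) %>%
  distinct(Brand) %>%
  pull(Brand)

kept  # three brands survive, because B and D share the second-largest count
```

Whether keeping tied brands is desirable depends on the use case.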
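
Original answer

We can group_by Brand, do all the calculations per group, and then keep the top groups with top_n:

    library(dplyr)

    df %>%
      group_by(Brand) %>%
      summarise(n = n(),
                mean = mean(Clicks)) %>%
      top_n(2, n) %>%   # keep the 2 brands with the most rows
      select(-n)

    #   Brand  mean
    #   <fct> <dbl>
    # 1 A      12.8
    # 2 B      12.7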

A data.table idea is to count the rows per Brand and keep the top two (after sorting in descending order). We then join back to the original data frame and compute the mean grouped by (Brand, Category).

    library(data.table)

    # Convert to data.table (setDT converts df in place, by reference)
    dt1 <- setDT(df)

    dt1[dt1[, .(cnt = .N), by = Brand][
          order(cnt, decreasing = TRUE), .SD[1:2]][, cnt := NULL],
        on = 'Brand'][, .(means = mean(Clicks)), by = .(Brand, Category)][]

which gives,

       Brand Category means
    1:     A        1  12.0
    2:     A        2  14.0
    3:     B        2  11.5
    4:     B        1  15.0
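
A possibly more readable variant of the same idea, as a sketch only: it copies the data with `as.data.table()` so that `df` itself is left untouched, and filters with `%in%` instead of a join. The names `dt`, `top` and `res` are illustrative, and `keyby` is used just to get a sorted result.

```r
library(data.table)

df <- data.frame(
  Brand    = c("A","B","C","A","A","B","A","A","B","C"),
  Category = c(1,2,1,1,2,1,2,1,2,1),
  Clicks   = c(10,11,12,13,14,15,14,13,12,11),
  stringsAsFactors = FALSE
)

dt <- as.data.table(df)  # a copy; df stays a plain data.frame

# Count rows per Brand, sort by count (descending), take the first two names
top <- dt[, .N, by = Brand][order(-N), head(Brand, 2)]

# Keep only those brands, then average Clicks per (Brand, Category)
res <- dt[Brand %in% top,
          .(means = mean(Clicks)),
          keyby = .(Brand, Category)]
res
```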

A different dplyr solution:

    library(dplyr)

    df %>%
      group_by(Brand) %>%
      mutate(n = n()) %>%                       # rows per Brand
      ungroup() %>%
      mutate(rank = dense_rank(desc(n))) %>%    # 1 = largest brand
      filter(rank == 1 | rank == 2) %>%
      group_by(Brand, Category) %>%
      summarise(mean_clicks = mean(Clicks))

    # A tibble: 4 x 3
    # Groups:   Brand [?]
    #   Brand Category mean_clicks
    #   <fct>    <dbl>       <dbl>
    # 1 A           1.        12.0
    # 2 A           2.        14.0
    # 3 B           1.        15.0
    # 4 B           2.        11.5

Or a simplified version (based on @camille's suggestion):

    df %>%
      group_by(Brand) %>%
      mutate(n = n()) %>%
      ungroup() %>%
      filter(dense_rank(desc(n)) < 3) %>%   # ranks 1 and 2 only
      group_by(Brand, Category) %>%
      summarise(mean_clicks = mean(Clicks))

How about this approach, using table from base R:

    library(dplyr)

    df %>%
      # sort() is ascending, so tail(..., 2) holds the two largest brands
      filter(Brand %in% names(tail(sort(table(Brand)), 2))) %>%
      group_by(Brand, Category) %>%
      summarise(mean_clicks = mean(Clicks))

    # A tibble: 4 x 3
    # Groups:   Brand [?]
    #   Brand Category mean_clicks
    #   <chr>    <dbl>       <dbl>
    # 1 A         1.00        12.0
    # 2 A         2.00        14.0
    # 3 B         1.00        15.0
    # 4 B         2.00        11.5

Another dplyr solution, using a join to filter the data frame:

    library(dplyr)

    df %>%
      group_by(Brand) %>%
      summarise(n = n()) %>%
      top_n(2, n) %>%                    # select the top 2 Brands by count
      left_join(df, by = "Brand") %>%    # keeps only the rows of those Brands
      group_by(Brand, Category) %>%
      summarise(mean_clicks = mean(Clicks))

    # # A tibble: 4 x 3
    # # Groups:   Brand [?]
    #   Brand Category mean_clicks
    #   <fct>    <dbl>       <dbl>
    # 1 A            1        12
    # 2 A            2        14
    # 3 B            1        15
    # 4 B            2        11.5
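
Slightly different from the above, only because I don't like using joins on large datasets. Some people may not like that I create and then remove a small data frame, sorry :(

    library(dplyr)

    # Small helper table holding the two largest brands
    Top2 <- df %>% count(Brand) %>% top_n(2, n)

    df %>%
      group_by(Brand, Category) %>%
      filter(Brand %in% Top2$Brand) %>%
      summarise(mean_clicks = mean(Clicks))

    remove(Top2)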
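
For newer dplyr versions, a sketch of the same idea with `slice_max()` and a filtering join, assuming dplyr >= 1.0.0 (where `top_n()` is superseded by `slice_max()`). The names `top_brands` and `result` are illustrative, and `with_ties = FALSE` is an assumption about wanting exactly two brands even when counts tie. Whether each step translates to SQL for database tables depends on the dbplyr version and backend, which is not verified here.

```r
library(dplyr)

df <- data.frame(
  Brand    = c("A","B","C","A","A","B","A","A","B","C"),
  Category = c(1,2,1,1,2,1,2,1,2,1),
  Clicks   = c(10,11,12,13,14,15,14,13,12,11),
  stringsAsFactors = FALSE
)

# One row per Brand with its row count n, then the 2 largest counts
top_brands <- df %>%
  count(Brand) %>%
  slice_max(n, n = 2, with_ties = FALSE)

# semi_join() keeps the rows of df that have a match in top_brands,
# without adding any columns from it
result <- df %>%
  semi_join(top_brands, by = "Brand") %>%
  group_by(Brand, Category) %>%
  summarise(mean_clicks = mean(Clicks), .groups = "drop")

result
```

Compared with the `left_join()` version, `semi_join()` only filters, so the helper column `n` never reaches the final result.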