Is there a dplyr function to determine the most commonly encountered categorical value within a group?
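I want to use dplyr to summarise a data frame of customer transactions into one row per customer. For continuous variables this is simple: use sum / mean etc. For categorical variables I would like to pick the "Mode", i.e. the most commonly encountered value within the group, and do this across multiple columns.
For example, take the table Cus: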
Cus <- data.frame(
  Customer = c("C-01", "C-01", "C-02", "C-02", "C-02", "C-02", "C-03", "C-03"),
  Product = c("COKE", "COKE", "FRIES", "SHAKE", "BURGER", "BURGER", "CHICKEN", "FISH"),
  Store = c("NYC", "NYC", "Chicago", "Chicago", "Detroit", "Detroit", "LA", "San Fran")
)
and generate the table Cus_Summary:
Cus_Summary <- data.frame(
  Customer = c("C-01", "C-02", "C-03"),
  Product = c("COKE", "BURGER", "CHICKEN"),
  Store = c("NYC", "Chicago", "LA")
)
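Is there a package that provides this functionality? Or does anyone have a function that can be applied across multiple columns within a single dplyr step?
I'm not worried about clever ways of handling ties: any output for ties will suffice (although any suggestions on how best to handle ties would be interesting and appreciated).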
How about this?
library(dplyr)
Cus %>%
  group_by(Customer) %>%
  summarise(
    # tabulate each column, sort by count, and keep the name of the top entry
    Product = first(names(sort(table(Product), decreasing = TRUE))),
    Store = first(names(sort(table(Store), decreasing = TRUE))))
## A tibble: 3 x 3
# Customer Product Store
# <fct> <chr> <chr>
#1 C-01 COKE NYC
#2 C-02 BURGER Chicago
#3 C-03 CHICKEN LA
Note that in the case of ties, this selects the entry that comes first in alphabetical order.
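To make the tie-breaking note concrete, here is a small base R sketch. The helper name first_mode is hypothetical (it is not part of the answer); it only wraps the same sort(table(x)) expression. As far as I can tell, tied counts keep the order of the table's names, which is alphabetical for character input and level order for factors.

```r
# Hypothetical helper wrapping the expression used in the answer above.
first_mode <- function(x) {
  names(sort(table(x), decreasing = TRUE))[1]
}

x <- c("LA", "San Fran", "Chicago", "Chicago", "LA")

# "Chicago" and "LA" both occur twice; the alphabetically first one is returned.
first_mode(x)

# With a factor, the tie is broken by level order instead.
x_fct <- factor(x, levels = c("LA", "Chicago", "San Fran"))
first_mode(x_fct)
```

If a specific tie order matters, setting the factor levels explicitly before summarising is one way to control it.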
Update
To select randomly among ties, we can define a custom function:
top_random <- function(x) {
  tbl <- sort(table(x), decreasing = TRUE)
  # keep every entry tied for the highest count
  top <- tbl[tbl == max(tbl)]
  return(sample(names(top), 1))
}
The following then randomly selects one of the entries tied for the top count:
Cus %>%
  group_by(Customer) %>%
  summarise(
    Product = top_random(Product),
    Store = top_random(Store))
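Because top_random draws with sample(), repeated runs can return different values for tied groups. A minimal sketch of making the draw reproducible, assuming that fixing the seed with set.seed() is acceptable for your use case (the seed value 42 is arbitrary):

```r
# Same function as above, repeated so the example is self-contained.
top_random <- function(x) {
  tbl <- sort(table(x), decreasing = TRUE)
  top <- tbl[tbl == max(tbl)]
  sample(names(top), 1)
}

products <- c("CHICKEN", "FISH")  # customer C-03: both values occur once

set.seed(42)
first_draw <- top_random(products)

set.seed(42)
second_draw <- top_random(products)

# Both draws agree because the seed was reset before each call.
identical(first_draw, second_draw)
```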
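If you have many columns and want the most frequent value in each of them, you can use gather to convert the data to long format, count the occurrences of each value per column, group_by Customer and column and keep only the row with the maximum count, and then spread the result back to wide format.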
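library(tidyverse)
Cus %>%
  gather(key, value, -Customer) %>%
  count(Customer, key, value) %>%
  group_by(Customer, key) %>%
  slice(which.max(n)) %>%
  ungroup() %>%
  # drop n before spreading; otherwise columns with different top counts
  # end up on separate rows padded with NA
  select(-n) %>%
  spread(key, value)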
# Customer Product Store
# <fct> <chr> <chr>
#1 C-01 COKE NYC
#2 C-02 BURGER Chicago
#3 C-03 CHICKEN LA
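In current tidyr, gather and spread are superseded by pivot_longer and pivot_wider. The following is a sketch of the same long-format approach with the newer functions, not the answerer's original code. It assumes the columns are character vectors (stringsAsFactors = FALSE) and drops n before widening so that each customer stays on a single row.

```r
library(dplyr)
library(tidyr)

Cus <- data.frame(
  Customer = c("C-01", "C-01", "C-02", "C-02", "C-02", "C-02", "C-03", "C-03"),
  Product = c("COKE", "COKE", "FRIES", "SHAKE", "BURGER", "BURGER", "CHICKEN", "FISH"),
  Store = c("NYC", "NYC", "Chicago", "Chicago", "Detroit", "Detroit", "LA", "San Fran"),
  stringsAsFactors = FALSE
)

Cus_mode <- Cus %>%
  # long format: one row per (Customer, column name, value)
  pivot_longer(-Customer, names_to = "key", values_to = "value") %>%
  count(Customer, key, value) %>%
  group_by(Customer, key) %>%
  # keep the first row with the highest count in each group
  slice(which.max(n)) %>%
  ungroup() %>%
  select(-n) %>%
  # back to wide format: one column per original variable
  pivot_wider(names_from = key, values_from = value)

Cus_mode
```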
Edit
If we want to select randomly among ties, we can filter all rows that have the max count and then use the sample_n function to pick a random row.
Cus %>%
  gather(key, value, -Customer) %>%
  count(Customer, key, value) %>%
  group_by(Customer, key) %>%
  filter(n == max(n)) %>%
  sample_n(1) %>%
  ungroup() %>%
  # drop n before spreading so each customer stays on one row
  select(-n) %>%
  spread(key, value)
# Customer Product Store
# <fct> <chr> <chr>
#1 C-01 COKE NYC
#2 C-02 BURGER Chicago
#3 C-03 FISH San Fran
In my solution, if there are several most frequent values, all of them are shown:
library(tidyverse)
Cus %>%
  gather('type', 'value', -Customer) %>%
  group_by(Customer, type, value) %>%
  count() %>%
  # take the max per customer and per column, not across both columns
  group_by(Customer, type) %>%
  filter(n == max(n)) %>%
  group_by(Customer) %>%
  nest() %>%
  mutate(
    Product = map_chr(data, ~str_c(filter(.x, type == 'Product') %>% pull(value), collapse = ', ')),
    Store = map_chr(data, ~str_c(filter(.x, type == 'Store') %>% pull(value), collapse = ', '))
  ) %>%
  select(-data)
The result is:
# A tibble: 3 x 3
Customer Product Store
<fct> <chr> <chr>
1 C-01 COKE NYC
2 C-02 BURGER Chicago, Detroit
3 C-03 CHICKEN, FISH LA, San Fran
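The same "show every tied value" idea can also be sketched without reshaping, by collapsing the tied values inside a small helper. The name all_modes is hypothetical and this is an alternative sketch rather than the code above; it assumes dplyr 1.0 or later for across().

```r
library(dplyr)

Cus <- data.frame(
  Customer = c("C-01", "C-01", "C-02", "C-02", "C-02", "C-02", "C-03", "C-03"),
  Product = c("COKE", "COKE", "FRIES", "SHAKE", "BURGER", "BURGER", "CHICKEN", "FISH"),
  Store = c("NYC", "NYC", "Chicago", "Chicago", "Detroit", "Detroit", "LA", "San Fran"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: all values tied for the highest count, comma-separated.
all_modes <- function(x) {
  counts <- table(x)
  paste(names(counts)[counts == max(counts)], collapse = ", ")
}

Cus_all <- Cus %>%
  group_by(Customer) %>%
  summarise(across(c(Product, Store), all_modes))

Cus_all
```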
Using SO's favourite Mode function (though you could use any):
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
In base R:
aggregate(. ~ Customer, lapply(Cus, as.character), Mode)
# Customer Product Store
# 1 C-01 COKE NYC
# 2 C-02 BURGER Chicago
# 3 C-03 CHICKEN LA
Using dplyr:
library(dplyr)
Cus %>%
  group_by(Customer) %>%
  summarise_all(Mode)
# # A tibble: 3 x 3
# Customer Product Store
# <fctr> <fctr> <fctr>
# 1 C-01 COKE NYC
# 2 C-02 BURGER Chicago
# 3 C-03 CHICKEN LA
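Since the question also mentions continuous variables, here is a sketch that combines Mode for categorical columns with mean for numeric ones in a single summarise step. It assumes dplyr 1.0 or later, where across() supersedes summarise_all. The Amount column and its values are made up for illustration and are not part of the original data. Note that this Mode returns the first-encountered value among ties rather than the alphabetically first one.

```r
library(dplyr)

Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

Cus <- data.frame(
  Customer = c("C-01", "C-01", "C-02", "C-02", "C-02", "C-02", "C-03", "C-03"),
  Product = c("COKE", "COKE", "FRIES", "SHAKE", "BURGER", "BURGER", "CHICKEN", "FISH"),
  Store = c("NYC", "NYC", "Chicago", "Chicago", "Detroit", "Detroit", "LA", "San Fran"),
  Amount = c(2, 4, 3, 5, 6, 6, 8, 10),  # hypothetical numeric column
  stringsAsFactors = FALSE
)

Cus_Summary <- Cus %>%
  group_by(Customer) %>%
  summarise(
    across(where(is.character), Mode),  # most frequent value per categorical column
    across(where(is.numeric), mean)     # average per numeric column
  )

Cus_Summary
```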