How to summarize the number of products purchased by each customer based on a product list in R
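I have two data frames in R: one contains a list of product SKUs (product IDs), and the other is a purchase log containing the order number, customer email, purchase date, product ID (product_sku) and quantity purchased.
purchase_dataframe: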
order_number | email | product_sku | quantity | purchase_date
1000 |customer1@sample.com | RT-100 | 2 | 2020-01-01
1000 |customer1@sample.com | CT-300 | 1 | 2020-01-01
1000 |customer1@sample.com | Phone-100 | 1 | 2020-01-01
2000 |customer2@sample.com | Phone-200 | 1 | 2020-04-20
2000 |customer2@sample.com | OM-200 | 1 | 2020-04-20
3000 |customer3@sample.com | CT-300 | 3 | 2020-03-15
4000 |customer1@sample.com | OM-200 | 5 | 2020-07-07
5000 |customer4@sample.com | Phone-200 | 3 | 2020-08-19
6000 |customer3@sample.com | Phone-100 | 1 | 2020-09-22
6000 |customer3@sample.com | RT-100 | 1 | 2020-09-22
tv_list:
SKU
RT-100
CT-300
OM-200
LL-400
...
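I would like to count the total number of TVs each customer has purchased over their lifetime, ignoring all other products (e.g. phones). The data frame tv_list should help me identify which SKUs are TVs and which are not, because I have many different TV SKUs; the list above is only a small sample.
Ideally, the resulting data frame would look like this: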
email | number_purchased_tv
customer1@sample.com | 8
customer2@sample.com | 1
customer3@sample.com | 4
customer4@sample.com | 0
For reproducibility, and to make my example easier to follow, here is the code for the two sample tables above:
purchase_dataframe <- data.frame(order_number = c(1000,1000,1000, 2000,2000, 3000, 4000, 5000, 6000, 6000),
email = c("customer1@sample.com","customer1@sample.com", "customer1@sample.com","customer2@sample.com",
"customer2@sample.com","customer3@sample.com","customer1@sample.com","customer4@sample.com",
"customer3@sample.com","customer3@sample.com"),
product_sku = c("RT-100", "CT-300", "Phone-100", "Phone-200", "OM-200", "CT-300", "OM-200", "Phone-200", "Phone-100", "RT-100"),
quantity = c(2,1,1,1,1,3,5,3,1,1),
purchase_date = c("2020-01-01","2020-01-01","2020-01-01","2020-04-20","2020-04-20","2020-03-15","2020-07-07","2020-08-19","2020-09-22","2020-09-22"))
tv_list <- data.frame(SKU = c("RT-100", "OM-200", "CT-300", "LL-400", "ZV-700"))
Thank you very much!
The following uses dplyr to do what you asked:
library(dplyr)
library(data.table)
purchase_dataframe %>% dplyr::group_by(email) %>% dplyr::summarise(sumtv = sum(quantity[product_sku %in% unique(tv_list$SKU)]))
# A tibble: 4 x 2
email sumtv
<chr> <dbl>
1 customer1@sample.com 8
2 customer2@sample.com 1
3 customer3@sample.com 4
4 customer4@sample.com 0
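One reason the one-step version above still returns a row for customer4 is that subsetting with an all-FALSE logical vector yields an empty vector, and sum() of an empty numeric vector is 0. A minimal base R sketch of that step, using a hypothetical single-row customer (the variable names are illustrative and not taken from the answer):

```r
# Hypothetical mini-example: one customer who bought only a phone
quantity    <- c(3)
product_sku <- c("Phone-200")
tv_skus     <- c("RT-100", "OM-200", "CT-300")

is_tv       <- product_sku %in% tv_skus  # FALSE: the SKU is not a TV
tv_quantity <- quantity[is_tv]           # empty numeric vector
total       <- sum(tv_quantity)          # sum of an empty vector is 0

print(total)
```

This is why no separate join is needed to recover customers without TV purchases when the filter is applied inside summarise() rather than before group_by().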
Edit: please find above a correction to the sumtv figure, and below a data.table solution:
library(dplyr)       # provides the %>% pipe used below
library(data.table)
# as.data.table() returns a copy, so purchase_dataframe itself is not modified
purchase_datatable <- as.data.table(purchase_dataframe)
purchase_datatable[, sumtv := sum(quantity[product_sku %in% unique(tv_list$SKU)]), by = "email"][
  , .(email, sumtv)] %>% unique
email sumtv
1: customer1@sample.com 8
2: customer2@sample.com 1
3: customer3@sample.com 4
4: customer4@sample.com 0
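Benchmarking with microbenchmark gives the data.table solution an advantage of nearly 50% (median 1.80 vs. 3.20 in the timings below, where the first row is the data.table expression and the second the dplyr one). IMO data.table is an excellent package and well worth learning through its vignettes.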
library(microbenchmark)
microbenchmark(purchase_datatable[,sumtv := sum(quantity[product_sku %in% unique(tv_list$SKU)]), by="email"][
,.(email, sumtv)] %>% unique, purchase_dataframe %>% dplyr::group_by(email) %>% dplyr::summarise(sumtv = sum(quantity[product_sku %in% unique(tv_list$SKU)]))
)
min lq mean median uq max neval
1.268 1.42700 1.823445 1.80300 2.0887 2.8332 100
2.715 2.98025 3.250287 3.20355 3.3509 8.8255 100
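The data.table call benchmarked above adds a column by reference and then de-duplicates the result. A possible alternative, sketched here and not benchmarked, is to aggregate directly in `j` so that one row per email is returned in a single step. The compact data construction below is an assumption made to keep the example self-contained; it reproduces only the three columns needed from the question's sample data.

```r
library(data.table)

# Compact copy of the relevant columns from the question's sample data
purchase_datatable <- data.table(
  email = paste0("customer", c(1, 1, 1, 2, 2, 3, 1, 4, 3, 3), "@sample.com"),
  product_sku = c("RT-100", "CT-300", "Phone-100", "Phone-200", "OM-200",
                  "CT-300", "OM-200", "Phone-200", "Phone-100", "RT-100"),
  quantity = c(2, 1, 1, 1, 1, 3, 5, 3, 1, 1)
)
tv_list <- data.frame(SKU = c("RT-100", "OM-200", "CT-300", "LL-400", "ZV-700"))

# One row per email: sum the quantities of the rows whose SKU is in tv_list
tv_totals <- purchase_datatable[
  , .(sumtv = sum(quantity[product_sku %in% tv_list$SKU])),
  by = email
]

print(tv_totals)
```

Unlike the `:=` form, this leaves purchase_datatable without an extra sumtv column.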
One option using base R:
#Match and index
purchase_dataframe$ProductIndex <- tv_list[match(purchase_dataframe$product_sku,tv_list$SKU),'SKU']
purchase_dataframe$Counter <- ifelse(is.na(purchase_dataframe$ProductIndex),0,purchase_dataframe$quantity)
#Aggregate
Res <- aggregate(Counter~email,data=purchase_dataframe,sum,na.rm=T)
Output:
email Counter
1 customer1@sample.com 8
2 customer2@sample.com 1
3 customer3@sample.com 4
4 customer4@sample.com 0
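The same base R idea can be sketched without adding helper columns to purchase_dataframe: build a vector of TV quantities (0 for non-TV rows) and aggregate it by email. The compact data construction and the names tv_qty and result are assumptions for illustration only.

```r
# Compact copy of the relevant columns from the question's sample data
purchase_dataframe <- data.frame(
  email = paste0("customer", c(1, 1, 1, 2, 2, 3, 1, 4, 3, 3), "@sample.com"),
  product_sku = c("RT-100", "CT-300", "Phone-100", "Phone-200", "OM-200",
                  "CT-300", "OM-200", "Phone-200", "Phone-100", "RT-100"),
  quantity = c(2, 1, 1, 1, 1, 3, 5, 3, 1, 1),
  stringsAsFactors = FALSE
)
tv_list <- data.frame(SKU = c("RT-100", "OM-200", "CT-300", "LL-400", "ZV-700"))

# TRUE/FALSE is treated as 1/0, so non-TV rows contribute a quantity of 0
tv_qty <- purchase_dataframe$quantity *
  (purchase_dataframe$product_sku %in% tv_list$SKU)

# Sum the TV quantities per email; every email keeps a row, even with 0 TVs
result <- aggregate(
  list(number_purchased_tv = tv_qty),
  by = list(email = purchase_dataframe$email),
  FUN = sum
)

print(result)
```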
tv_purchases <-
purchase_dataframe %>%
group_by(email) %>%
filter(product_sku %in% tv_list$SKU) %>%
summarise(number_purchased_tv = sum(as.numeric(quantity)))
## join tv_purchases on distinct emails, to also have the 'customer4@sample.com 0' row
purchase_dataframe %>%
distinct(email) %>%
left_join(tv_purchases) %>% ## emails which are not in tv_purchases will have NAs
mutate(number_purchased_tv = case_when(is.na(number_purchased_tv) ~ 0, ## NAs become zeros
TRUE ~ number_purchased_tv) ## non-NAs stay as they are
)
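Because filter() runs before summarise() here, customers with no TV rows disappear from tv_purchases, which is why the join back onto the distinct emails is needed. As a sketch of the same pattern, the NA replacement could also be written with dplyr's coalesce(), and the join key can be named explicitly. The compact data construction and the name result are assumptions for illustration.

```r
library(dplyr)

# Compact copy of the relevant columns from the question's sample data
purchase_dataframe <- data.frame(
  email = paste0("customer", c(1, 1, 1, 2, 2, 3, 1, 4, 3, 3), "@sample.com"),
  product_sku = c("RT-100", "CT-300", "Phone-100", "Phone-200", "OM-200",
                  "CT-300", "OM-200", "Phone-200", "Phone-100", "RT-100"),
  quantity = c(2, 1, 1, 1, 1, 3, 5, 3, 1, 1),
  stringsAsFactors = FALSE
)
tv_list <- data.frame(SKU = c("RT-100", "OM-200", "CT-300", "LL-400", "ZV-700"))

# Totals for customers who bought at least one TV
tv_purchases <- purchase_dataframe %>%
  filter(product_sku %in% tv_list$SKU) %>%
  group_by(email) %>%
  summarise(number_purchased_tv = sum(quantity))

# Join onto all emails; coalesce() turns the NA of non-buyers into 0
result <- purchase_dataframe %>%
  distinct(email) %>%
  left_join(tv_purchases, by = "email") %>%
  mutate(number_purchased_tv = coalesce(number_purchased_tv, 0))

print(result)
```

Naming the key with by = "email" avoids the message that left_join() prints when it has to guess the join columns.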
Here is the data you provided:
library('dplyr')
purchase_dataframe <- data.frame(order_number = c(1000,1000,1000, 2000,2000, 3000, 4000, 5000, 6000, 6000),
email = c("customer1@sample.com","customer1@sample.com", "customer1@sample.com","customer2@sample.com",
"customer2@sample.com","customer3@sample.com","customer1@sample.com","customer4@sample.com",
"customer3@sample.com","customer3@sample.com"),
product_sku = c("RT-100", "CT-300", "Phone-100", "Phone-200", "OM-200", "CT-300", "OM-200", "Phone-200", "Phone-100", "RT-100"),
quantity = c(2,1,1,1,1,3,5,3,1,1),
purchase_date = c("2020-01-01","2020-01-01","2020-01-01","2020-04-20","2020-04-20","2020-03-15","2020-07-07","2020-08-19","2020-09-22","2020-09-22"))
tv_list <- data.frame(SKU = c("RT-100", "OM-200", "CT-300", "LL-400", "ZV-700"))
This gives you the summary, but omits any emails (customers) that have not purchased a TV:
total_tvs_by_cusomter <- purchase_dataframe %>%
filter(product_sku %in% tv_list$SKU) %>%
group_by(email) %>%
mutate(quantity = as.numeric(quantity)) %>%
summarise(number_purchased_tv = sum(quantity))
Result:
# A tibble: 3 x 2
email number_purchased_tv
<chr> <dbl>
1 customer1@sample.com 8
2 customer2@sample.com 1
3 customer3@sample.com 4
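To see which customers were dropped by the filter, one possible check (a base R sketch with illustrative names tv_rows and no_tv_customers, and a compact copy of the sample data) is to compare the full set of emails with the emails that appear in TV rows:

```r
# Compact copy of the relevant columns from the question's sample data
purchase_dataframe <- data.frame(
  email = paste0("customer", c(1, 1, 1, 2, 2, 3, 1, 4, 3, 3), "@sample.com"),
  product_sku = c("RT-100", "CT-300", "Phone-100", "Phone-200", "OM-200",
                  "CT-300", "OM-200", "Phone-200", "Phone-100", "RT-100"),
  quantity = c(2, 1, 1, 1, 1, 3, 5, 3, 1, 1),
  stringsAsFactors = FALSE
)
tv_list <- data.frame(SKU = c("RT-100", "OM-200", "CT-300", "LL-400", "ZV-700"))

# Rows that are TV purchases
tv_rows <- purchase_dataframe[purchase_dataframe$product_sku %in% tv_list$SKU, ]

# Emails present in the log but absent from the TV rows
no_tv_customers <- setdiff(unique(purchase_dataframe$email), unique(tv_rows$email))

print(no_tv_customers)
```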
If you want to keep the emails/customers that have not purchased a TV and include them with a count of 0:
total_tvs_by_cusomter <- left_join(unique(purchase_dataframe %>%
select(email)), total_tvs_by_cusomter)
total_tvs_by_cusomter[is.na(total_tvs_by_cusomter)] <- 0
Result:
email number_purchased_tv
1 customer1@sample.com 8
2 customer2@sample.com 1
3 customer3@sample.com 4
4 customer4@sample.com 0
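For readers who prefer not to depend on dplyr, the same filter, summarise and left-join steps can be sketched in base R with aggregate() and merge(all.x = TRUE). The compact data construction and the names emails, tv_only, tv_totals and result are assumptions for illustration.

```r
# Compact copy of the relevant columns from the question's sample data
purchase_dataframe <- data.frame(
  email = paste0("customer", c(1, 1, 1, 2, 2, 3, 1, 4, 3, 3), "@sample.com"),
  product_sku = c("RT-100", "CT-300", "Phone-100", "Phone-200", "OM-200",
                  "CT-300", "OM-200", "Phone-200", "Phone-100", "RT-100"),
  quantity = c(2, 1, 1, 1, 1, 3, 5, 3, 1, 1),
  stringsAsFactors = FALSE
)
tv_list <- data.frame(SKU = c("RT-100", "OM-200", "CT-300", "LL-400", "ZV-700"))

# All customers, one row each
emails <- data.frame(email = unique(purchase_dataframe$email),
                     stringsAsFactors = FALSE)

# Keep only TV rows, then total the quantity per email
tv_only   <- purchase_dataframe[purchase_dataframe$product_sku %in% tv_list$SKU, ]
tv_totals <- aggregate(quantity ~ email, data = tv_only, FUN = sum)
names(tv_totals)[2] <- "number_purchased_tv"

# all.x = TRUE keeps customers without TV purchases (their total is NA)
result <- merge(emails, tv_totals, by = "email", all.x = TRUE)
result$number_purchased_tv[is.na(result$number_purchased_tv)] <- 0

print(result)
```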