How to find the number of times items and their combinations were purchased?

Question

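I have a data.table showing which items customers bought. Each row represents one customer and each column represents one item. The table has the same number of columns for every customer, and the values in the item* columns are 1 or 0 depending on whether the customer bought the given item. A simple version of the table looks like this: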

data.table(customerID = c(1,2,3,4,5),
           item1 = c(1,0,0,1,1),
           item2 = c(1,0,1,1,1),
           item3 = c(1,0,0,0,1),
           item4 = c(0,1,1,1,1))

The table shows that customer 1 bought items 1, 2 and 3, and that item 3 was bought by customers 1 and 5.

In the real case the data.table has many columns, so referring to them by name in code is impractical, but working with the data in long format is fine.

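I need to know how many times each single item was bought and how many times each combination of items was bought. For this example I would like to get something like: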

item1 3
item2 4
item3 2
item4 4
item1;item2 3
item1;item3 2
item1;item4 2
...
(same for other combinations of length 2)
...
item1;item2;item3 2
item1;item2;item4 2

...
up to combinations of 4 items.

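In addition, I need a table for each customer showing which combinations of products he or she bought.

Edit:

Thanks to three very helpful answers, I now know how to solve the first part of the question, that is, counting how many customers bought a given combination. The second part is still unanswered, however: I would like to know which customers bought which combinations.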

Answer 1

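Here is some quick-and-dirty code. The parameter n_items controls the maximum size of a bundle; the number of item columns is read from DT, which is assumed to be the data.table from the question with columns named item1, item2, and so on:

library(data.table)
library(magrittr)

# Long format: one row per (customer, purchased item), with items as integer ids
DT_melt <- DT[, melt(.SD, id.vars = "customerID", variable.factor = FALSE)
              ][value == 1
                ][, variable := as.integer(sub("item", "", variable))]
n_total <- ncol(DT) - 1L  # number of item columns
n_items <- 4L             # maximum bundle size
keep_track <- list()
for (i in seq_len(n_items)) {
  # every combination of i items, one combination per column
  combs <- combn(seq_len(n_total), i)
  # per combination: flag the customers who bought all of its items, then count them
  keep_track[[i]] <- apply(combs, 2, function(x)  DT_melt[, all(x %in% variable), by = customerID]) %>%
    lapply(function(x) sum(x[[2]])) %>% 
    setNames(apply(combs, 2, function(x) paste(paste0("item", x), collapse = ";")))
}
unlist(keep_track)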

This returns a named vector of counts:

#                   item1                   item2 
#                       3                       4 
#                   item3                   item4 
#                       2                       4 
#             item1;item2             item1;item3 
#                       3                       2 
#             item1;item4             item2;item3 
#                       2                       2 
#             item2;item4             item3;item4 
#                       3                       1 
#       item1;item2;item3       item1;item2;item4 
#                       2                       2 
#       item1;item3;item4       item2;item3;item4 
#                       1                       1 
# item1;item2;item3;item4 
#                       1
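Since the question says long-format data is fine, here is a minimal sketch of a different technique for pairs only: a self-join of the long table on customerID. It is not part of the answer above; the names long and pairs are illustrative, and larger bundles would still need the combn approach or further joins.

```r
library(data.table)

DT <- data.table(customerID = c(1, 2, 3, 4, 5),
                 item1 = c(1, 0, 0, 1, 1),
                 item2 = c(1, 0, 1, 1, 1),
                 item3 = c(1, 0, 0, 0, 1),
                 item4 = c(0, 1, 1, 1, 1))

# Long format: one row per (customer, purchased item)
long <- melt(DT, id.vars = "customerID", variable.name = "item",
             variable.factor = FALSE)[value == 1, .(customerID, item)]

# Self-join on customer pairs up items bought by the same customer;
# keeping only item < i.item lists each unordered pair once
pairs <- long[long, on = "customerID", allow.cartesian = TRUE][
  item < i.item, .N, by = .(combo = paste(item, i.item, sep = ";"))]

pairs
```

Pairs that nobody bought together simply do not appear in the result, whereas the combn-based code reports them with a count of 0.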

Answer 2

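Here is a base R option, so the data is first converted to a data frame (DT is assumed to be the data.table from the question):

df <- data.frame(DT)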
unique_product <- names(df[-1])

stack(unlist(sapply(seq_along(unique_product), function(x) 
     combn(unique_product, x, FUN = function(y) 
           setNames(sum(rowSums(df[y] == 1) == length(y)), 
            paste0(y, collapse = ";")), simplify = FALSE))))


#   values                     ind
#1       3                   item1
#2       4                   item2
#3       2                   item3
#4       4                   item4
#5       3             item1;item2
#6       2             item1;item3
#7       2             item1;item4
#8       2             item2;item3
#9       3             item2;item4
#10      1             item3;item4
#11      2       item1;item2;item3
#12      2       item1;item2;item4
#13      1       item1;item3;item4
#14      1       item2;item3;item4
#15      1 item1;item2;item3;item4

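We use combn to create every combination of the unique products and, for each combination, subset the corresponding columns of the data frame and count how many customers bought all of the products in it.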
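To make the rowSums step concrete, here is a minimal sketch for a single combination. The names y and bought_all are only illustrative, and the data is the example table from the question.

```r
# Example data from the question, as a plain data frame
df <- data.frame(customerID = c(1, 2, 3, 4, 5),
                 item1 = c(1, 0, 0, 1, 1),
                 item2 = c(1, 0, 1, 1, 1),
                 item3 = c(1, 0, 0, 0, 1),
                 item4 = c(0, 1, 1, 1, 1))

y <- c("item1", "item3")   # one combination (illustrative choice)

# A customer bought the whole combination when every selected column is 1,
# i.e. when the row sum over those columns equals the number of columns
bought_all <- rowSums(df[y] == 1) == length(y)

df$customerID[bought_all]  # customers 1 and 5
sum(bought_all)            # 2, the count reported for item1;item3
```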

To get the customers who bought each combination, we can continue with the same approach:

stack(unlist(sapply(seq_along(unique_product), function(x) 
     combn(unique_product, x, FUN = function(y) {
      # rows (customers) where every item in combination y was bought
      inds <- rowSums(df[y] == 1) == length(y)
      setNames(df$customerID[inds], 
             rep(paste0(y, collapse = ";"), sum(inds)))
             }, simplify = FALSE))))

#   values                     ind
#1       1                   item1
#2       4                   item1
#3       5                   item1
#4       1                   item2
#5       3                   item2
#6       4                   item2
#7       5                   item2
#8       1                   item3
#9       5                   item3
#10      2                   item4
#....

You can rename the columns as needed; here values is the customer ID and ind is a combination bought by that customer.
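If a per-customer view is wanted, the stacked result can be regrouped by customer ID. The following is a minimal sketch, assuming inds is computed from the columns in y; the names res and by_customer are hypothetical and not part of the answer.

```r
df <- data.frame(customerID = c(1, 2, 3, 4, 5),
                 item1 = c(1, 0, 0, 1, 1),
                 item2 = c(1, 0, 1, 1, 1),
                 item3 = c(1, 0, 0, 0, 1),
                 item4 = c(0, 1, 1, 1, 1))
unique_product <- names(df[-1])

# Customers per combination, stacked into two columns:
# values = customer ID, ind = combination label
res <- stack(unlist(lapply(seq_along(unique_product), function(x)
  combn(unique_product, x, FUN = function(y) {
    inds <- rowSums(df[y] == 1) == length(y)
    setNames(df$customerID[inds], rep(paste0(y, collapse = ";"), sum(inds)))
  }, simplify = FALSE))))

# Regroup by customer: one character vector of combinations per customer ID
by_customer <- split(as.character(res$ind), res$values)

by_customer[["2"]]   # customer 2 bought only item4
by_customer[["3"]]   # item2, item4 and item2;item4
```

The list is keyed by customer ID as a string, so a single customer's combinations can be looked up directly.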

Answer 3

A step-by-step approach using base R and data.table

Sample data

library(data.table)

DT <- data.table(customerID = c(1,2,3,4,5),
           item1 = c(1,0,0,1,1),
           item2 = c(1,0,1,1,1),
           item3 = c(1,0,0,0,1),
           item4 = c(0,1,1,1,1))

Code

#identify columns with items, grab their names
cols <- names(DT[,-1])

In the code below: if you only want combinations of up to n products, change 1:length(cols) to 1:n.

#put all combinations of items in a list
combos <- unlist( lapply( 1:length(cols), combn, x = cols, simplify = FALSE ), recursive = FALSE )

#calculate number of sold items per combo
l <- lapply( combos, function(x) {
  nrow( DT[ rowSums( DT[, x, with = FALSE ] ) == length( x ), ] )
})

#name the list based on the combo
names(l) <- lapply( combos, paste0, collapse = ";")

Output

str( l )

List of 15
$ item1                  : int 3
$ item2                  : int 4
$ item3                  : int 2
$ item4                  : int 4
$ item1;item2            : int 3
$ item1;item3            : int 2
$ item1;item4            : int 2
$ item2;item3            : int 2
$ item2;item4            : int 3
$ item3;item4            : int 1
$ item1;item2;item3      : int 2
$ item1;item2;item4      : int 2
$ item1;item3;item4      : int 1
$ item2;item3;item4      : int 1
$ item1;item2;item3;item4: int 1

Or create a data.table

data.table( rn = names(l), V1 = unlist(l, use.names = FALSE) )

#                         rn V1
# 1:                   item1  3
# 2:                   item2  4
# 3:                   item3  2
# 4:                   item4  4
# 5:             item1;item2  3
# 6:             item1;item3  2
# 7:             item1;item4  2
# 8:             item2;item3  2
# 9:             item2;item4  3
#10:             item3;item4  1
#11:       item1;item2;item3  2
#12:       item1;item2;item4  2
#13:       item1;item3;item4  1
#14:       item2;item3;item4  1
#15: item1;item2;item3;item4  1
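As a sketch of the "up to n products" variation mentioned above, the same steps can be run with a cap on the combination size and collected straight into a data.table. The cap n and the names counts, combo and N are assumptions for illustration only.

```r
library(data.table)

DT <- data.table(customerID = c(1, 2, 3, 4, 5),
                 item1 = c(1, 0, 0, 1, 1),
                 item2 = c(1, 0, 1, 1, 1),
                 item3 = c(1, 0, 0, 0, 1),
                 item4 = c(0, 1, 1, 1, 1))

cols <- names(DT)[-1]   # item columns
n <- 2L                 # hypothetical cap on combination size

# All combinations of 1..n items, as a flat list of character vectors
combos <- unlist(lapply(seq_len(n), combn, x = cols, simplify = FALSE),
                 recursive = FALSE)

# One row per combination with the number of customers who bought all of it
counts <- data.table(
  combo = vapply(combos, paste0, character(1), collapse = ";"),
  N     = vapply(combos, function(x)
            sum(rowSums(DT[, x, with = FALSE]) == length(x)), integer(1))
)

counts
```

With four items and n set to 2 this yields the 4 single items and 6 pairs, which keeps the number of combinations manageable when there are many item columns.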
