Analysis of products purchased after certain days

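I have been trying to do a sequential analysis of products purchased after a certain period of time: for example, which product combinations customers bought 7 days later, and what percentage of customers bought each combination. I tried the arulesSequences package, but the structure of my data does not match what the package expects. My columns are user ID, purchase date, product ID, and product name. I have searched a lot but found no clear approach.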

Dayy        UID     leaf_category_name  leaf_category_id
5/1/2018    47      Cubes               38860
5/1/2018    272     Pastas & Noodles    34616
5/1/2018    1827    Flavours & Spices   34619
5/1/2018    3505    Feature Phones      1506

My data looks like this: UID is the user ID, and the leaf category is simply the product purchased. The data set is large, with 2,049,278 rows.

The code I tried:

library(Matrix)
library(arules)
library(arulesSequences)

library(arulesViz)

#splitting data into transactions
transactions <- as(split(data$leaf_category_id, data$UID), "transactions")

frequent_sequences <- cspade(transactions, parameter=list(support=0.5))

# Convert tabular data to sequences. Item is in
# column 1, sequence ID is column 2, and event ID is column 3.
seqs = make_sequences(data, item_col = 1, sid_col = 2, eid_col = 3)             

# generate frequent sequential patterns with minimum
# support of 0.1 and maximum of 6 elements
fseq = spade(seqs, 0.1, 6)

I want to see the sequences of products purchased after a certain number of days. Can someone help me with this?

Thanks

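The apriori route is a good one. Since we do not have your data, we can use a well-known data set such as Groceries as an example (in your case, you can first subset your data to the purchases after the date you want):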

library(arules)
data(Groceries)

# here you can see the products with the highest support
frequentItems <- eclat(Groceries, parameter = list(supp = 0.07, maxlen = 15))
inspect(frequentItems)
     items                         support    count
[1]  {other vegetables,whole milk} 0.07483477  736 
[2]  {whole milk}                  0.25551601 2513 
[3]  {other vegetables}            0.19349263 1903 
[4]  {rolls/buns}                  0.18393493 1809 
[5]  {yogurt}                      0.13950178 1372 
[6]  {soda}                        0.17437722 1715 
[7]  {root vegetables}             0.10899847 1072 
[8]  {tropical fruit}              0.10493137 1032 
[9]  {bottled water}               0.11052364 1087 
[10] {sausage}                     0.09395018  924 
[11] {shopping bags}               0.09852567  969 
[12] {citrus fruit}                0.08276563  814 
[13] {pastry}                      0.08896797  875 
[14] {pip fruit}                   0.07564820  744 
[15] {whipped/sour cream}          0.07168277  705 
[16] {fruit/vegetable juice}       0.07229283  711 
[17] {newspapers}                  0.07981698  785 
[18] {bottled beer}                0.08052872  792 
[19] {canned beer}                 0.07768175  764 

If you like, you can plot it:

itemFrequencyPlot(Groceries, topN=5, type="absolute")

Then you can look at the association rules:

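association <- apriori(Groceries, parameter = list(supp = 0.001, conf = 0.5))
# sort by confidence so that the strongest rules come first
association_conf <- sort(association, by = "confidence", decreasing = TRUE)
inspect(head(association_conf))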


  lhs                                           rhs                support     confidence lift     count
[1] {rice,sugar}                               => {whole milk}       0.001220132 1          3.913649 12   
[2] {canned fish,hygiene articles}             => {whole milk}       0.001118454 1          3.913649 11   
[3] {root vegetables,butter,rice}              => {whole milk}       0.001016777 1          3.913649 10   
[4] {root vegetables,whipped/sour cream,flour} => {whole milk}       0.001728521 1          3.913649 17   
[5] {butter,soft cheese,domestic eggs}         => {whole milk}       0.001016777 1          3.913649 10   
[6] {citrus fruit,root vegetables,soft cheese} => {other vegetables} 0.001016777 1          5.168156 10   

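In the last column you can see the count, i.e. how many times each rule was found. This can be read as "how many rows" and, if each row is a customer, as the number of customers. However, you have to think about what "how many customers" means in your case: for example, whether you want a,b,a,c >>> count = 4 or a,b,a,c >>> count = 3 (pseudocode). That depends on your data, so you will have to evaluate it.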
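
To make the counting question concrete, here is a minimal base R sketch on made-up rows in the question's layout. It keeps only purchases made at least 7 days after each user's first purchase and then counts customers rather than rows. The toy rows, the month/day/year date format, the `min_gap` threshold, and the choice of "first purchase" as the reference date are all assumptions for illustration, and it looks at single products rather than combinations.

```r
# Toy data in the question's layout; these rows are made up for illustration.
purchases <- data.frame(
  Dayy = c("5/1/2018", "5/9/2018", "5/1/2018", "5/3/2018",
           "5/10/2018", "5/10/2018", "5/2/2018"),
  UID  = c(47, 47, 272, 272, 272, 272, 1827),
  leaf_category_name = c("Cubes", "Pastas & Noodles", "Cubes", "Cubes",
                         "Pastas & Noodles", "Pastas & Noodles", "Feature Phones"),
  stringsAsFactors = FALSE
)

# Assumption: dates are month/day/year; use "%d/%m/%Y" if they are day/month/year.
purchases$date <- as.Date(purchases$Dayy, format = "%m/%d/%Y")

# Days elapsed since each user's first purchase.
first_date <- ave(as.numeric(purchases$date), purchases$UID, FUN = min)
purchases$days_since_first <- as.numeric(purchases$date) - first_date

min_gap <- 7  # hypothetical threshold for "7 days later"
later <- purchases[purchases$days_since_first >= min_gap, ]

# Keep each (user, product) pair once, so repeat purchases do not inflate the count.
later_unique <- unique(later[, c("UID", "leaf_category_name")])
customers_per_product <- table(later_unique$leaf_category_name)

# Share of all customers who bought each product at least min_gap days
# after their first purchase.
share <- customers_per_product / length(unique(purchases$UID))

print(nrow(later))             # number of qualifying purchase rows
print(customers_per_product)   # number of distinct customers per product
print(round(share, 3))
```

Comparing `nrow(later)` with `customers_per_product` shows the two ways of counting: one counts every purchase, the other counts each customer once per product.
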
Edit
Finally, you can have a look at this and, as you said, the cspade algorithm can also help.
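
As a rough sketch of how the question's long-format data might be reshaped for cspade: each basket needs a sequence ID (here the user) and an integer event ID (here the day number), ordered by both. The toy rows, the month/day/year date format, the temporary file, and the `support` and `mingap` values below are assumptions for illustration, not a tested recipe for the full data set. The cspade call only runs if arulesSequences is installed.

```r
# Toy data in the question's layout; these rows are made up for illustration.
purchases <- data.frame(
  Dayy = c("5/1/2018", "5/9/2018", "5/1/2018", "5/1/2018", "5/10/2018"),
  UID  = c(47, 47, 272, 272, 272),
  leaf_category_id = c(38860L, 34616L, 38860L, 34619L, 34616L)
)

# Assumption: dates are month/day/year.
purchases$date <- as.Date(purchases$Dayy, format = "%m/%d/%Y")

# eventID = whole days since the earliest date in the data (starting at 1),
# so differences between event IDs are differences in days.
purchases$eventID <- as.integer(purchases$date - min(purchases$date)) + 1L

# One basket per (user, day): the distinct category IDs bought that day.
baskets <- aggregate(
  leaf_category_id ~ UID + eventID, data = purchases,
  FUN = function(ids) paste(sort(unique(ids)), collapse = " ")
)

# cspade expects the input ordered by sequence ID, then event ID.
baskets <- baskets[order(baskets$UID, baskets$eventID), ]
baskets$size <- lengths(strsplit(baskets$leaf_category_id, " "))

# One line per basket: sequenceID eventID SIZE item item ...
basket_lines <- paste(baskets$UID, baskets$eventID, baskets$size,
                      baskets$leaf_category_id)
print(basket_lines)

if (requireNamespace("arulesSequences", quietly = TRUE)) {
  library(arulesSequences)
  basket_file <- tempfile(fileext = ".txt")
  writeLines(basket_lines, basket_file)
  seq_data <- read_baskets(basket_file,
                           info = c("sequenceID", "eventID", "SIZE"))
  # mingap = 7: consecutive elements of a pattern must be at least 7 days apart.
  patterns <- cspade(seq_data, parameter = list(support = 0.5, mingap = 7))
  print(as(patterns, "data.frame"))
}
```

Category IDs are used as items instead of names because the basket file is whitespace-separated and names such as "Pastas & Noodles" contain spaces. The support of each pattern is the share of users whose purchase history contains it.
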