Analysis of products purchased after a certain number of days
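I have been trying to run a sequential analysis of products purchased after a certain period of time: for example, which product combinations customers buy 7 days later, and what percentage of customers buy each combination. I tried the arulesSequences package, but the structure of my data does not match what the package expects. My columns are user ID, purchase date, product ID, and product name. I have searched a lot but have not found a clear method.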
Dayy UID leaf_category_name leaf_category_id
5/1/2018 47 Cubes 38860
5/1/2018 272 Pastas & Noodles 34616
5/1/2018 1827 Flavours & Spices 34619
5/1/2018 3505 Feature Phones 1506
My data looks like this: UID is the user ID, and the leaf category is simply the product purchased.
The dataset is large, with 2,049,278 rows.
The code I tried:
library(Matrix)
library(arules)
library(arulesSequences)
library(arulesViz)
#splitting data into transactions
transactions <- as(split(data$leaf_category_id, data$UID), "transactions")
frequent_sequences <- cspade(transactions, parameter=list(support=0.5))
and
# Convert tabular data to sequences. Item is in
# column 1, sequence ID is column 2, and event ID is column 3.
seqs = make_sequences(data, item_col = 1, sid_col = 2, eid_col = 3)
# generate frequent sequential patterns with minimum
# support of 0.1 and maximum of 6 elements
fseq = spade(seqs, 0.1, 6)
I want to see the sequences of products purchased after a certain number of days.
Can anyone help me with this?
Thanks
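The apriori route is a good one. Since I don't have your data, we can use a well-known dataset such as Groceries as an example (in your case, you can first subset your data to the purchases made after the date you are interested in):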
library(arules)
data(Groceries)
# here you can see the products with the highest support
frequentItems <- eclat(Groceries, parameter = list(supp = 0.07, maxlen = 15))
inspect(frequentItems)
items support count
[1] {other vegetables,whole milk} 0.07483477 736
[2] {whole milk} 0.25551601 2513
[3] {other vegetables} 0.19349263 1903
[4] {rolls/buns} 0.18393493 1809
[5] {yogurt} 0.13950178 1372
[6] {soda} 0.17437722 1715
[7] {root vegetables} 0.10899847 1072
[8] {tropical fruit} 0.10493137 1032
[9] {bottled water} 0.11052364 1087
[10] {sausage} 0.09395018 924
[11] {shopping bags} 0.09852567 969
[12] {citrus fruit} 0.08276563 814
[13] {pastry} 0.08896797 875
[14] {pip fruit} 0.07564820 744
[15] {whipped/sour cream} 0.07168277 705
[16] {fruit/vegetable juice} 0.07229283 711
[17] {newspapers} 0.07981698 785
[18] {bottled beer} 0.08052872 792
[19] {canned beer} 0.07768175 764
If you like, you can plot the item frequencies:
itemFrequencyPlot(Groceries, topN=5, type="absolute")
Then you can look at the association rules:
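association <- apriori(Groceries, parameter = list(supp = 0.001, conf = 0.5))
# sort by confidence so that the strongest rules come first
association_conf <- sort(association, by = "confidence", decreasing = TRUE)
inspect(head(association_conf))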
lhs rhs support confidence lift count
[1] {rice,sugar} => {whole milk} 0.001220132 1 3.913649 12
[2] {canned fish,hygiene articles} => {whole milk} 0.001118454 1 3.913649 11
[3] {root vegetables,butter,rice} => {whole milk} 0.001016777 1 3.913649 10
[4] {root vegetables,whipped/sour cream,flour} => {whole milk} 0.001728521 1 3.913649 17
[5] {butter,soft cheese,domestic eggs} => {whole milk} 0.001016777 1 3.913649 10
[6] {citrus fruit,root vegetables,soft cheese} => {other vegetables} 0.001016777 1 5.168156 10
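The last column shows the count, i.e. how many times each rule occurs. This can be read as "how many rows"; if each row is one customer, it is the number of customers. However, if you want, for example, a,b,a,c >>> count = 4
or a,b,a,c >>> count = 3
(pseudocode), you have to think about what that means in terms of the number of customers. In that case, you need to evaluate your own data.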
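To make the "count customers, not rows" point concrete for the 7-day question, here is a minimal base R sketch. It is not the arules approach above, just a plain self-join. The toy values, the `purchases` and `min_gap_days` names, and the choice to count each customer at most once per (first, later) pair are my assumptions, not something taken from your data.

```r
# Toy purchases in the same layout as the question (all values are made up).
purchases <- data.frame(
  Dayy = as.Date(c("2018-05-01", "2018-05-09", "2018-05-01", "2018-05-03",
                   "2018-05-02", "2018-05-12", "2018-05-12")),
  UID = c(47, 47, 272, 272, 1827, 1827, 1827),
  leaf_category_name = c("Cubes", "Pastas & Noodles", "Cubes", "Pastas & Noodles",
                         "Cubes", "Pastas & Noodles", "Feature Phones"),
  stringsAsFactors = FALSE
)

min_gap_days <- 7  # assumed threshold: "bought at least 7 days later"

# Self-join on UID: every pair of purchases made by the same customer.
pairs <- merge(purchases, purchases, by = "UID", suffixes = c("_first", "_later"))

# Keep only pairs where the later purchase is at least min_gap_days after the first.
gap <- as.numeric(pairs$Dayy_later - pairs$Dayy_first)
pairs <- pairs[gap >= min_gap_days, ]

# One row per customer and (first, later) pair, so a customer is counted once.
pairs <- unique(pairs[, c("UID", "leaf_category_name_first", "leaf_category_name_later")])

# Number of customers per pair, and their share of all customers.
result <- aggregate(UID ~ leaf_category_name_first + leaf_category_name_later,
                    data = pairs, FUN = length)
names(result) <- c("first", "later", "customers")
result$share <- result$customers / length(unique(purchases$UID))
result <- result[order(-result$customers), ]
print(result)
```

In this toy data, "Cubes" followed by "Pastas & Noodles" at least 7 days later holds for 2 of the 3 customers. On roughly 2 million rows the self-join can get large, so you would probably want to restrict it first (for example to a date range or to the most frequent categories).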
Edit
Finally, you can take a look at this; as you mentioned, the cspade algorithm can also help.
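Regarding cspade and the "data structure does not fit" problem: as far as I understand the arulesSequences documentation, cspade wants a transactions object whose transactionInfo carries a sequenceID (the customer) and an eventID (the time), sorted by both. The sketch below is one way to get there. The toy values, the assumption that your dates are month/day/year, the helper names (`basket_key`, `basket_info`), and the use of `mingap = 7` to require at least 7 days between consecutive elements are all my assumptions, so please check them against the package documentation and your data.

```r
# Toy purchases in the question's layout (made-up values, one row per item).
purchases <- data.frame(
  Dayy = c("5/1/2018", "5/1/2018", "5/9/2018", "5/1/2018", "5/10/2018"),
  UID = c(47, 47, 47, 272, 272),
  leaf_category_id = c("38860", "34616", "34619", "38860", "34619"),
  stringsAsFactors = FALSE
)
# Assumption: dates are month/day/year.
purchases$Dayy <- as.Date(purchases$Dayy, format = "%m/%d/%Y")

# eventID = day number counted from the earliest date, so gaps are in days.
purchases$eventID <- as.integer(purchases$Dayy - min(purchases$Dayy)) + 1L
# sequenceID = one integer per customer.
purchases$sequenceID <- as.integer(factor(purchases$UID))

# Sort by customer, then by time.
purchases <- purchases[order(purchases$sequenceID, purchases$eventID), ]

# One basket per customer and day, kept in the sorted order.
basket_key <- paste(purchases$sequenceID, purchases$eventID, sep = "_")
basket_key <- factor(basket_key, levels = unique(basket_key))
baskets <- lapply(split(purchases$leaf_category_id, basket_key), unique)
basket_info <- unique(purchases[, c("sequenceID", "eventID")])

# The mining step only runs if arulesSequences is installed.
if (requireNamespace("arulesSequences", quietly = TRUE)) {
  library(arulesSequences)
  trans <- as(baskets, "transactions")
  transactionInfo(trans)$sequenceID <- basket_info$sequenceID
  transactionInfo(trans)$eventID <- basket_info$eventID
  # support is the share of customers whose sequence contains the pattern
  seqs <- cspade(trans, parameter = list(support = 0.5, mingap = 7))
  inspect(seqs)
}
```

The difference from the code in the question is that the baskets are split per customer and day rather than per customer only, and that the time information is attached before calling cspade. With your full data you would likely need a much lower support than 0.5.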