Removing inverted (reverse/duplicate) rules from Apriori result in R
I have implemented the Apriori algorithm on my dataset. The rules I get are inverted duplicates, i.e.:
inspect(head(rules))
lhs rhs support confidence lift count
[1] {252-ON-OFF} => {L30-ATLANTIC} 0.04545455 1 22 1
[2] {L30-ATLANTIC} => {252-ON-OFF} 0.04545455 1 22 1
[3] {252-ON-OFF} => {M01-A molle biconiche} 0.04545455 1 22 1
[4] {M01-A molle biconiche} => {252-ON-OFF} 0.04545455 1 22 1
[5] {L30-ATLANTIC} => {M01-A molle biconiche} 0.04545455 1 22 1
[6] {M01-A molle biconiche} => {L30-ATLANTIC} 0.04545455 1 22 1
As you can see, rule 1 and rule 2 are the same, only with the LHS and RHS swapped. Is there any way to remove these rules from the final result?
I saw this post link, but the suggested solution is not correct.
I also saw this post and tried both of its solutions:
Solution A:
rules <- rules[!is.redundant(rules)]
But the result is always the same:
inspect(head(rules))
lhs rhs support confidence lift count
[1] {252-ON-OFF} => {L30-ATLANTIC} 0.04545455 1 22 1
[2] {L30-ATLANTIC} => {252-ON-OFF} 0.04545455 1 22 1
[3] {252-ON-OFF} => {M01-A molle biconiche} 0.04545455 1 22 1
[4] {M01-A molle biconiche} => {252-ON-OFF} 0.04545455 1 22 1
[5] {L30-ATLANTIC} => {M01-A molle biconiche} 0.04545455 1 22 1
[6] {M01-A molle biconiche} => {L30-ATLANTIC} 0.04545455 1 22 1
Solution B:
# find redundant rules
subset.matrix <- is.subset(rules, rules)
subset.matrix[lower.tri(subset.matrix, diag=T)]
redundant <- colSums(subset.matrix, na.rm=T) > 1
which(redundant)
rules.pruned <- rules[!redundant]
inspect(rules.pruned)
lhs rhs support confidence lift count
[1] {} => {BRC-BRC} 0.04545455 0.04545455 1 1
[2] {} => {111-WINK} 0.04545455 0.04545455 1 1
[3] {} => {305-INGRAM HIGH} 0.04545455 0.04545455 1 1
[4] {} => {952-REVERS} 0.04545455 0.04545455 1 1
[5] {} => {002-LC2} 0.09090909 0.09090909 1 2
[6] {} => {252-ON-OFF} 0.04545455 0.04545455 1 1
[7] {} => {L30-ATLANTIC} 0.04545455 0.04545455 1 1
[8] {} => {M01-A molle biconiche} 0.04545455 0.04545455 1 1
[9] {} => {678-Portovenere} 0.04545455 0.04545455 1 1
[10] {} => {251-MET T.} 0.04545455 0.04545455 1 1
[11] {} => {324-D.S.3} 0.04545455 0.04545455 1 1
[12] {} => {L04-YUME} 0.04545455 0.04545455 1 1
[13] {} => {969-Lubekka} 0.04545455 0.04545455 1 1
[14] {} => {000-FUORI LISTINO} 0.04545455 0.04545455 1 1
[15] {} => {007-LC7} 0.04545455 0.04545455 1 1
[16] {} => {341-COS} 0.04545455 0.04545455 1 1
[17] {} => {601-ROBIE 1} 0.04545455 0.04545455 1 1
[18] {} => {608-TALIESIN 2} 0.04545455 0.04545455 1 1
[19] {} => {610-ROBIE 2} 0.04545455 0.04545455 1 1
[20] {} => {615-HUSSER} 0.04545455 0.04545455 1 1
[21] {} => {831-DAKOTA} 0.04545455 0.04545455 1 1
[22] {} => {997-997} 0.27272727 0.27272727 1 6
[23] {} => {412-CAB} 0.09090909 0.09090909 1 2
[24] {} => {S01-A doghe senza movimenti} 0.09090909 0.09090909 1 2
[25] {} => {708-Genoa} 0.09090909 0.09090909 1 2
[26] {} => {998-998} 0.54545455 0.54545455 1 12
Has anyone run into the same problem and knows how to solve it? Thanks for your help.
You could do it with brute force, by converting your rules object to a data.frame and iteratively comparing the LHS/RHS transaction vectors. Here is an example using the grocery.csv dataset:
inspect(head(groceryrules))
# convert rules object to data.frame
trans_frame <- data.frame(lhs = labels(lhs(groceryrules)), rhs = labels(rhs(groceryrules)), groceryrules@quality)
# loop through each row of trans_frame
rem_indx <- NULL
for (i in 1:nrow(trans_frame)) {
  trans_vec_a <- c(as.character(trans_frame[i, 1]), as.character(trans_frame[i, 2]))
  # for each row evaluated, compare to every other row in trans_frame
  for (k in 1:nrow(trans_frame[-i, ])) {
    trans_vec_b <- c(as.character(trans_frame[-i, ][k, 1]), as.character(trans_frame[-i, ][k, 2]))
    if (setequal(trans_vec_a, trans_vec_b)) {
      # store the index to remove
      rem_indx[i] <- i
    }
  }
}
This gives you a vector of the indexes that should be removed (because they are duplicates/inverted):
duped_trans <- trans_frame[rem_indx[!is.na(rem_indx)], ]
duped_trans
We can see that it identified 2 transactions as duplicates/inverts.
Now we can keep only the non-duplicated transactions:
deduped_trans <- trans_frame[-rem_indx[!is.na(rem_indx)], ]
The problem, of course, is that the algorithm above is terribly inefficient. The grocery dataset has only 463 transactions. For any reasonable number of transactions, you would want to vectorize the function.
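One way to vectorize, sketched here assuming the same `trans_frame` built above: construct an order-independent key for each rule by sorting its LHS/RHS pair, then let `duplicated()` flag repeats.

```r
# Vectorized sketch (assumes trans_frame as built above).
# Sorting the LHS/RHS pair makes {A} => {B} and {B} => {A} produce the
# same key, so duplicated() flags the second member of each inverted pair.
pair_key <- apply(trans_frame[, c("lhs", "rhs")], 1,
                  function(x) paste(sort(as.character(x)), collapse = " => "))
deduped_trans <- trans_frame[!duplicated(pair_key), ]
```

Note one behavioral difference: the loop above marks both members of an inverted pair for removal, while `duplicated()` keeps the first occurrence, so one rule from each pair survives, which is usually what you want.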
The problem is your dataset, not the algorithm. In your results you see many rules with a count of 1 (the item combination appears in a single transaction) and a confidence of 1 for both the rule and its "inverse." This means that you need more data and that you should increase the minimum support.
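For instance, raising the thresholds suppresses rules backed by a single transaction; a sketch, where `trans` is a placeholder for your transactions object and the threshold values are illustrative:

```r
library(arules)
# Higher support/confidence filter out one-off co-occurrences;
# minlen = 2 also drops the empty-LHS rules seen in Solution B's output.
rules <- apriori(trans,
                 parameter = list(support = 0.05, confidence = 0.8, minlen = 2))
```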
If you still want to get rid of these "duplicate" rules efficiently, you can do the following:
> library(arules)
> data(Groceries)
> rules <- apriori(Groceries, parameter = list(support = 0.001))
> rules
set of 410 rules
> gi <- generatingItemsets(rules)
> d <- which(duplicated(gi))
> rules[-d]
set of 385 rules
The code keeps only the first rule from each set of rules generated by exactly the same itemset.
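If you care which rule of each pair survives, you can sort by a quality measure before de-duplicating. A sketch along those lines; note that subsetting with `!duplicated(gi)` also sidesteps the R pitfall where `rules[-which(...)]` returns an empty set when `which()` yields `integer(0)`:

```r
library(arules)
data(Groceries)
rules <- apriori(Groceries, parameter = list(support = 0.001))

# sort by confidence first, so duplicated() keeps the stronger rule of each pair
rules <- sort(rules, by = "confidence")
gi <- generatingItemsets(rules)
rules_pruned <- rules[!duplicated(gi)]
```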