How to extract the longest apriori rule (association rule)
Given the following example:
library("arules")
data("Adult")
## Mine association rules.
rules <- apriori(Adult, parameter = list(supp = 0.5, conf = 0.9, target = "rules"))
labels(rules)
you will see, among others, the following rules:
[5] "{sex=Male} => {capital-gain=None}"
[20] "{race=White,sex=Male} => {capital-gain=None}"
[22] "{sex=Male,native-country=United-States} => {capital-gain=None}"
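These rules have the same RHS but different LHSs.
I only want to keep the rules with the longest LHS and ignore the shorter ones.
In the example above, I want to drop rule [5] because it is contained in [20] and [22] ({sex=Male} is part of the LHS of both). I only want to keep the longest rules (in other examples, the longest rules may contain 3 or more components).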
Use is.subset to get a logical matrix, and use that matrix to locate the rules that are not subsets of another rule:
subsets <- is.subset(rules, proper = TRUE)
subsets[lower.tri(subsets, diag=TRUE)] <- 0 # set lower triangle to 0
notsubsets <- rowSums(subsets) == 0L
labels(rules[notsubsets])
# [1] "{capital-gain=None,hours-per-week=Full-time} => {capital-loss=None}"
# [2] "{capital-loss=None,hours-per-week=Full-time} => {capital-gain=None}"
# [3] "{race=White,sex=Male} => {capital-gain=None}"
# [4] "{race=White,sex=Male,native-country=United-States} => {capital-loss=None}"
# [5] "{race=White,sex=Male,capital-loss=None} => {native-country=United-States}"
# [6] "{sex=Male,capital-loss=None,native-country=United-States} => {race=White}"
# [7] "{sex=Male,capital-gain=None,native-country=United-States} => {capital-loss=None}"
# [8] "{workclass=Private,race=White,native-country=United-States} => {capital-loss=None}"
# [9] "{workclass=Private,race=White,capital-loss=None} => {native-country=United-States}"
#[10] "{workclass=Private,race=White,capital-gain=None} => {capital-loss=None}"
#[11] "{workclass=Private,race=White,capital-loss=None} => {capital-gain=None}"
#[12] "{workclass=Private,capital-gain=None,native-country=United-States} => {capital-loss=None}"
#[13] "{workclass=Private,capital-loss=None,native-country=United-States} => {capital-gain=None}"
#[14] "{race=White,capital-gain=None,native-country=United-States} => {capital-loss=None}"
#[15] "{race=White,capital-loss=None,native-country=United-States} => {capital-gain=None}"
#[16] "{race=White,capital-gain=None,capital-loss=None} => {native-country=United-States}"
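is.subset takes the right-hand side into account when deciding whether one rule is contained in another, and that is the problem with this approach. As noted in the comments, the approach above misses the rule {sex=Male,native-country=United-States} => {capital-gain=None}: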
labels(rules[c(22, 43)])
#[1] "{sex=Male,native-country=United-States} => {capital-gain=None}"
#[2] "{sex=Male,capital-gain=None,native-country=United-States} => {capital-loss=None}"
is.subset(rules[22], rules[43])
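To see why rule 22 gets flagged, the same check can be reproduced by hand on plain character vectors. This is only a sketch: it assumes that is.subset() on rules compares the full item set of each rule (LHS and RHS together), as the explanation above describes, and the `r22` and `r43` lists are hand-written stand-ins for rules[22] and rules[43], not arules objects.

```r
# Hand-written stand-ins for rules[22] and rules[43] (hypothetical helper objects).
r22 <- list(lhs = c("sex=Male", "native-country=United-States"),
            rhs = "capital-gain=None")
r43 <- list(lhs = c("sex=Male", "capital-gain=None", "native-country=United-States"),
            rhs = "capital-loss=None")

# Comparing all items (LHS and RHS together): every item of r22 also occurs in r43,
# so r22 looks like a subset of r43 even though the two rules predict different things.
full_subset <- all(c(r22$lhs, r22$rhs) %in% c(r43$lhs, r43$rhs))

# Comparing only the LHS, and only when the RHS is identical: r22 is not covered by r43.
same_rhs_subset <- identical(r22$rhs, r43$rhs) && all(r22$lhs %in% r43$lhs)

print(c(full_subset = full_subset, same_rhs_subset = same_rhs_subset))
```

The second comparison is closer to what the question asks for, because it only treats a rule as "shorter" when a longer rule leads to the same RHS.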
To catch these cases, you could use <= 1L instead of == 0L, but then you also get false positives ("{sex=Male,capital-gain=None} => {capital-loss=None}" is kept even though it is a subset of another rule).
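One possible way around both the missed rules and the false positives is to compare only the left-hand sides, and only among rules that share the same right-hand side. The following is a minimal sketch of that idea, not the method from the answer above. It assumes the arules accessors lhs() and rhs() and that is.subset() accepts two itemMatrix arguments; `lhs_sub`, `same_rhs`, `dominated` and `longest` are names made up for this illustration. as.matrix() is used so the code does not depend on whether is.subset() returns a dense or a sparse matrix.

```r
library(arules)
data("Adult")

rules <- apriori(Adult,
                 parameter = list(supp = 0.5, conf = 0.9, target = "rules"),
                 control = list(verbose = FALSE))

# lhs_sub[i, j] is TRUE when the LHS of rule i is a proper subset of the LHS of rule j.
# as.matrix() densifies the result in case is.subset() returns a sparse matrix.
lhs_sub <- as.matrix(is.subset(lhs(rules), lhs(rules), proper = TRUE))

# same_rhs[i, j] is TRUE when rules i and j have the same RHS.
rhs_lab <- labels(rhs(rules))
same_rhs <- outer(rhs_lab, rhs_lab, "==")

# A rule is dropped when some other rule has the same RHS and a strictly larger LHS
# that contains its LHS.
dominated <- rowSums(lhs_sub & same_rhs) > 0

longest <- rules[!dominated]
labels(longest)
```

The dense matrices are fine for a rule set of this size; for many thousands of rules the comparison would need to stay sparse or be done per RHS group.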