Apriori algorithm in R source code

I am trying to write the Apriori algorithm in R. First, I want to count the frequency of each item in the list. My initial code is as follows:

a_list <- list(c("I1","I2","I5"),
           c("I2","I4"),
           c("I2","I3"),
           c("I1","I2","I4"),
           c("I1","I3"),
           c("I2","I3"),
           c("I1","I3"),
           c("I1","I2","I3","I5"),
           c("I1","I2","I3"))
sapply(a_list, function(x) length(x))
un <- unique(unlist(a_list))
nm <- lapply(un, function(x) sapply(a_list, function(y) sum(y == x)))
names(nm) <- un
nm

The result I get is:

> nm

$I1
[1] 1 0 0 1 1 0 1 1 1

$I2
[1] 1 1 1 1 0 1 0 1 1

$I5
[1] 1 0 0 0 0 0 0 1 0

$I4
[1] 0 1 0 1 0 0 0 0 0

$I3
[1] 0 0 1 0 1 1 1 1 1

However, I would like it arranged as follows (perhaps rearranged into a matrix or array so that I can process it further):

> nm

I1 6
I2 7
I3 6
I4 2
I5 2

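Each item should be shown with its frequency count, sorted alphabetically. Is there a way to achieve this? I have tried cbind, apply, and relist, but have not found a solution. Thanks.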

Update:

library(dplyr)
a_list <- list(c("I1","I2","I5"),
           c("I2","I4"),
           c("I2","I3"),
           c("I1","I2","I4"),
           c("I1","I3"),
           c("I2","I3"),
           c("I1","I3"),
           c("I1","I2","I3","I5"),
           c("I1","I2","I3"))
a <- unlist(a_list) %>% table %>% data.frame
a
minsupport <- 3
b <- data.frame(a)
# Keep items whose count is not less than the minimum support
c <- b[b$Freq >= minsupport,]
c

Now the result I get is:

> a
   . Freq
1 I1    6
2 I2    7
3 I3    6
4 I4    2
5 I5    2

> c
   . Freq
1 I1    6
2 I2    7
3 I3    6

How can I then form the combinations "I1,I2", ..., "I2,I3" by scanning the original list?

Update: I tried combn as follows, and it outputs a matrix.

> combn(c$.,2)
     [,1] [,2] [,3]
[1,] I1   I1   I2  
[2,] I2   I3   I3  
Levels: I1 I2 I3 I4 I5

I then modified it further to:

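# as.character() avoids indexing columns by factor codes further down
d <- combn(as.character(c$.),2)
# Paste each column (one pair of items) into a single label
result <- apply(d, 2, paste, collapse=",")
result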

My result is:

> result
[1] "I1,I2" "I1,I3" "I2,I3"

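The next step is to count the frequency of the above itemsets in the original "a_list". It might be better to output them as

""I1","I2"", ""I1","I3"", ""I2","I3""

so that they can be compared with the original list.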

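How can I get the frequency of the itemsets in this matrix from the original a_list? The Apriori algorithm requires scanning for all itemsets whose support is not less than the minimum support, starting with size 1 (i.e. "I1", "I2", ..., "I5" in a_list), then size 2 (i.e. "I1,I2", "I1,I3", "I2,I3" in this case), and continuing if applicable (e.g. "I1,I2,I3").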
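The level-by-level scan described above could be sketched roughly as follows. This is a minimal, unoptimized illustration rather than a reference implementation: the names support_of, candidates, keep and frequent are made up for this example, and minsupport = 3 is the value used elsewhere in this post.

```r
a_list <- list(c("I1","I2","I5"), c("I2","I4"), c("I2","I3"),
               c("I1","I2","I4"), c("I1","I3"), c("I2","I3"),
               c("I1","I3"), c("I1","I2","I3","I5"), c("I1","I2","I3"))
minsupport <- 3

# Hypothetical helper: number of transactions containing every item in `items`
support_of <- function(items) {
  sum(vapply(a_list, function(tx) all(items %in% tx), logical(1)))
}

# Level 1 candidates: every distinct item, sorted alphabetically
candidates <- as.list(sort(unique(unlist(a_list))))
frequent <- integer(0)   # named vector: itemset label -> support count
k <- 1                   # size of the itemsets at the current level

while (length(candidates) > 0) {
  counts <- vapply(candidates, support_of, integer(1))
  is_frequent <- counts >= minsupport
  keep <- candidates[is_frequent]
  if (length(keep) == 0) break

  keep_labels <- vapply(keep, paste, character(1), collapse = ",")
  frequent[keep_labels] <- counts[is_frequent]

  # Build size k+1 candidates from the items that are still frequent
  items <- sort(unique(unlist(keep)))
  if (length(items) <= k) break
  candidates <- combn(items, k + 1, simplify = FALSE)

  # Apriori pruning: keep a candidate only if all its size-k subsets are frequent
  candidates <- Filter(function(cand) {
    all(combn(cand, k, paste, collapse = ",") %in% keep_labels)
  }, candidates)
  k <- k + 1
}

frequent
```

With the data above, the loop stops at size 3 because "I1,I2,I3" occurs in only two transactions, which is below the minimum support.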

Update: Now I can find matches for one specific pattern at a time, e.g. ("I1","I2") or ("I1","I3").

toMatch <- c("I1","I2")
matches <- grepRaw(toMatch,a_list,ignore.case = TRUE)
matches

Result:

> matches
[1] 4
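Note that grepRaw reports a match position within raw bytes, so the 4 above is not a count of transactions. A minimal sketch of counting the transactions that contain every item of a pattern, assuming set membership via %in% is what is wanted (the helper name count_support is hypothetical):

```r
a_list <- list(c("I1","I2","I5"), c("I2","I4"), c("I2","I3"),
               c("I1","I2","I4"), c("I1","I3"), c("I2","I3"),
               c("I1","I3"), c("I1","I2","I3","I5"), c("I1","I2","I3"))

# Hypothetical helper: how many transactions contain all items in `items`
count_support <- function(items, transactions) {
  sum(vapply(transactions, function(tx) all(items %in% tx), logical(1)))
}

toMatch <- c("I1","I2")
count_support(toMatch, a_list)
```

For this toMatch the count is 4 (transactions 1, 4, 8 and 9), which happens to equal the grepRaw value.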

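The remaining problem is to match all the patterns in "result" at once (I typed the pattern manually in the example above, but it needs to be taken from "result") and to output the counts in the following form: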

Itemset     Freq
""I1","I2"" 4     
""I1","I3"" 4
""I2","I3"" 4

The dplyr package makes this straightforward.

library(dplyr)
unlist(a_list) %>% table %>% data.frame

  unlist.a_list. Freq
1             I1    6
2             I2    7
3             I3    6
4             I4    2
5             I5    2
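For comparison, roughly the same table can be produced in base R without dplyr. This is only a sketch: the column name Item and the variable names counts and item_freq are chosen here for illustration.

```r
a_list <- list(c("I1","I2","I5"), c("I2","I4"), c("I2","I3"),
               c("I1","I2","I4"), c("I1","I3"), c("I2","I3"),
               c("I1","I3"), c("I1","I2","I3","I5"), c("I1","I2","I3"))

# table() counts each distinct item; its names come out in sorted order
counts <- table(unlist(a_list))

# Item and Freq are illustrative column names
item_freq <- data.frame(Item = names(counts),
                        Freq = as.vector(counts),
                        stringsAsFactors = FALSE)
item_freq
```

Storing Item as character rather than factor avoids surprises when the values are later passed to functions such as combn.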

Update:

I'm not sure exactly what you are looking for, but here is a way to get the combinations:

Cols <- paste0("I",1:3)
p <- length(Cols)
id <- unlist(lapply(1:p, function(i) combn(1:p,i,simplify=FALSE)), recursive=FALSE)
formulas <- sapply(id,function(i) paste(Cols[i],collapse=","))

> formulas
[1] "I1"       "I2"       "I3"       "I1,I2"    "I1,I3"    "I2,I3"    "I1,I2,I3"

Update 2:

This should do what you need:

library(dplyr)
a_list <- list(c("I1","I2","I5"),
           c("I2","I4"),
           c("I2","I3"),
           c("I1","I2","I4"),
           c("I1","I3"),
           c("I2","I3"),
           c("I1","I3"),
           c("I1","I2","I3","I5"),
           c("I1","I2","I3"))
a <- unlist(a_list) %>% table %>% data.frame
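minsupport <- 3
b <- data.frame(a)
# Keep items whose count is not less than the minimum support
c <- b[b$Freq >= minsupport,]
# as.character() avoids indexing columns by factor codes below
d <- combn(as.character(c$.),2)
# Paste each column (one pair of items) into a single label
result <- apply(d, 2, paste, collapse=",")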
> result
[1] "I1,I2" "I1,I3" "I2,I3"

Then collapse your a_list so that it looks like result:

a.new.list <- sapply(a_list, paste, collapse=",")
> a.new.list
[1] "I1,I2,I5"    "I2,I4"       "I2,I3"       "I1,I2,I4"    "I1,I3"       "I2,I3"       "I1,I3"      
[8] "I1,I2,I3,I5" "I1,I2,I3" 

Use the match function and loop over all the entries of result:

hits <- sapply(1:length(result), function(j) match(a.new.list,result[j]))
colnames(hits) <- result
rownames(hits) <- a.new.list
> hits
            I1,I2 I1,I3 I2,I3
I1,I2,I5       NA    NA    NA
I2,I4          NA    NA    NA
I2,I3          NA    NA     1
I1,I2,I4       NA    NA    NA
I1,I3          NA     1    NA
I2,I3          NA    NA     1
I1,I3          NA     1    NA
I1,I2,I3,I5    NA    NA    NA
I1,I2,I3       NA    NA    NA

> apply(hits,2, sum, na.rm=TRUE)
I1,I2 I1,I3 I2,I3 
0     2     2
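Because match compares whole strings, the hits above only count transactions that are exactly equal to an itemset, which is why I1,I2 comes out as 0. If the goal is instead the number of transactions that contain each itemset, as in the table the question asks for, one possible sketch splits each label back into its items and tests membership with %in%. The names itemsets, freq and support are illustrative.

```r
a_list <- list(c("I1","I2","I5"), c("I2","I4"), c("I2","I3"),
               c("I1","I2","I4"), c("I1","I3"), c("I2","I3"),
               c("I1","I3"), c("I1","I2","I3","I5"), c("I1","I2","I3"))
result <- c("I1,I2", "I1,I3", "I2,I3")

# Split each comma-separated label back into its individual items
itemsets <- strsplit(result, ",", fixed = TRUE)

# For each itemset, count the transactions that contain all of its items
freq <- vapply(itemsets, function(items) {
  sum(vapply(a_list, function(tx) all(items %in% tx), logical(1)))
}, integer(1))

support <- data.frame(Itemset = result, Freq = freq, stringsAsFactors = FALSE)
support
```

On this data each of the three pairs is contained in 4 transactions, matching the table requested in the question.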