Apriori algorithm in R source code
I am trying to write the Apriori algorithm in R. As a first step, I want to count the frequency of each item in a list. My initial code is:
a_list <- list(c("I1","I2","I5"),
c("I2","I4"),
c("I2","I3"),
c("I1","I2","I4"),
c("I1","I3"),
c("I2","I3"),
c("I1","I3"),
c("I1","I2","I3","I5"),
c("I1","I2","I3"))
sapply(a_list, function(x) length(x))
un <- unique(unlist(a_list))
nm <- lapply(un, function(x) sapply(a_list, function(y) sum(y == x)))
names(nm) <- un
nm
The result I get is:
> nm
$I1
[1] 1 0 0 1 1 0 1 1 1
$I2
[1] 1 1 1 1 0 1 0 1 1
$I5
[1] 1 0 0 0 0 0 0 1 0
$I4
[1] 0 1 0 1 0 0 0 0 0
$I3
[1] 0 0 1 0 1 1 1 1 1
However, I would like it arranged as follows (perhaps as a matrix or array, so that I can process it further):
> nm
I1 6
I2 7
I3 6
I4 2
I5 2
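Each item should show its frequency count, with the items in alphabetical order. Is there a way to do this? I tried cbind, apply and relist, but have not found a solution. Thanks.
Update: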
library(dplyr)
a_list <- list(c("I1","I2","I5"),
c("I2","I4"),
c("I2","I3"),
c("I1","I2","I4"),
c("I1","I3"),
c("I2","I3"),
c("I1","I3"),
c("I1","I2","I3","I5"),
c("I1","I2","I3"))
a <- unlist(a_list) %>% table %>% data.frame
a
minsupport = 3
b <- data.frame(a)
c <- b[b$Freq > minsupport,]
c
Now the result I get is:
> a
. Freq
1 I1 6
2 I2 7
3 I3 6
4 I4 2
5 I5 2
> c
. Freq
1 I1 6
2 I2 7
3 I3 6
How do I then build the combinations "I1,I2", ..., "I2,I3" and check them by scanning the original list?
Update:
I tried combn as follows, and it outputs a matrix.
> combn(c$.,2)
[,1] [,2] [,3]
[1,] I1 I1 I2
[2,] I2 I3 I3
Levels: I1 I2 I3 I4 I5
I then modified it to:
d <- combn(c$.,2)
result <- unique(sapply(d,function(i) paste(d[,i],collapse=",")))
result
My result is:
> result
[1] "I1,I2" "I1,I3" "I2,I3"
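The next step is to count the frequency of the itemsets above in the original "a_list". It might be better to output them as
""I1","I2"", ""I1","I3"", ""I2","I3""
so that they can be compared with the original list.
How do I get the frequency of the itemsets in this matrix from the original a_list?
The Apriori algorithm requires scanning for all itemsets whose support is not less than the minimum support, starting with size 1 (i.e. "I1", "I2", ..., "I5" in a_list), then size 2 (here "I1,I2", "I1,I3", "I2,I3"), and continuing to larger sizes where applicable (e.g. "I1,I2,I3").
Update:
Now I can find matches for one specific pattern at a time, e.g. ("I1","I2") or ("I1","I3").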
toMatch <- c("I1","I2")
matches <- grepRaw(toMatch,a_list,ignore.case = TRUE)
matches
Result:
> matches
[1] 4
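What remains unsolved is matching all the patterns in "result" in one go (in the example above I typed the pattern by hand, but it should be taken from "result") and producing output in the following form: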
Itemset Freq
""I1","I2"" 4
""I1","I3"" 4
""I2","I3"" 4
The dplyr package makes this operation clean.
library(dplyr)
unlist(a_list) %>% table %>% data.frame
unlist.a_list. Freq
1 I1 6
2 I2 7
3 I3 6
4 I4 2
5 I5 2
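If you would rather not depend on dplyr, roughly the same table can be built with base R alone. This is only a minimal sketch: the names tab and item_counts and the column name Item are my own choices, not something from the question. It also counts occurrences, which equals the number of transactions containing the item only as long as no transaction lists the same item twice.

```r
a_list <- list(c("I1","I2","I5"), c("I2","I4"), c("I2","I3"),
               c("I1","I2","I4"), c("I1","I3"), c("I2","I3"),
               c("I1","I3"), c("I1","I2","I3","I5"), c("I1","I2","I3"))

# table() counts each distinct item and orders the names alphabetically
tab <- table(unlist(a_list))

# Put the counts into a data frame with explicit (illustrative) column names
item_counts <- data.frame(Item = names(tab),
                          Freq = as.vector(tab),
                          stringsAsFactors = FALSE)
item_counts
```

Having a real column name such as Item avoids having to refer to the awkward "." column later on.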
Update:
I am not sure what you are looking for, but here is a way to get the combinations:
Cols <- paste0("I",1:3)
p <- length(Cols)
id <- unlist(lapply(1:p, function(i) combn(1:p,i,simplify=F)), recursive=F)
formulas <- sapply(id,function(i) paste(Cols[i],collapse=","))
> formulas
[1] "I1" "I2" "I3" "I1,I2" "I1,I3" "I2,I3" "I1,I2,I3"
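For Apriori you usually need only the combinations of one size at a time, built from the items that survived the previous pass. A small sketch of that, assuming the frequent single items are I1, I2 and I3 as in the question's output; the names frequent_items, candidates and labels are hypothetical:

```r
# Frequent single items, taken from the question's filtered table
frequent_items <- c("I1", "I2", "I3")

# All 2-item candidates, kept as character vectors rather than pasted strings
candidates <- combn(frequent_items, 2, simplify = FALSE)

# Comma-separated labels, used only for display
labels <- vapply(candidates, paste, character(1), collapse = ",")
labels
```

Keeping each candidate as a character vector makes it easy to test later whether a transaction contains all of its items.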
Update 2:
This should do what you need:
library(dplyr)
a_list <- list(c("I1","I2","I5"),
c("I2","I4"),
c("I2","I3"),
c("I1","I2","I4"),
c("I1","I3"),
c("I2","I3"),
c("I1","I3"),
c("I1","I2","I3","I5"),
c("I1","I2","I3"))
a <- unlist(a_list) %>% table %>% data.frame
minsupport = 3
b <- data.frame(a)
c <- b[b$Freq > minsupport,]
d <- combn(c$.,2)
result <- unique(sapply(d,function(i) paste(d[,i],collapse=",")))
> result
[1] "I1,I2" "I1,I3" "I2,I3"
Then collapse your a_list so that it has the same form as result:
a.new.list <- sapply(a_list, paste, collapse=",")
> a.new.list
[1] "I1,I2,I5" "I2,I4" "I2,I3" "I1,I2,I4" "I1,I3" "I2,I3" "I1,I3"
[8] "I1,I2,I3,I5" "I1,I2,I3"
Use the match function and loop over all the elements of result:
hits <- sapply(1:length(result), function(j) match(a.new.list,result[j]))
colnames(hits) <- result
rownames(hits) <- a.new.list
> hits
I1,I2 I1,I3 I2,I3
I1,I2,I5 NA NA NA
I2,I4 NA NA NA
I2,I3 NA NA 1
I1,I2,I4 NA NA NA
I1,I3 NA 1 NA
I2,I3 NA NA 1
I1,I3 NA 1 NA
I1,I2,I3,I5 NA NA NA
I1,I2,I3 NA NA NA
> apply(hits,2, sum, na.rm=TRUE)
I1,I2 I1,I3 I2,I3
0 2 2
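Note that match compares whole strings, so the counts above include only transactions that are exactly equal to an itemset; that is why "I1,I2" comes out as 0. If what you want is the number of transactions that contain the itemset (the 4, 4, 4 shown in the question), a subset test with %in% may be closer. The following is a rough sketch under that assumption, with candidates, support and itemset_freq being names I made up for illustration:

```r
a_list <- list(c("I1","I2","I5"), c("I2","I4"), c("I2","I3"),
               c("I1","I2","I4"), c("I1","I3"), c("I2","I3"),
               c("I1","I3"), c("I1","I2","I3","I5"), c("I1","I2","I3"))

# Candidate 2-itemsets as character vectors (the same pairs as in result)
candidates <- list(c("I1","I2"), c("I1","I3"), c("I2","I3"))

# Support of an itemset = number of transactions containing all of its items
support <- vapply(candidates, function(items) {
  sum(vapply(a_list, function(tr) all(items %in% tr), logical(1)))
}, integer(1))

itemset_freq <- data.frame(
  Itemset = vapply(candidates, paste, character(1), collapse = ","),
  Freq = support,
  stringsAsFactors = FALSE
)
itemset_freq
```

The same function works unchanged for larger itemsets such as c("I1","I2","I3"), so it could be reused for each later pass of the algorithm.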