Odd results from cSPADE in R (arulesSequences) with large data. Can I force numpart to 1? Are there risks?

Question

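I've been trying to use cSPADE on a dataset with about 7 million records in my transaction file (7 million unique sequenceID x eventID pairs). The support values I get when running cSPADE on this dataset appear to be completely wrong. However, when I use ~86,000 records (more or less the head of the same file), the results look right. I noticed that up to that size the verbose log reports that only 1 partition is used, whereas when I try ~850,000 records, 3 partitions are used.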

Verbose output when using 100,000 records (with reasonable-looking results):

> s1 <- cspade(trans, parameter = list(support = 0.1,maxlen=1), control = list(verbose = TRUE))

parameter specification:
support : 0.1
maxsize :  10
maxlen  :   1

algorithmic control:
bfstype  : FALSE
verbose  :  TRUE
summary  : FALSE
tidLists : FALSE

preprocessing ... 1 partition(s), 1.98 MB [0.7s]
mining transactions ... 0 MB [0.21s]
reading sequences ... [0.03s]

total elapsed time: 0.94s

> summary(s1)
set of 14 sequences with

most frequent items:
      A       B       C       D       E (Other) 
      2       2       1       1       1       8 

.
.
.
summary of quality measures:
    support      
 Min.   :0.1306  
 1st Qu.:0.3701  
 Median :0.7021  
 Mean   :0.5773  
 3rd Qu.:0.7184  
 Max.   :0.9903  

includes transaction ID lists: FALSE 

mining info:
  data ntransactions nsequences support
 trans         83686      10059     0.1

Verbose output when using 1,000,000 records (the results look wrong):

> s1 <- cspade(trans, parameter = list(support = 0.1,maxlen=1), control = list(verbose = TRUE))

parameter specification:
support : 0.1
maxsize :  10
maxlen  :   1

algorithmic control:
bfstype  : FALSE
verbose  :  TRUE
summary  : FALSE
tidLists : FALSE

preprocessing ... 3 partition(s), 19.55 MB [4.6s]
mining transactions ... 0 MB [0.6s]
reading sequences ... [0.01s]

total elapsed time: 5.19s

> summary(s1)

set of 0 sequences with

most frequent items:
integer(0)

most frequent elements:
integer(0)

element (sequence) size distribution:
< table of extent 0 >

sequence length distribution:
< table of extent 0 >

summary of quality measures:
< table of extent 0 >

includes transaction ID lists: FALSE 

mining info:
  data ntransactions nsequences support
 trans        826830      96238     0.1

I found that I can set the number of partitions to 1 when calling cSPADE, which fixes the problem. However, cSPADE does emit a warning:

s1 <- cspade(trans, parameter = list(support = 0.1,maxlen=1), control = list(verbose = TRUE,numpart=1))

Warning message: In cspade(trans, parameter = list(support = 0.1, maxlen = 1), control = list(verbose = TRUE,  :  'numpart' less than recommended

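Do I need to heed this warning? What are the downsides of setting numpart=1 (forcing the number of partitions to 1)? If there are any, is there a way to get correct results without controlling this parameter?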

Answer 1

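For the benefit of others who may run into the same issue: I ended up emailing the author of the package. He said that this is not a known issue and recommended that I stick with numpart=1.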
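For anyone who wants to sanity-check the numbers before trusting either setting, here is a minimal base-R sketch that recomputes single-item support directly from the raw records, without calling cspade at all. It assumes the transaction file can be loaded as a data frame with one row per (sequenceID, eventID, item); the column names, the toy data, and the helper item_support are made up for illustration and are not part of arulesSequences. It only covers one-item sequences, so with maxlen=1 it will not reproduce multi-item elements, but those one-item supports should match what cspade reports regardless of how many partitions were used.

```r
# Hypothetical toy data: one row per (sequenceID, eventID, item).
# Replace this with the real transaction file read into a data frame.
tx <- data.frame(
  sequenceID = c(1, 1, 1, 2, 2, 3, 4, 4),
  eventID    = c(1, 1, 2, 1, 2, 1, 1, 2),
  item       = c("A", "B", "A", "A", "C", "B", "A", "B"),
  stringsAsFactors = FALSE
)

# Support of a single item = share of distinct sequences that contain
# the item at least once (repeats within a sequence count only once).
item_support <- function(tx) {
  n_seq  <- length(unique(tx$sequenceID))
  pairs  <- unique(tx[, c("sequenceID", "item")])
  counts <- table(pairs$item)
  support <- setNames(as.numeric(counts) / n_seq, names(counts))
  sort(support, decreasing = TRUE)
}

sup <- item_support(tx)

# Keep only items at or above the same threshold passed to cspade.
print(sup[sup >= 0.1])
```

If the one-item supports from a default run disagree with this check while a numpart=1 run agrees, that points at the partitioned run rather than at the data.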

Tags: r, sequence, apriori, arules