R arulesSequences - which frequent sequences are present in a transaction? A more generic approach wanted

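In an earlier question, I asked how to extract the so-called tidList, which records whether each frequent sequence found is present in each of the transactions used to mine those sequences. More specifically: how can I extract a boolean matrix (indicating the presence or absence of each sequence) whose rows are in the same order as the original transaction dataset?
In the end, this turned out to be very easy by using the transactionInfo attribute of the tidList stored in the object of class sequences.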

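New question
This question is slightly different from the previous one: given a set of frequent sequences, how can I 'score' new transactions? That is, given an object of class sequences, how do I obtain an object of class tidLists from a new object of class transactions?

To illustrate, I put together an example using a toy dataset: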

library(arules)
library(arulesSequences)
library(stringr)

#Function used to convert character string into an object of type transactions. 
#Source: https://github.com/cran/clickstream/blob/master/R/Clickstream.r.
as.transactions <- function(clickstreamList) {
  # One row per event: repeat each sequence's name and index, and number its events.
  transactionID   <- unlist(lapply(seq_along(clickstreamList), FUN = 
                            function(x) rep(names(clickstreamList)[x], length(clickstreamList[[x]]))), use.names = FALSE)
  sequenceID      <- unlist(lapply(seq_along(clickstreamList), FUN = 
                            function(x) rep(x, length(clickstreamList[[x]]))))
  eventID         <- unlist(lapply(clickstreamList, FUN = function(x) 
                            seq_along(x)), use.names = FALSE) 
  transactionInfo <- data.frame(transactionID, sequenceID, eventID)

  # Each event becomes a single-item transaction.
  tr <- as(as.data.frame(unlist(clickstreamList, use.names = FALSE)), "transactions")

  transactionInfo(tr) <- transactionInfo
  itemInfo(tr)$labels <- itemInfo(tr)$levels 
  return(tr)
}

#Dataset to mine frequent sequences from
data_mine_freq_seq <- data.frame(id = 1:10,
                                 transaction = c("A B B A",
                                                 "A B C B D C B B B F A",
                                                 "A A B",
                                                 "B A B A",
                                                 "A B B B B",
                                                 "A A A B",
                                                 "A B B A B B",
                                                 "E F F A C B D A B C D E",
                                                 "A B B A B",
                                                 "A B")) 

#Convert data to list containing character vectors
data_for_fseq_mining        <- str_split(string = data_mine_freq_seq$transaction, pattern = " ")  
#Include identifiers as names 
names(data_for_fseq_mining) <- data_mine_freq_seq$id
#Convert to object of type transactions
data_for_fseq_mining_trans  <- as.transactions(clickstreamList = data_for_fseq_mining)

#Mine frequent sequences with cspade, given some parameters.
sequences <- cspade(data      = data_for_fseq_mining_trans, 
                    parameter = list(support = 0.10, maxlen = 4, maxgap = 2),
                    control   = list(tidList = TRUE, verbose = TRUE))

#Create a data frame that contains all sequences and their support (167 sequences in total).
sequences_df <- cbind(sequence  = labels(sequences), 
                      support   = sequences@quality)

Now I create a new dataset that contains just one transaction:

data_score             <- data.frame(id = 11, transaction = "A B B C D A")
#Convert data to list containing character vectors
data_score_list        <- str_split(string = data_score$transaction, pattern = " ")  
#Include identifier as name
names(data_score_list) <- data_score$id
#Convert to object of type transactions
data_score_trans       <- as.transactions(clickstreamList =  data_score_list)

How can I find out which of the frequent sequences contained in the object 'sequences' occur in 'data_score_trans'?

Edit: I tried the following line of code:

supportingTransactions(x = sequences, transactions = data_score_trans)

which produces the expected and desired result:

tidLists in sparse format with
 167 items/itemsets (rows) and
 1 transactions (columns)

However, when the new transaction contains an element that does not occur in the original dataset, an error is raised:

#Added a 'G' at the end of the transaction. Element 'G' is not an element in
#'data_mine_freq_seq'.
data_score             <- data.frame(id = 11, transaction = "A B B C D A G")
#Convert data to list containing character vectors
data_score_list        <- str_split(string = data_score$transaction, pattern = " ")  
#Include identifier as name
names(data_score_list) <- data_score$id
#Convert to object of type transactions
data_score_trans       <- as.transactions(clickstreamList =  data_score_list)

#Score 'data_score_trans' using 'sequences' again:
supportingTransactions(x = sequences, transactions = data_score_trans)

Error in rbind(deparse.level, ...) : 
  numbers of columns of arguments do not match

How can this be solved?
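One direction that might avoid this error is to drop unseen items before building the transactions object. This assumes the error is triggered by item labels that were not present during mining, which is not verified here. The sketch below uses base R only; the helper name drop_unknown_items and the hard-coded known_items vector are hypothetical and only serve the illustration.

```r
# Items seen while mining; in the toy data above these are the letters A to F.
known_items <- c("A", "B", "C", "D", "E", "F")

# New transaction containing the unseen item "G".
data_score_list <- list("11" = c("A", "B", "B", "C", "D", "A", "G"))

# Hypothetical helper: keep only events whose item was seen during mining.
drop_unknown_items <- function(clickstreamList, known) {
  lapply(clickstreamList, function(events) events[events %in% known])
}

data_score_known <- drop_unknown_items(data_score_list, known_items)
print(data_score_known[["11"]])
```

The filtered list could then be passed to as.transactions as before. Note that removing an event from the middle of a transaction shifts the positions of the events after it, which matters whenever a gap constraint such as 'maxgap' is in play.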

I came up with a workaround that relies on regular expressions. I defined the following function:

score_pattern <- function(pattern, events){

  # Extract the elements of the sequence label, e.g. "<{B},{A}>" gives "B" and "A".
  regex_elements <- str_extract_all(string = pattern, pattern = "\\{.*?\\}")
  regex_elements <- str_replace_all(string = unlist(regex_elements), 
                                    pattern = "\\{|\\}", replacement = "")
  expr           <- ""

  # Build a regex that matches the elements in order, with any events in between.
  for(i in seq_along(regex_elements)){

    if(i == 1){
      expr <- paste0(expr, "(^| )", regex_elements[i]) 
    } else {
      expr <- paste0(expr, "( | .*? )", regex_elements[i]) 
    } 
  }

  expr <- paste0(expr, "( |$)")

  print(expr)
  # 1 if the pattern occurs in the transaction, 0 otherwise.
  score <- as.numeric(grepl(pattern = expr, x = events))
  return(score)

}

To illustrate its use, here is an example that takes one sequence from column 'sequence' of the object 'sequences_df' and the transaction data from column 'transaction' of 'data_score':

score_pattern(pattern = "<{B},{A}>", events = data_score$transaction)
[1] "(^| )B( | .*? )A( |$)"
[1] 1

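The function returns a numeric vector of 0s and 1s indicating whether the sequence is present in each supplied transaction (1 = yes, 0 = no).

Although this works, it only covers the case where no restriction is placed on the maximum gap between consecutive elements of a sequence: the generated regular expression has no notion of 'maxgap'. Conclusion: this only works when the 'maxgap' parameter of the cspade algorithm is not set.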
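A gap-aware variant could replace the regular expression with an explicit position check. The sketch below is a rough illustration in base R, not the cspade matching logic itself. It assumes every sequence element holds a single item (as in the toy data), that events are separated by single spaces, and that 'maxgap' means the maximum difference in event position between consecutive elements. The function name score_pattern_maxgap is hypothetical.

```r
# Hypothetical helper: does `pattern` occur in `events` with at most `maxgap`
# positions between consecutive pattern elements?
score_pattern_maxgap <- function(pattern, events, maxgap = Inf) {
  # Pull the single-item elements out of a label such as "<{B},{A}>".
  elements <- regmatches(pattern, gregexpr("[^<>{},]+", pattern))[[1]]
  # Split a space-separated transaction string into its events.
  ev <- strsplit(events, " ", fixed = TRUE)[[1]]

  # Positions at which the first element can be matched.
  reachable <- which(ev == elements[1])
  for (el in elements[-1]) {
    candidates <- which(ev == el)
    # Keep a candidate if some earlier match lies 1 to maxgap positions before it.
    keep <- vapply(candidates, function(p) {
      any(reachable < p & p - reachable <= maxgap)
    }, logical(1))
    reachable <- candidates[keep]
    if (length(reachable) == 0) return(0)
  }
  as.numeric(length(reachable) > 0)
}

# In "A B B C D A" the closest B before the final A is 3 positions away.
score_pattern_maxgap("<{B},{A}>", "A B B C D A", maxgap = 2)  # 0
score_pattern_maxgap("<{B},{A}>", "A B B C D A", maxgap = 3)  # 1
```

Tracking every reachable position, rather than only the first match, avoids missing an occurrence that starts at a later position. With the default maxgap = Inf the function should agree with the regular-expression version for single-item elements.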