Sequence of patterns in R sequence and events issues

Question

I am trying to mine frequent sequences in R (SPADE). I have the following dataset:

d1 <- c(1:10)
d2 <- c("nut", "bolt", "screw")
data <- data.frame(expand.grid(d1,d2))
data$status <- sample(c("a","b","c"), size = nrow(data), replace = TRUE)
colnames(data) <- c("day", "widget", "status")

   day widget status
1    1    nut      c
2    2    nut      b
3    3    nut      b
4    4    nut      b
5    5    nut      a
6    6    nut      a
7    7    nut      b
8    8    nut      c
9    9    nut      c
10  10    nut      b
11   1   bolt      a
12   2   bolt      b
...

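I cannot get the data into a format that seems to work with any of the available packages. I think the basic problem is that most packages expect sequences that are tied to an identity and an event. In my case those do not exist.

I want to answer a question like the following:

If on any given day widget[bolt] has status "a" and widget[screw] has status "c", and on the next day widget[screw] has status "b", then on day 3 widget[nut] will probably be "a".

So there is no identity or transaction/event to work with. Am I overcomplicating this? Or is there a package that is well suited to it? So far I have tried arulesSequences and TraMineR.

Thanks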

Answer 1

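I think you will find this kind of problem easiest to solve by reshaping the data from long to wide and then implementing a logical test. For example: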

# reshape from long to wide (one row per day, one column per widget)
data2 <- reshape2::dcast(data, day ~ widget, value.var = "status")

# get the next row's value for "nut"
data2$next_nut <- dplyr::lead(data2$nut)

# implement your test
data2$bolt == "a" & data2$screw == "c" & data2$next_nut == "a"
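
The test above looks only one day ahead and does not check the next-day screw status. As a minimal sketch of the full three-day pattern from the question, here is a base R version. The six-day dataset and the helper `shift_up` are hypothetical, made up for illustration, and are not part of any package:

```r
# Hypothetical six-day dataset, hand-made so the result is easy to check
data <- data.frame(
  day    = rep(1:6, times = 3),
  widget = rep(c("nut", "bolt", "screw"), each = 6),
  status = c("b", "b", "a", "c", "c", "b",   # nut,   days 1-6
             "a", "b", "a", "c", "a", "b",   # bolt,  days 1-6
             "c", "b", "c", "b", "a", "c"),  # screw, days 1-6
  stringsAsFactors = FALSE
)

# Long -> wide with base R; columns become status.nut, status.bolt, status.screw
wide <- reshape(data, idvar = "day", timevar = "widget", direction = "wide")
wide <- wide[order(wide$day), ]

# Hypothetical helper: shift a vector k positions up, padding the end with NA
shift_up <- function(x, k) c(x[-seq_len(k)], rep(NA, k))

# bolt == "a" and screw == "c" today, screw == "b" tomorrow
hit <- wide$status.bolt == "a" &
  wide$status.screw == "c" &
  shift_up(wide$status.screw, 1) == "b"

# Days on which the pattern starts, and the nut status two days later
hit_days      <- wide$day[which(hit)]
nut_in_2_days <- shift_up(wide$status.nut, 2)[which(hit)]
```

`which(hit)` drops rows where the comparison is `NA`, which happens at the end of the series where no next day exists.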

Answer 2

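The key here is to reshape your dataset according to your objective. You have to make sure that each row contains all the input information (your criteria/conditions) and the target variable (what you want to know).

Based on the problem you describe:

The input information is "widget[bolt] value on a given day, widget[screw] value on the same day, and widget[screw] value the day after", so you need to make sure every row of the new dataset has this information.

The target information is the "widget[nut] value on day 3", that is, two days after the given day.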

# for reproducibility reasons
set.seed(16)  

# example dataset
d1 <- c(1:100)
d2 <- c("nut", "bolt", "screw")
data <- data.frame(expand.grid(d1,d2))
data$status <- sample(c("a","b","c"), size = nrow(data), replace = TRUE)
colnames(data) <- c("day", "widget", "status")

library(tidyverse)

data %>% 
  spread(widget, status) %>%             # reshape data
  mutate(screw_next_1 = lead(screw),     # add screw next day
         nut_next_2 = lead(nut, 2)) %>%  # add nut 2 days after (target variable)
  filter(bolt == "a" & screw == "c" & screw_next_1 == "b") # get rows that satisfy your criteria

#   day nut bolt screw screw_next_1 nut_next_2
# 1   8   c    a     c            b          a
# 2  19   c    a     c            b          c
# 3  62   c    a     c            b          c
# 4  97   c    a     c            b          b

With a simple calculation you can say that, based on this data, the probability of nut = "a" on day 3, given your criteria, is 1/4.
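
That calculation can also be done in code rather than by eye. The following is a minimal sketch, assuming dplyr and tidyr are installed. It uses `pivot_wider()` in place of `spread()` and a small hypothetical dataset (not the seeded one above) so that the result can be checked by hand:

```r
library(dplyr)
library(tidyr)

# Hypothetical six-day dataset, hand-made so the result is easy to check
data <- data.frame(
  day    = rep(1:6, times = 3),
  widget = rep(c("nut", "bolt", "screw"), each = 6),
  status = c("b", "b", "a", "c", "c", "b",   # nut,   days 1-6
             "a", "b", "a", "c", "a", "b",   # bolt,  days 1-6
             "c", "b", "c", "b", "a", "c")   # screw, days 1-6
)

matches <- data %>%
  pivot_wider(names_from = widget, values_from = status) %>%  # reshape data
  arrange(day) %>%                        # lead() relies on day order
  mutate(screw_next_1 = lead(screw),      # screw on the next day
         nut_next_2   = lead(nut, 2)) %>% # nut two days later (target)
  filter(bolt == "a", screw == "c", screw_next_1 == "b",
         !is.na(nut_next_2))              # keep only rows with a known target

# Share of matching rows where the target is "a"
result <- matches %>%
  summarise(n_matches = n(),
            p_nut_a   = mean(nut_next_2 == "a"))
```

Here the pattern starts on days 1 and 3, and nut is "a" two days later in one of those two cases, so `p_nut_a` is 0.5. With so few matches the estimate is very noisy, so it is worth reporting `n_matches` alongside it.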

Answer 3

I am not sure what you are trying to do. If you want to use TraMineR, and assuming the widgets are your sequence IDs, you can input the data as follows:

library(TraMineR)

## Transforming into the STS form expected by seqdef()
sts.data <- seqformat(data, from="SPELL", to="STS", id="widget", 
                      begin="day", end="day", status="status",
                      limit=10)

## Setting position names and sequence names
names(sts.data) <- paste0("d", 1:10)
rownames(sts.data) <- d2
sts.data
#       d1 d2 d3 d4 d5 d6 d7 d8 d9 d10
# nut    b  a  b  b  b  a  c  a  a   a
# bolt   c  b  a  b  a  c  b  a  c   c
# screw  a  b  a  a  c  c  b  b  b   c

## Creating the state sequence object
sseq <- seqdef(sts.data)

## Plotting the sequences
seqiplot(sseq, ytlab="id", ncol=3)
