R: Speed up string decomposition

Question

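I'm relatively new to R, so my repertoire of commands is limited.

I am trying to write a script that decomposes a series of Markov sequences, each stored as a text string with steps delimited by a ">" sign, into a "from - to" contingency table.

The code below, with dummy data, is as far as I have been able to get. On the small 7-case example included here it runs relatively quickly. In reality, however, I have millions of cases to parse, and my code is not efficient enough to process them in a timely fashion (it took well over an hour, which is not a feasible time frame).

I am convinced there is a more efficient way to structure this code so that it executes quickly, because I have seen the same operation performed within a couple of minutes by other Markov packages. I need my own scripted version to keep the processing flexible, so I have not turned to those packages.

What I would like to request are improvements to the script that increase its processing efficiency.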

Seq   <- c('A>B>C>D', 'A>B>C', 'A', 'A', 'B', 'B>D>C', 'D') #7 cases
Lives <- c(0,0,0,0,1,1,0)

Seqdata <- data.frame(Seq, Lives)

Seqdata$Seq <- gsub("\\s", "", Seqdata$Seq)

fromstep  <- list()
tostep    <- list()

##ORDER 1##
for (x in 1:nrow(Seqdata)) {
  steps <- unlist(strsplit(Seqdata$Seq[x], ">"))
  for (i in 1:length(steps)) {

    if (i==1) {fromstep <- c(fromstep, "Start")
    tostep   <- c(tostep, steps[i])
    }

    fromstep <- c(fromstep, steps[i])    

    if (i<length(steps)) {
      tostep   <- c(tostep, steps[i+1])
    } else if (Seqdata$Lives[x] == 1) {
      tostep   <- c(tostep, 'Lives')
    } else
      tostep    <- c(tostep, 'Dies')
  }
}

transition.freq <- table(unlist(fromstep), unlist(tostep))
transition.freq

Answer 1

I'm not familiar with Markov sequences, but this produces the same output:

xx <- strsplit(Seqdata$Seq, '>', fixed=TRUE)
table(From=unlist(lapply(xx, append, 'Start', 0L)),
      To=unlist(mapply(c, xx, ifelse(Seqdata$Lives == 0L, 'Dies', 'Lives'))))
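
The same idea can be wrapped in a reusable function. This is a minimal sketch, not part of the original answer: the name `transition_table` and its arguments are hypothetical, and it assumes the input has the same shape as the dummy data in the question. The loop in the question is likely slow because every `c(fromstep, ...)` call copies the whole accumulated list, so the cost grows roughly quadratically with the number of transitions. Building both columns in one pass per column avoids that. `Map` is used instead of `mapply` so the result is always a list, even when every sequence has the same length.

```r
# Hypothetical helper (not from the original post): builds the from/to
# transition table without growing vectors inside a loop.
transition_table <- function(seq, lives) {
  seq   <- gsub("\\s", "", as.character(seq))    # strip any whitespace
  steps <- strsplit(seq, ">", fixed = TRUE)      # one character vector per case
  end   <- ifelse(lives == 1, "Lives", "Dies")   # terminal state per case

  # Origin column: "Start" followed by every step of the sequence.
  from <- unlist(lapply(steps, function(s) c("Start", s)))
  # Destination column: every step followed by the terminal state.
  to   <- unlist(Map(c, steps, end))

  table(From = from, To = to)
}

# Dummy data from the question
Seq   <- c('A>B>C>D', 'A>B>C', 'A', 'A', 'B', 'B>D>C', 'D')
Lives <- c(0, 0, 0, 0, 1, 1, 0)

transition.freq <- transition_table(Seq, Lives)
transition.freq
```

Because each sequence contributes one more "from" entry ("Start") and one more "to" entry (the terminal state) than it has internal transitions, the two columns always have equal length and line up row by row.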

r

markov-chains

processing-efficiency