R - identify sequences in a vector

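Let's say I have a vector ab containing A's and B's. I want to identify the sequences (runs of identical values) and create a vector v of length(ab) that holds the sequence length at the first and last position of each sequence, and NA everywhere else.

However, I have a restriction: another vector x of 0/1 values marks positions where a sequence must end.

For example: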

rep("A", 6)

"A" "A" "A" "A" "A" "A"

x <- c(0,0,1,0,0,0)

0 0 1 0 0 0

should give

v <- c(3, NA, 3, 3, NA, 3)

Another example:

ab <- c(rep("A", 5), "B", rep("A", 3))
"A" "A" "A" "A" "A" "B" "A" "A" "A"
x <- c(rep(0,3),1,0,1,rep(0,3))
0 0 0 1 0 1 0 0 0

Here the output should be:

4 NA NA 4 1 1 3 NA 3

(without the restriction it would be)
5 NA NA NA 5 1 3 NA 3

So far, my code without the restriction looks like this:

ab <- c(rep("A", 5), "B", rep("A", 3))
x <- c(rep(0,3),1,0,1,rep(0,3))

cng <- ab[-1L] != ab[-length(ab)] # is there a change in A and B w.r.t the previous value?
idx <- which(cng) # where do the changes take place?
idx <- c(idx,length(ab)) # include the last value
seq_length <- diff(c(0, idx)) # how long are the sequences?

# create v
v <- rep(NA, length(ab))
v[idx] <- seq_length # sequence end
v[idx-(seq_length-1)] <- seq_length # sequence start
v

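Does anyone know how I can implement the restriction? (And since my vectors have 2 million observations, I wonder whether there is a more efficient approach than mine.) I would appreciate any comments. Many thanks!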

You can do it like this:


x <- c(rep(0,3),1,rep(0,2),1,rep(0,3))

ab <- c(rep("A", 5), "B", rep("A", 4))

# run length of each element, computed separately within each group delimited by x
res <- as.numeric(ave(ab, rev(cumsum(rev(x))),
                      FUN = function(z) with(rle(z), rep(lengths, lengths))))

> res
 [1] 4 4 4 4 1 1 1 3 3 3
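
The grouping key passed to ave() above can be inspected on its own. A small sketch (the name grp is mine, not from the answer) showing why the double rev() is used: it keeps each 1 in x as the last element of its group, whereas a plain cumsum(x) would push it to the start of the next group.

```r
x <- c(rep(0, 3), 1, rep(0, 2), 1, rep(0, 3))

# Reverse cumulative sum: every position with x == 1 closes its group.
grp <- rev(cumsum(rev(x)))
grp
# [1] 2 2 2 2 1 1 1 0 0 0

# For comparison, a forward cumsum opens a new group AT the 1 instead.
cumsum(x)
# [1] 0 0 0 1 1 1 2 2 2 2

# Positions belonging to each group (list names are the group labels).
split(seq_along(x), grp)
```

With this key, positions 1-4, 5-7 and 8-10 are processed as three independent pieces, and rle() then finds the runs of ab inside each piece.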

# set positions strictly inside a run to NA
# (run starts, run ends and runs of length 1 are kept)
replace(res, with(rle(res), setdiff(seq_along(res), c(length(res) + 1 - cumsum(rev(lengths)),
                                                      cumsum(lengths),
                                                      which(res == 1)))), NA)
 [1]  4 NA NA  4  1  1  1  3 NA  3
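
Since the question mentions 2 million observations, a fully vectorised variant may be worth trying. The sketch below is not from the answer above; it extends the diff()-based idea in the question by treating a position as a run end whenever x is 1 there, ab changes at the next position, or it is the last element. The function name run_edge_lengths is hypothetical, and it assumes ab and x have equal length and contain no NA.

```r
# Hypothetical helper, not part of the original thread.
run_edge_lengths <- function(ab, x) {
  n <- length(ab)
  stopifnot(length(x) == n)
  if (n == 0L) return(integer(0))

  # A run ends at i if x[i] == 1, if ab[i + 1] differs from ab[i],
  # or if i is the last position.
  is_end <- x == 1
  is_end[-n] <- is_end[-n] | (ab[-1L] != ab[-n])
  is_end[n] <- TRUE

  ends   <- which(is_end)
  len    <- diff(c(0L, ends))   # length of each run
  starts <- ends - len + 1L     # first position of each run

  v <- rep(NA_integer_, n)
  v[ends]   <- len
  v[starts] <- len
  v
}

# Second example from the question
ab <- c(rep("A", 5), "B", rep("A", 3))
x  <- c(rep(0, 3), 1, 0, 1, rep(0, 3))
run_edge_lengths(ab, x)
# [1]  4 NA NA  4  1  1  3 NA  3
```

Because run starts and ends are taken from explicit boundaries rather than from rle(res), two neighbouring runs that happen to have the same length stay separate: run_edge_lengths(c("A", "A", "B", "B"), rep(0, 4)) gives 2 2 2 2. It may be worth checking the replace() step above on such an input.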

For the edited scenario:

x <- c(rep(0,3),1,rep(0,2),1,rep(0,3)) 
ab <- c(rep("A", 5), "B", rep("A", 4))
ab[3] <- 'B'

as.numeric(ave(ab, rev(cumsum(rev(x))), FUN = function(z){with(rle(z), rep(lengths, lengths))}))

 [1] 2 2 1 1 1 1 1 3 3 3

ab
 [1] "A" "A" "B" "A" "A" "B" "A" "A" "A" "A"
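
To get the final vector with NAs for this scenario as well, the same replace() step can be applied to the result. A sketch that only reuses the calls already shown (the names res2, inner and out are mine):

```r
x  <- c(rep(0, 3), 1, rep(0, 2), 1, rep(0, 3))
ab <- c(rep("A", 5), "B", rep("A", 4))
ab[3] <- "B"

# Run length of each element within the groups delimited by x
res2 <- as.numeric(ave(ab, rev(cumsum(rev(x))),
                       FUN = function(z) with(rle(z), rep(lengths, lengths))))

# Positions that are neither a run start, a run end, nor a run of length 1
inner <- with(rle(res2), setdiff(seq_along(res2),
                                 c(length(res2) + 1 - cumsum(rev(lengths)),
                                   cumsum(lengths),
                                   which(res2 == 1))))

out <- replace(res2, inner, NA)
out
# [1]  2  2  1  1  1  1  1  3 NA  3
```

Only position 9 is interior to a run here, so it is the only NA.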