R - identify sequences in a vector
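Suppose I have a vector ab containing A's and B's. I want to identify sequences and create a vector v of length length(ab) that holds the sequence length at the first and last position of each sequence, and NA everywhere else.
However, there is a restriction: a second vector x of 0/1 values marks positions where a sequence must end.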
For example:
ab <- rep("A", 6)
"A" "A" "A" "A" "A" "A"
x <- c(0,0,1,0,0,0)
0 0 1 0 0 0
should give
v <- c(3, NA, 3, 3, NA, 3)
A second example:
ab <- c(rep("A", 5), "B", rep("A", 3))
"A" "A" "A" "A" "A" "B" "A" "A" "A"
x <- c(rep(0,3),1,0,1,rep(0,3))
0 0 0 1 0 1 0 0 0
Here the output should be:
4 NA NA 4 1 1 3 NA 3
(without the restriction it would be)
5 NA NA NA 5 1 3 NA 3
My code so far, which does not handle the restriction, looks like this:
ab <- c(rep("A", 5), "B", rep("A", 3))
x <- c(rep(0,3),1,0,1,rep(0,3))
cng <- ab[-1L] != ab[-length(ab)] # is there a change in A and B w.r.t the previous value?
idx <- which(cng) # where do the changes take place?
idx <- c(idx,length(ab)) # include the last value
seq_length <- diff(c(0, idx)) # how long are the sequences?
# create v
v <- rep(NA, length(ab))
v[idx] <- seq_length # sequence end
v[idx-(seq_length-1)] <- seq_length # sequence start
v
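Does anybody have an idea how I can implement the restriction? (Also, since my vector has 2 million observations, I wonder whether there is a more efficient approach than mine.)
I would appreciate any comments. Many thanks!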
You can do something like this:
x <- c(rep(0,3),1,rep(0,2),1,rep(0,3))
ab <- c(rep("A", 5), "B", rep("A", 4))
#creating result of lengths
res <- as.numeric(ave(ab, rev(cumsum(rev(x))), FUN = function(z){with(rle(z), rep(lengths, lengths))}))
> res
[1] 4 4 4 4 1 1 1 3 3 3
#creating intermediate NAs
replace(res, with(rle(res), setdiff(seq_along(res), c(length(res) + 1 - cumsum(rev(lengths)),
                                                      cumsum(lengths),
                                                      which(res == 1)))), NA)
[1] 4 NA NA 4 1 1 1 3 NA 3
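One caveat with the NA step above: it calls rle(res), so two neighbouring sequences that happen to have the same length are merged into one run. For the first example in the question (six A's with x <- c(0,0,1,0,0,0)) it returns 3 NA NA NA NA 3 rather than 3 NA 3 3 NA 3. A possible alternative, sketched below, extends the question's own index-based code by also treating x == 1 as a sequence end. The function name run_length_marks is made up for this illustration, and the sketch assumes ab and x have the same length and contain no NA values. It is fully vectorized, so it should be reasonable for a vector with 2 million observations, though I have not benchmarked it.

```r
# Hypothetical helper (not part of the original answer): returns the run
# length at the first and last position of every run, NA elsewhere.
# Assumes length(ab) == length(x) and no NA values in either vector.
run_length_marks <- function(ab, x) {
  n <- length(ab)
  if (n == 0L) return(integer(0))
  # Position i closes a run if the value changes at i + 1 or if x[i] == 1.
  # The last position always closes a run.
  is_end <- c(ab[-1L] != ab[-n] | x[-n] == 1, TRUE)
  end_idx <- which(is_end)                # where each run ends
  run_len <- diff(c(0L, end_idx))         # how long each run is
  start_idx <- end_idx - run_len + 1L     # where each run starts
  v <- rep(NA_integer_, n)
  v[end_idx] <- run_len
  v[start_idx] <- run_len
  v
}

# First example from the question
run_length_marks(rep("A", 6), c(0, 0, 1, 0, 0, 0))
# [1]  3 NA  3  3 NA  3

# Second example from the question
run_length_marks(c(rep("A", 5), "B", rep("A", 3)),
                 c(rep(0, 3), 1, 0, 1, rep(0, 3)))
# [1]  4 NA NA  4  1  1  3 NA  3
```

Because the starts and ends are computed directly from the run boundaries, runs of length 1 need no special case: their start and end index coincide.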
For the edited scenario:
x <- c(rep(0,3),1,rep(0,2),1,rep(0,3))
ab <- c(rep("A", 5), "B", rep("A", 4))
ab[3] <- 'B'
as.numeric(ave(ab, rev(cumsum(rev(x))), FUN = function(z){with(rle(z), rep(lengths, lengths))}))
[1] 2 2 1 1 1 1 1 3 3 3
ab
[1] "A" "A" "B" "A" "A" "B" "A" "A" "A" "A"
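In case the grouping key rev(cumsum(rev(x))) looks cryptic: it counts, for each position, how many 1s in x sit at that position or later, so every position up to and including a 1 shares one group id. The short sketch below prints the key for the x used above. The variable names grp and fwd are only illustrative, and the forward-counting variant is my own equivalent formulation, not something from the original code.

```r
x <- c(rep(0, 3), 1, rep(0, 2), 1, rep(0, 3))

# Key used in the answer: number of 1s at or after each position.
grp <- rev(cumsum(rev(x)))
grp
# [1] 2 2 2 2 1 1 1 0 0 0

# Equivalent partition, counted forwards: number of 1s strictly before
# each position. The ids differ, but the groups are the same.
fwd <- cumsum(c(0, head(x, -1)))
fwd
# [1] 0 0 0 0 1 1 1 2 2 2
```

Either key can be passed to ave() as the grouping variable, since ave() only cares about which positions belong together, not about the id values themselves.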