How to detect the position ranges of a specific set of characters in a string

Question

I have the following sequence:

my_seq <- "----?????-----?V?D????-------???IL??A?---"

What I want to do is detect the position ranges of the non-dash characters.

----?????-----?V?D????-------???IL??A?---
|   |   |     |      |       |       |  
1   5   9    15     22      30      38

The final output should be a character vector:

out <- c("5-9", "15-22", "30-38")

How can I achieve this in R?

Answer 1

You could do it like this:

my_seq <- "----?????-----?V?D????-------???IL??A?---"

non_dash <- which(strsplit(my_seq, "")[[1]] != '-')
pos      <- non_dash[c(0, diff(non_dash)) != 1 | c(diff(non_dash), 0) != 1]

apply(matrix(pos, ncol = 2, byrow = TRUE), 1, function(x) paste(x, collapse = "-"))
#> [1] "5-9"   "15-22" "30-38"

Created on 2022-02-18 by the reprex package (v2.0.1)
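To make the `diff()` filter above easier to follow, here is a minimal sketch that splits it into separate start and end tests. The helper name `run_ranges` and its `skip` argument are hypothetical, introduced only for illustration; they are not part of the original answer.

```r
# Hypothetical helper (not from the original answer): same idea, step by step.
run_ranges <- function(x, skip = "-") {
  chars <- strsplit(x, "")[[1]]
  idx   <- which(chars != skip)          # positions of the kept characters
  if (length(idx) == 0) return(character(0))
  is_start <- c(TRUE, diff(idx) != 1)    # a gap precedes this position
  is_end   <- c(diff(idx) != 1, TRUE)    # a gap follows this position
  paste(idx[is_start], idx[is_end], sep = "-")
}

run_ranges("----?????-----?V?D????-------???IL??A?---")
#> [1] "5-9"   "15-22" "30-38"
```

Because starts and ends are collected separately, a run of length one is reported as, for example, "3-3", whereas pairing the filtered positions through a two-column matrix assumes every run contributes exactly two positions.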

Answer 2

Inspired by @lovalery's great answer, a base R solution is:

g <- gregexpr(pattern = "[^-]+", my_seq)
d <- data.frame(start = unlist(g),
                end   = unlist(g) + attr(g[[1]], "match.length") - 1)
paste(d$start, d$end, sep = "-")
#> [1] "5-9"   "15-22" "30-38"
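If there are several sequences, the same `gregexpr()` idea can be applied to each one. The following is only a sketch: the vector `seqs` and its second and third elements are made up for illustration, and the `-1` check covers the case where `gregexpr()` finds no match.

```r
# Illustrative input: only the first element comes from the question.
seqs <- c("----?????-----?V?D????-------???IL??A?---", "AB--C", "---")

ranges <- lapply(gregexpr("[^-]+", seqs), function(m) {
  if (m[1] == -1) return(character(0))   # gregexpr() returns -1 when nothing matches
  paste(m, m + attr(m, "match.length") - 1, sep = "-")
})

# ranges[[1]] is c("5-9", "15-22", "30-38")
# ranges[[2]] is c("1-2", "5-5")
# ranges[[3]] is character(0)
```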

Answer 3

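Try

# `?` is a regex metacharacter, so it is written as '\\?' inside an R string.
# gregexec() requires R >= 4.1.0. This assumes every run starts and ends with `?`.
paste0(gregexec('-\\?', my_seq)[[1]][1,] + 1, '-',
       gregexec('\\?-', my_seq)[[1]][1,])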
#> [1] "5-9"   "15-22" "30-38"

Answer 4

Here is an rle + tidyverse approach:

library(dplyr)
with(rle(strsplit(my_seq, "")[[1]] != "-"),
     data.frame(lengths, values)) |>
  mutate(end = cumsum(lengths)) |>
  mutate(start =  1 + lag(end, 1,0)) |>
  mutate(rng = paste(start, end, sep = "-")) |>
  filter(values) |>
  pull(rng)

#> [1] "5-9"   "15-22" "30-38"

However, if you don't mind installing S4Vectors, the code can become very concise:

library(S4Vectors)

r <- Rle(strsplit(my_seq, "")[[1]] != "-")

paste(start(r), end(r), sep = "-")[runValue(r)]

#> [1] "5-9"   "15-22" "30-38"
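For readers who want the run-length idea without dplyr or S4Vectors, the start and end of each run can also be derived directly from the `lengths` component of base R's `rle()`. This is a sketch of that variant, not code from the original answer.

```r
my_seq <- "----?????-----?V?D????-------???IL??A?---"

r     <- rle(strsplit(my_seq, "")[[1]] != "-")  # runs of dash / non-dash
end   <- cumsum(r$lengths)                      # last position of each run
start <- end - r$lengths + 1                    # first position of each run

# Keep only the runs whose value is TRUE (non-dash)
res <- paste(start, end, sep = "-")[r$values]
res
#> [1] "5-9"   "15-22" "30-38"
```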

Answer 5

Please find below another possible solution using the stringr library.

Reprex

Code

library(stringr)

s <- as.data.frame(str_locate_all(my_seq, "[^-]+")[[1]])
result <- paste(s$start, s$end, sep ="-")

Output

result
#> [1] "5-9"   "15-22" "30-38"

Created on 2022-02-18 by the reprex package (v2.0.1)

Answer 6

A one-liner in base R using utf8ToInt:

apply(matrix(which(diff(c(FALSE, utf8ToInt(my_seq) != 45L, FALSE)) != 0) - 0:1, 2), 2, paste, collapse = "-")
#> [1] "5-9"   "15-22" "30-38"
