Subset string by counting specific characters

I have the following strings:

strings <- c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG") 

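I want to cut each string off as soon as the combined number of occurrences of A, G, and N reaches a certain value, say 3. In that case, the result should be: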

some_function(strings)

c("ABBSDGN", "AABSDG", "AGN", "GGG") 

I tried stringi, stringr, and regular expressions, but I couldn't figure it out.

Here is a base R option using strsplit:

sapply(strsplit(strings, ""), function(x)
    paste(x[1:which.max(cumsum(x %in% c("A", "G", "N")) == 3)], collapse = ""))
#[1] "ABBSDGN" "AABSDG"  "AGN"     "GGG"

Or in the tidyverse:

library(tidyverse)
map_chr(str_split(strings, ""), 
    ~str_c(.x[1:which.max(cumsum(.x %in% c("A", "G", "N")) == 3)], collapse = ""))
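
One caveat with the which.max approach: if a string never reaches the target count, the comparison is all FALSE and which.max returns 1, so you silently get the first character back. Below is a minimal sketch of a guarded variant that returns NA instead. The function name truncate_cumsum and its arguments are made up for illustration, and "AG" is an invented test string.

```r
# Hypothetical helper: cut each string after the n-th occurrence of any
# character in `chars`; return NA when the count is never reached.
truncate_cumsum <- function(x, chars = c("A", "G", "N"), n = 3) {
  vapply(strsplit(x, ""), function(ch) {
    # match() gives the first position where the running count equals n,
    # or NA if the running count never gets that far
    idx <- match(n, cumsum(ch %in% chars))
    if (is.na(idx)) NA_character_ else paste(ch[seq_len(idx)], collapse = "")
  }, character(1))
}

res <- truncate_cumsum(c("ABBSDGNHNGA", "AG"))
res
```

Using match() rather than which.max() is a design choice here: it makes the "not found" case explicit instead of relying on the position of the first maximum.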

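Use gregexpr to find the positions where the pattern matches, then pick the n-th position (3) and use substr to take everything from position 1 through that n-th position.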

nChars <- 3
pattern <- "A|G|N"
# Using sapply to iterate over strings vector
sapply(strings, function(x) substr(x, 1, gregexpr(pattern, x)[[1]][nChars]))

PS:

If a string has fewer than 3 matches, this returns NA for it, so you only need to call na.omit on the final result.
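
To make the NA case concrete, here is a small sketch. The input "AG" is an invented string with only two matches; everything else follows the code above.

```r
# "AG" (made up for this example) contains only two of the counted characters
strings <- c("ABBSDGNHNGA", "AG", "AGNAFG")
nChars <- 3
pattern <- "A|G|N"

# Indexing past the last match position yields NA, and substr() then returns NA
res <- sapply(strings, function(x) substr(x, 1, gregexpr(pattern, x)[[1]][nChars]))

# Drop the strings that never reached nChars matches
kept <- na.omit(res)
kept
```

Note that na.omit shortens the result, so it no longer lines up with the input vector. If you need to keep the alignment, leave the NA values in place.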

Interesting problem. I wrote a function (see below) that solves it. It assumes that your strings contain only letters and no special characters.

 reduce_strings = function(str, chars, cnt){

  # Replacing chars in str with "!"
  chars = paste0(chars, collapse = "")
  replacement = paste0(rep("!", nchar(chars)), collapse = "")
  str_alias = chartr(chars, replacement, str) 

  # Obtain indices with ! for each string
  idx = stringr::str_locate_all(pattern = '!', str_alias)

  # Reduce each string in str
  reduce = function(i) substr(str[i], start = 1, stop = idx[[i]][cnt, 1])
  result = vapply(seq_along(str), reduce, "character")
  return(result)
}

# Example call
str = c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG") 
chars = c("A", "G", "N") # Characters that are counted
cnt = 3 # Count of the characters, at which the strings are cut off
reduce_strings(str, chars, cnt) # "ABBSDGN" "AABSDG" "AGN" "GGG"

This is just a version of Maurits Evers' answer without strsplit.

sapply(strings,
       function(x) {
         raw <- rawToChar(charToRaw(x), multiple = TRUE)
         idx <- which.max(cumsum(raw %in% c("A", "G", "N")) == 3)
         paste(raw[1:idx], collapse = "")
       })
## ABBSDGNHNGA   AABSDGDRY      AGNAFG  GGGDSRTYHG 
##   "ABBSDGN"    "AABSDG"       "AGN"       "GGG"

Or, slightly differently, without strsplit and paste:

test <- charToRaw("AGN")
sapply(strings,
       function(x) {
         raw <- charToRaw(x)
         idx <- which.max(cumsum(raw %in% test) == 3)
         rawToChar(raw[1:idx])
       })

You can accomplish this simply by calling str_extract from the stringr package:

library(stringr)

strings <- c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG")

str_extract(strings, '([^AGN]*[AGN]){3}')
# [1] "ABBSDGN" "AABSDG"  "AGN"     "GGG"

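The [^AGN]*[AGN] part of the regex pattern says to look for zero or more consecutive characters that are not A, G, or N, followed by one instance of A, G, or N. The additional wrapping in parentheses and curly braces, as in ([^AGN]*[AGN]){3}, means to look for that pattern three times in a row. You can change the number of occurrences of A, G, N to look for by changing the integer in the curly braces: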

str_extract(strings, '([^AGN]*[AGN]){4}')
# [1] "ABBSDGNHN"  NA           "AGNA"       "GGGDSRTYHG"
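
If the characters and the count vary, the pattern can be built from variables. This is only a sketch: truncate_at_nth is a hypothetical helper (not part of stringr), and it assumes the counted characters are plain letters or digits, so nothing needs escaping inside the character class.

```r
library(stringr)

# Hypothetical helper: build "([^AGN]*[AGN]){3}"-style patterns from arguments.
# Assumes `chars` holds plain letters/digits only (no regex metacharacters).
truncate_at_nth <- function(x, chars, n) {
  cls <- paste(chars, collapse = "")
  pattern <- sprintf("([^%s]*[%s]){%d}", cls, cls, as.integer(n))
  str_extract(x, pattern)
}

strings <- c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG")
res3 <- truncate_at_nth(strings, c("A", "G", "N"), 3)
res4 <- truncate_at_nth(strings, c("A", "G", "N"), 4)
```

As with the direct calls above, strings that never reach the requested count come back as NA.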

There are several ways to accomplish this with base R functions. One is to use regexpr followed by regmatches:

m <- regexpr('([^AGN]*[AGN]){3}', strings)
regmatches(strings, m)
# [1] "ABBSDGN" "AABSDG"  "AGN"     "GGG"
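
One thing to keep in mind: unlike str_extract, regmatches drops strings that do not match, so the result can be shorter than the input. A minimal sketch of one way to keep the result aligned with the input, using the {4} pattern so that one string fails to match:

```r
strings <- c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG")

m <- regexpr('([^AGN]*[AGN]){4}', strings)
short <- regmatches(strings, m)   # only the matches; "AABSDGDRY" is dropped

# Pre-fill with NA, then write the matches back into their original slots
out <- rep(NA_character_, length(strings))
out[m != -1] <- regmatches(strings, m)
out
```

regexpr reports -1 for a string without a match, which is what the m != -1 filter relies on.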

Alternatively, you can use sub:

sub('(([^AGN]*[AGN]){3}).*', '\\1', strings)
# [1] "ABBSDGN" "AABSDG"  "AGN"     "GGG"
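
Be aware that sub behaves differently from the other two when a string has too few matches: it leaves the string unchanged instead of returning NA or dropping it. A short sketch with the count raised to 4:

```r
strings <- c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG")

# "AABSDGDRY" has only three of the counted characters, so the pattern
# does not match and sub() returns that string as it was
res <- sub('(([^AGN]*[AGN]){4}).*', '\\1', strings)
res
```

Whether that is acceptable depends on whether you need to tell truncated strings apart from strings that never reached the count.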