Using stringr::str_detect to detect whether a string appears after a character has appeared 4 times

Question

I'm not sure I've worded the question well, but this is basically what I want to do.

Example data:

Data <- c("NELIG_Q1_1_C1_A", "NELIG_N1_1_EG1_B", "NELIG_V2_1_NTH_C", "NELIG_Q2_1_C5_Q",
"NELIG_N1_1_C1_RA", "NELIG_Q1_1_EG1_QR", "NELIG_V2_1_NTH_PQ", "NELIG_N2_1_C5_PRQ")

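I want to filter on the last group of letters using str_detect. There are always four "_" before the string/pattern I'm looking for, but many different letter combinations can follow the fourth "_". In the example above, I'm trying to detect only the letter "Q".

If I do a simple Data2 <- Data %>% filter(str_detect(column, "Q")), I get every row that has a Q anywhere in the string. How can I make it look only at the last part?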

Answer 1

If I understand your question correctly, you can do this:

library(stringr)
str_detect(Data, ".*_.*_.*_.*_.*Q.*$")
#R> [1] FALSE FALSE FALSE  TRUE FALSE  TRUE  TRUE  TRUE

This detects whether there is a "Q" after the fourth "_".
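One caveat with the pattern above: because .* can also match underscores, it accepts a "Q" anywhere after the fourth "_", including in a sixth or later piece. Here is a minimal sketch of a stricter variant, assuming the goal is to look only inside the fifth "_"-separated piece. It swaps .* for [^_]* and anchors at the start of the string. The input called extra is a made-up example, not part of the question's data.

```r
library(stringr)

Data <- c("NELIG_Q1_1_C1_A", "NELIG_N1_1_EG1_B", "NELIG_V2_1_NTH_C", "NELIG_Q2_1_C5_Q",
          "NELIG_N1_1_C1_RA", "NELIG_Q1_1_EG1_QR", "NELIG_V2_1_NTH_PQ", "NELIG_N2_1_C5_PRQ")

# ^(?:[^_]*_){4} consumes exactly the first four "_"-terminated pieces;
# [^_]*Q then requires a Q before any further "_".
strict <- "^(?:[^_]*_){4}[^_]*Q"
res_strict <- str_detect(Data, strict)
res_strict
#R> [1] FALSE FALSE FALSE  TRUE FALSE  TRUE  TRUE  TRUE

# Hypothetical input with five underscores and a Q only in the sixth piece.
extra <- "a_b_c_d_e_Q"
loose_extra  <- str_detect(extra, ".*_.*_.*_.*_.*Q.*$")  # Q anywhere after the 4th "_"
strict_extra <- str_detect(extra, strict)                # Q must be in the 5th piece
c(loose = loose_extra, strict = strict_extra)
```

On the question's data both patterns agree, since every string has exactly four underscores. They only differ on inputs with more underscores, like the hypothetical one above.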

Looking at the title:

detecting string after 4 constant characters

you can then create a general function like this:

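# Returns TRUE if a given character occurs after another character has
# occurred four times.
#
# Note: `what` and `after` are inserted into the regex unescaped.
#
# Args:
#   x      character vector to check.
#   what   character to look for after the fourth occurrence of `after`.
#   after  character that must occur four times first.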
detect_after_four_times <- function(x, what, after){
  reg <- sprintf(".*%s.*%s.*%s.*%s.*%s.*$", after, after, after, after, 
                 what)
  str_detect(x, reg)
}

detect_after_four_times(Data, "Q", "_")
#R> [1] FALSE FALSE FALSE  TRUE FALSE  TRUE  TRUE  TRUE
detect_after_four_times(Data, "R", "_") # look for R instead
#R> [1] FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE  TRUE

# correctly returns FALSE when "after" occurs only three times
detect_after_four_times("only_three_dashes_Q", "Q", "_")
#R> [1] FALSE
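The function above pastes what and after straight into the regex, so a metacharacter such as "." would be read as a pattern rather than as literal text. The sketch below is one possible generalisation and not part of the original answer. The name detect_after_n_times and the n argument are hypothetical. It relies on the \Q...\E literal quoting supported by the ICU regex engine behind stringr, and it assumes neither argument contains the sequence \E.

```r
library(stringr)

Data <- c("NELIG_Q1_1_C1_A", "NELIG_N1_1_EG1_B", "NELIG_V2_1_NTH_C", "NELIG_Q2_1_C5_Q",
          "NELIG_N1_1_C1_RA", "NELIG_Q1_1_EG1_QR", "NELIG_V2_1_NTH_PQ", "NELIG_N2_1_C5_PRQ")

# Hypothetical helper: TRUE if `what` occurs somewhere after the n-th
# occurrence of `after`. Both are matched as literal text.
detect_after_n_times <- function(x, what, after, n = 4) {
  # Wrap a string in \Q...\E so the regex engine treats it literally.
  quote_literal <- function(s) paste0("\\Q", s, "\\E")
  reg <- paste0("^(?:.*?", quote_literal(after), "){", n, "}",
                ".*", quote_literal(what))
  str_detect(x, reg)
}

res_q     <- detect_after_n_times(Data, "Q", "_")
res_three <- detect_after_n_times("only_three_dashes_Q", "Q", "_")

# "." is treated as a literal dot here, not as "any character".
res_dot_yes <- detect_after_n_times("a.b.Q", "Q", ".", n = 2)
res_dot_no  <- detect_after_n_times("abQ", "Q", ".", n = 2)
```

With the defaults this should give the same results on Data as detect_after_four_times. The quoting only matters when the arguments contain regex metacharacters.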

Answer 2

If you want to use the tidyverse:

library(magrittr)

data <- tibble::tibble(Col =  c("NELIG_Q1_1_C1_A", "NELIG_N1_1_EG1_B", 
                                "NELIG_V2_1_NTH_C", "NELIG_Q2_1_C5_Q",
                                "NELIG_N1_1_C1_RA", "NELIG_Q1_1_EG1_QR", 
                                "NELIG_V2_1_NTH_PQ", "NELIG_N2_1_C5_PRQ"))

data %>% 
  dplyr::mutate(Col = purrr::map_lgl(Col,
                                     ~ stringr::str_detect(
                                       unlist(
                                         stringr::str_split(.x, 
                                                            "_"))[5], 
                                       "Q")))
#> # A tibble: 8 x 1
#>   Col  
#>   <lgl>
#> 1 FALSE
#> 2 FALSE
#> 3 FALSE
#> 4 TRUE 
#> 5 FALSE
#> 6 TRUE 
#> 7 TRUE 
#> 8 TRUE

Created on 2020-11-05 by the reprex package (v0.3.0)
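Since str_split is vectorised, the same split-and-index idea can probably be written without purrr. The following is a minimal sketch of that variant, not the answer author's code. It assumes that at least one input has five or more pieces, because otherwise column 5 of the matrix does not exist and the indexing fails.

```r
library(stringr)

Data <- c("NELIG_Q1_1_C1_A", "NELIG_N1_1_EG1_B", "NELIG_V2_1_NTH_C", "NELIG_Q2_1_C5_Q",
          "NELIG_N1_1_C1_RA", "NELIG_Q1_1_EG1_QR", "NELIG_V2_1_NTH_PQ", "NELIG_N2_1_C5_PRQ")

# simplify = TRUE returns a character matrix: one row per input string,
# one column per piece. Shorter strings are padded with "".
pieces <- str_split(Data, "_", simplify = TRUE)

# Column 5 is the piece after the fourth "_".
fifth_piece <- pieces[, 5]
has_q <- str_detect(fifth_piece, "Q")

# Keep only the strings whose fifth piece contains a Q.
Data[has_q]
```

The logical vector has_q can be used inside dplyr::filter in the same way as the str_detect call in the question.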

Answer 3

If the goal is to detect/match the strings that contain Q in the 'section' after the last _, then this works:

grep("_[A-Z]*Q[A-Z]*$", Data, value = TRUE, perl = TRUE)
[1] "NELIG_Q2_1_C5_Q"   "NELIG_Q1_1_EG1_QR" "NELIG_V2_1_NTH_PQ" "NELIG_N2_1_C5_PRQ"

Alternatively, with str_detect:

library(stringr)
str_detect(Data, "_[A-Z]*Q[A-Z]*$")
[1] FALSE FALSE FALSE  TRUE FALSE  TRUE  TRUE  TRUE

Data:

Data <- c("NELIG_Q1_1_C1_A", "NELIG_N1_1_EG1_B", "NELIG_V2_1_NTH_C", "NELIG_Q2_1_C5_Q",
          "NELIG_N1_1_C1_RA", "NELIG_Q1_1_EG1_QR", "NELIG_V2_1_NTH_PQ", "NELIG_N2_1_C5_PRQ")
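The [A-Z]* parts of this pattern assume the last section holds only uppercase letters, which is true of the example data. If that section could also contain digits or lowercase letters, one possible adjustment is to allow any character except "_". The inputs below are made up to show the difference.

```r
library(stringr)

# Hypothetical inputs whose last section contains digits or lowercase letters.
x <- c("NELIG_Q2_1_C5_Q", "NELIG_Q1_1_C1_1Q2", "NELIG_Q1_1_C1_ab")

# Last section must consist of uppercase letters only.
upper_only <- str_detect(x, "_[A-Z]*Q[A-Z]*$")

# Last section may contain any characters other than "_".
any_chars <- str_detect(x, "_[^_]*Q[^_]*$")

rbind(upper_only, any_chars)
```

Only the second input separates the two patterns: its last section "1Q2" contains a Q but also digits, so the uppercase-only pattern rejects it.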
