Problem with regex (check string for certain repetitions)

Question

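I want to check whether a text contains a) three consonants in a row or b) four identical letters in a row. Can someone help me with the regular expressions?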

library(tidyverse)

df <- data.frame(text = c("Completely valid", "abcdefg", "blablabla", "flahaaaa", "asdf", "another text", "a last one", "sj", "ngbas"))

consonants <- c("q", "w", "r", "t", "z", "p", "s", "d", "f", "g", "h", "k", "l", "m", "n", "b", "x")

df %>% mutate(
         invalid = FALSE, 
         # Length too short
         invalid = ifelse(nchar(text)<3, TRUE, invalid),
         # Contains three consonants in a row: e.g. "ngbas"
         invalid = ifelse(str_detect(text,"???"),  TRUE, FALSE),   # <--- Regex missing
         # More than 3 identical characters in a row: e.g. "flahaaaa" 
         invalid = ifelse(str_detect(text,"???"),  TRUE, FALSE)    # <--- Regex missing
       )

Answer 1

Three consonants in a row:

[qwrtzpsdfghklmnbx]{3}

A run of more than three of the same character:

([a-z])(\\1){3}
    # The backslash is doubled because it is also the escape character in R strings.

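The latter uses a backreference. The number is the ordinal of the capture group being referenced (= the expression in parentheses), which in this case is the character class of lowercase Latin letters.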

For clarity, letter case is not taken into account here.
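As a quick check of both patterns, here is a minimal sketch that applies them to the sample strings from the question. It assumes base R's grepl (with perl = TRUE) as a stand-in for str_detect so that it runs without loading any packages; the variable names three_consonants and four_identical are only illustrative.

```r
# Sample strings from the question
text <- c("Completely valid", "abcdefg", "blablabla", "flahaaaa",
          "asdf", "another text", "a last one", "sj", "ngbas")

# Three consonants in a row, using the consonant list from the question
three_consonants <- grepl("[qwrtzpsdfghklmnbx]{3}", text, perl = TRUE)

# One lowercase letter followed by three more copies of it (four in a row).
# "\\1" in the R string is the regex backreference \1.
four_identical <- grepl("([a-z])\\1{3}", text, perl = TRUE)

# Show the two flags side by side
data.frame(text, three_consonants, four_identical)
```

Note that the consonant pattern also flags "Completely valid", because "mpl" is a run of three letters from that list. If such words should pass, the rule itself probably needs refining, not just the regex.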

Without backreferences, the solution is somewhat more verbose:

(aaaa|bbbb|cccc|dddd|eeee|ffff|gggg|hhhh|iiii|jjjj|kkkk|llll|mmmm|nnnn|oooo|pppp|qqqq|rrrr|ssss|tttt|uuuu|vvvv|wwww|xxxx|yyyy|zzzz)
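The long alternation does not have to be typed by hand. The sketch below builds it from R's built-in letters vector and compares it with the backreference version on a few made-up test strings. It again assumes base R's grepl in place of str_detect, and the names alternation, with_alternation and with_backreference are hypothetical.

```r
# Build "(aaaa|bbbb|...|zzzz)" from the built-in `letters` vector
alternation <- paste0("(", paste(strrep(letters, 4), collapse = "|"), ")")

# A few made-up strings, two with a run of four identical letters
text <- c("flahaaaa", "blablabla", "zzzz", "aaab")

with_alternation   <- grepl(alternation, text)
with_backreference <- grepl("([a-z])\\1{3}", text, perl = TRUE)

# Both approaches should flag the same strings
identical(with_alternation, with_backreference)
```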

The relevant documentation can be found here.

Answer 2

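You don't need to check the length of the word; the regex takes care of that.

There is a mistake in your code: the last ifelse overwrites whatever was computed before it. For example, if the second ifelse yields TRUE and the third yields FALSE, the final output is FALSE. Each check should fall back to the previous value of invalid instead of FALSE, so that a string is flagged when any of the checks matches.

I have corrected that; the complete code follows: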

df %>% mutate(
         invalid = FALSE,

         # Contains three consonants in a row: e.g. "ngbas"
         invalid = ifelse(str_detect(text, regex("[BCDFGHJKLMNPQRSTVWXYZ]{3}", ignore_case = TRUE)), TRUE, invalid),
         # More than 3 identical characters in a row: e.g. "flahaaaa"
         # ("\\1" is the backreference; a single "\1" in an R string would be an octal escape)
         invalid = ifelse(str_detect(text, regex("([a-zA-Z])\\1{3}", ignore_case = TRUE)), TRUE, invalid)
       )
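To illustrate why the fallback value matters, here is a small sketch comparing the overwriting version with the accumulating one. It assumes base R's grepl in place of str_detect; check_a and check_b are hypothetical stand-ins for the two checks, and the three test strings are taken from the question.

```r
text <- c("ngbas", "flahaaaa", "blablabla")

# Stand-ins for the two str_detect() calls
check_a <- grepl("[bcdfghjklmnpqrstvwxyz]{3}", text, ignore.case = TRUE, perl = TRUE)
check_b <- grepl("([a-z])\\1{3}", text, ignore.case = TRUE, perl = TRUE)

# Pattern from the question: every step ends in FALSE, so the last check wins
overwritten <- ifelse(check_a, TRUE, FALSE)
overwritten <- ifelse(check_b, TRUE, FALSE)

# Corrected pattern: fall back to the previous value
accumulated <- FALSE
accumulated <- ifelse(check_a, TRUE, accumulated)
accumulated <- ifelse(check_b, TRUE, accumulated)

# Shorter alternative with the same result: a logical OR of the checks
combined <- check_a | check_b

data.frame(text, overwritten, accumulated, combined)
```

With the overwriting version, "ngbas" loses its flag because the second check does not match it. The accumulating version and the plain OR both keep it.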

regex

r

tidyverse