Problem with regex (check string for certain repetitions)
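I want to check whether a text contains a) three consonants in a row or b) four identical letters in a row. Can someone help me with the regular expressions?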
library(tidyverse)
df <- data.frame(text = c("Completely valid", "abcdefg", "blablabla", "flahaaaa", "asdf", "another text", "a last one", "sj", "ngbas"))
consonants <- c("q", "w", "r", "t", "z", "p", "s", "d", "f", "g", "h", "k", "l", "m", "n", "b", "x")
df %>% mutate(
  invalid = FALSE,
  # Length too short
  invalid = ifelse(nchar(text)<3, TRUE, invalid),
  # Contains three consonants in a row: e.g. "ngbas"
  invalid = ifelse(str_detect(text,"???"), TRUE, FALSE), # <--- Regex missing
  # More than 3 identical characters in a row: e.g. "flahaaaa"
  invalid = ifelse(str_detect(text,"???"), TRUE, FALSE) # <--- Regex missing
)
Three consonants in a row (using the consonant set from the question, which omits c, j, v and y):
[qwrtzpsdfghklmnbx]{3}
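As a quick check of the character-class pattern, here is a minimal sketch run against the sample strings from the question. It uses base R's `grepl(..., perl = TRUE)` rather than `str_detect` only so that it runs without extra packages; that substitution is my assumption, and the pattern itself is unchanged.

```r
# Sample strings taken from the question's data frame.
text <- c("Completely valid", "abcdefg", "blablabla", "flahaaaa", "asdf",
          "another text", "a last one", "sj", "ngbas")

# Any three characters in a row drawn from the question's consonant set.
three_consonants <- "[qwrtzpsdfghklmnbx]{3}"

# TRUE where the string contains such a run.
has_three <- grepl(three_consonants, text, perl = TRUE)

# Show which strings were flagged.
print(data.frame(text, has_three))
```

Note that "Completely valid" is flagged too, because of the run "mpl", so the rule may be stricter than intended for ordinary English words.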
A run of more than three of the same character:
([a-z])(\\1){3}
# The backslash is doubled because it is the escape character in R string literals.
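The latter uses a backreference. The number is the index of the capture group (= the parenthesised expression) being referred to; in this case, the group is the character class of lowercase Latin letters.
For clarity, letter case is not considered here.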
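To make the backreference and the escaping concrete, here is a small sketch under the same assumption as before (base R `grepl` with `perl = TRUE` standing in for `str_detect`). The test strings other than "flahaaaa" and "blablabla" are made up for illustration.

```r
# In an R string literal, "\\1" is two characters: a backslash and the digit 1.
# That is what the regex engine needs to see as the backreference \1.
escaped <- "\\1"
# A single backslash followed by 1 is read by R as an octal escape (one control character).
unescaped <- "\1"
print(c(nchar(escaped), nchar(unescaped)))

# One lowercase letter, followed by three more copies of that same letter.
four_same <- "([a-z])\\1{3}"

# "aaa" and "zzzz" are hypothetical inputs added for this sketch.
text <- c("flahaaaa", "blablabla", "aaa", "zzzz")
has_four <- grepl(four_same, text, perl = TRUE)
print(data.frame(text, has_four))
```

Only strings with at least four identical lowercase letters in a row are flagged; "aaa" falls one short.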
Without backreferences, the solution is somewhat more verbose:
(aaaa|bbbb|cccc|dddd|eeee|ffff|gggg|hhhh|iiii|jjjj|kkkk|llll|mmmm|nnnn|oooo|pppp|qqqq|rrrr|ssss|tttt|uuuu|vvvv|wwww|xxxx|yyyy|zzzz)
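If you do want the explicit alternation, it does not have to be typed by hand. The sketch below builds it from R's built-in `letters` vector and checks that it flags the same sample strings as the backreference pattern; the variable names are mine, and base R `grepl` is again used in place of `str_detect`.

```r
# "aaaa", "bbbb", ..., "zzzz" joined with "|" and wrapped in parentheses.
alternation <- paste0("(", paste(strrep(letters, 4), collapse = "|"), ")")

# The backreference pattern it is meant to be equivalent to.
backreference <- "([a-z])\\1{3}"

# Sample strings taken from the question's data frame.
text <- c("Completely valid", "abcdefg", "blablabla", "flahaaaa", "asdf",
          "another text", "a last one", "sj", "ngbas")

with_alternation   <- grepl(alternation, text, perl = TRUE)
with_backreference <- grepl(backreference, text, perl = TRUE)

# Both patterns should flag exactly the same strings.
print(data.frame(text, with_alternation, with_backreference))
```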
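The relevant documentation can be found here.
You do not need to check the length of the word; the regular expression does that for you.
Your code has a bug: each ifelse returns FALSE in its else branch, so the last ifelse overwrites whatever came before. For example, if the second ifelse yields TRUE and the third yields FALSE, the result is FALSE. Passing invalid as the else value keeps an earlier TRUE, so the checks combine as an OR.
I have corrected this.
The full code is as follows: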
df %>% mutate(
  invalid = FALSE,
  # Contains three consonants in a row: e.g. "ngbas"
  invalid = ifelse(str_detect(text, regex("[BCDFGHJKLMNPQRSTVWXYZ]{3}", ignore_case = TRUE)), TRUE, invalid),
  # More than 3 identical characters in a row: e.g. "flahaaaa"
  # The backreference must be written "\\1" inside an R string.
  invalid = ifelse(str_detect(text, regex("([a-zA-Z])\\1{3}", ignore_case = TRUE)), TRUE, invalid)
)
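The difference between overwriting and carrying the flag forward can be reproduced in a few lines. This is only a sketch: it uses base R `grepl` with `ignore.case = TRUE` as a stand-in for `str_detect(text, regex(..., ignore_case = TRUE))`, and the intermediate variable names are mine.

```r
# Sample strings taken from the question's data frame.
text <- c("Completely valid", "abcdefg", "blablabla", "flahaaaa", "asdf",
          "another text", "a last one", "sj", "ngbas")

# The two checks, evaluated once each.
consonant_run <- grepl("[bcdfghjklmnpqrstvwxyz]{3}", text, ignore.case = TRUE, perl = TRUE)
repeat_run    <- grepl("([a-z])\\1{3}", text, ignore.case = TRUE, perl = TRUE)

# Original pattern: FALSE in the else branch, so the second call discards the first result.
overwritten <- ifelse(consonant_run, TRUE, FALSE)
overwritten <- ifelse(repeat_run, TRUE, FALSE)

# Corrected pattern: the else branch keeps whatever was already flagged.
carried <- ifelse(consonant_run, TRUE, FALSE)
carried <- ifelse(repeat_run, TRUE, carried)

# The same result written directly as a logical OR.
combined <- consonant_run | repeat_run

print(data.frame(text, overwritten, carried, combined))
```

With the overwriting version only "flahaaaa" ends up flagged, while the carried and combined versions agree with each other. Since the chained ifelse calls amount to an OR, writing `invalid = check_a | check_b` inside `mutate` may be the more readable choice.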