Issue with negative lookbehind in R

Question

I have this set of sentences:

w <- c("so i said er well it would n't surprise me if it could bloody talk",  # quote marker
        "we got fifteen, well thirteen minutes",                              
        "well she brought a pie and she brought some er punch round",         
        "so your dad said well have n't i been soft ?",                       # quote marker
        "And he went [pause] well I can't feel any. ",                        # quote marker
        "I goes well they'll improve the grant to start off with",            # quote marker
        "so with the chips as well this is about one sixty .",                
        "well we 're not all the same are we , but")

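All of the strings contain the word well. I am interested in the strings where well functions as a quote marker, as indicated by the presence of said, goes, or went. Using a positive lookbehind, I can match these sentences: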

grep("(?<=said|goes|went).*well", w, value = T, perl = T)
[1] "so i said er well it would n't surprise me if it could bloody talk"
[2] "so your dad said well have n't i been soft ?"                      
[3] "And he went [pause] well I can't feel any. "                       
[4] "I goes well they'll improve the grant to start off with"

The problem is that a negative lookbehind, which I expected to match the strings where 'well' is not a quote marker, does not work. For example, this matches everything:

grep("(?<!said|goes|went).*well", w, value = T, perl = T)
[1] "so i said er well it would n't surprise me if it could bloody talk" # not match
[2] "we got fifteen, well thirteen minutes"                              # match
[3] "well she brought a pie and she brought some er punch round"         # match    
[4] "so your dad said well have n't i been soft ?"                       # not match         
[5] "And he went [pause] well I can't feel any. "                        # not match             
[6] "I goes well they'll improve the grant to start off with"            # not match         
[7] "so with the chips as well this is about one sixty ."                # match      
[8] "well we 're not all the same are we , but"                          # match

Why doesn't it match correctly, and how does it need to be changed so that it does?

Thanks in advance!

Answer 1

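This happens because (?<!said|goes|went) matches any position in the string that is not immediately preceded by one of the listed strings. The .* then matches any 0 or more characters other than line break characters, as many as possible, and then well is matched. There are many such valid positions: the very start of each string, for instance, is never preceded by said, goes, or went, so the lookbehind succeeds there and the rest of the pattern only has to find well somewhere to the right.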
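To make this concrete, the following minimal sketch runs the question's negative-lookbehind pattern with regexpr() on two of the sentences and prints where each match starts. The variable names x, m, starts, and lens are illustrative only and are not part of the original post.

```r
# Two sentences from the question: one with a quote marker, one without.
x <- c("so your dad said well have n't i been soft ?",
       "we got fifteen, well thirteen minutes")

# regexpr() returns the 1-based start of the first match in each string,
# with the match length stored in the "match.length" attribute.
m <- regexpr("(?<!said|goes|went).*well", x, perl = TRUE)
starts <- as.integer(m)
lens <- attr(m, "match.length")

# Both matches begin at character 1, because the lookbehind is already
# satisfied at the start of the string.
print(starts)

# The matched text runs from the start of the string through "well".
matched <- substring(x, starts, starts + lens - 1)
print(matched)
```

Since the match in the first sentence starts before said rather than right after it, the lookbehind never gets a chance to reject that sentence.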

The simplest approach is to match the strings where said, goes, or went appears before well and skip them, and then match well in all other contexts:

\b(?:said|goes|went)\b.*\bwell\b(*SKIP)(*F)|\bwell\b

See the regex demo.

NOTE: If you use a solution like ^(?!.*\b(?:said|goes|went)\b).*\bwell\b, you may get false negatives when said, goes, or went appears after well.
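A small sketch of that false negative, using a made-up sentence (not from the question's data) in which said comes after well. The names x, lookahead_hit, and skip_hit are hypothetical.

```r
# Hypothetical sentence: "well" is not a quote marker here,
# but "said" occurs later in the string.
x <- "well , that is what he said"

# Lookahead-based pattern: rejects the string because "said" appears
# anywhere in it, even though it comes after "well".
lookahead_hit <- grepl("^(?!.*\\b(?:said|goes|went)\\b).*\\bwell\\b",
                       x, perl = TRUE)

# SKIP-FAIL pattern: only discards "well" when a trigger word precedes it.
skip_hit <- grepl("\\b(?:said|goes|went)\\b.*\\bwell\\b(*SKIP)(*F)|\\bwell\\b",
                  x, perl = TRUE)

print(lookahead_hit)  # FALSE
print(skip_hit)       # TRUE
```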

Pattern details

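\b(?:said|goes|went)\b.*\bwell\b(*SKIP)(*F) - a whole word said, goes, or went, then any 0 or more characters, as many as possible, then a whole word well; once this match is found, it is discarded, and the regex engine goes on looking for a match starting at the position where the failure occurred
| - or
\bwell\b - a whole word well.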
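The effect of the (*SKIP)(*F) branch can be seen by listing all match positions in a single string. The sketch below uses a made-up sentence with one well before said and one after it; x, plain, and skipped are illustrative names.

```r
# Hypothetical sentence with "well" both before and after "said".
x <- "well , he said well again"

# Plain whole-word search: finds both occurrences of "well".
plain <- as.integer(gregexpr("\\bwell\\b", x, perl = TRUE)[[1]])

# SKIP-FAIL pattern: the stretch from "said" through the following "well"
# is consumed and discarded, so only the first "well" is reported.
skipped <- as.integer(
  gregexpr("\\b(?:said|goes|went)\\b.*\\bwell\\b(*SKIP)(*F)|\\bwell\\b",
           x, perl = TRUE)[[1]]
)

print(plain)    # 1 16
print(skipped)  # 1
```

Because grep() only needs one surviving match per string, a sentence like this one is still returned by the final pattern.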

See an R demo:

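# Backslashes must be doubled inside an R string literal ("\b" alone is a backspace character).
grep("\\b(?:said|goes|went)\\b.*\\bwell\\b(*SKIP)(*F)|\\bwell\\b", w, value = TRUE, perl = TRUE)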
# [1] "we got fifteen, well thirteen minutes"                     
# [2] "well she brought a pie and she brought some er punch round"
# [3] "so with the chips as well this is about one sixty ."       
# [4] "well we 're not all the same are we , but"
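If the backtracking verbs feel opaque, roughly the same filtering can be sketched with two grepl() calls and a logical negation. This is an alternative formulation, not the approach given above, and the names x, has_well, quoted, and kept are assumptions made for illustration.

```r
# Four sentences from the question: the first two use "well" as a quote marker.
x <- c("so your dad said well have n't i been soft ?",
       "I goes well they'll improve the grant to start off with",
       "we got fifteen, well thirteen minutes",
       "so with the chips as well this is about one sixty .")

# Strings that contain the whole word "well" at all.
has_well <- grepl("\\bwell\\b", x, perl = TRUE)

# Strings where a trigger word occurs somewhere before "well".
quoted <- grepl("\\b(?:said|goes|went)\\b.*\\bwell\\b", x, perl = TRUE)

# Keep strings with "well" that is not preceded by a trigger word.
kept <- x[has_well & !quoted]
print(kept)
```

One difference to keep in mind: a string with well both before and after a trigger word is dropped by this filter, whereas the SKIP-FAIL pattern keeps it because of the earlier well.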


regex

r

regex-lookarounds