Issue with negative lookbehind in R
I have this set of sentences:
w <- c("so i said er well it would n't surprise me if it could bloody talk", # quote marker
"we got fifteen, well thirteen minutes",
"well she brought a pie and she brought some er punch round",
"so your dad said well have n't i been soft ?", # quote marker
"And he went [pause] well I can't feel any. ", # quote marker
"I goes well they'll improve the grant to start off with", # quote marker
"so with the chips as well this is about one sixty .",
"well we 're not all the same are we , but")
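All strings contain the word well. I'm interested in those strings in which well is used as a quote marker, as indicated by the occurrence of said, goes, and went. Using a positive lookbehind I can match these sentences: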
grep("(?<=said|goes|went).*well", w, value = T, perl = T)
[1] "so i said er well it would n't surprise me if it could bloody talk"
[2] "so your dad said well have n't i been soft ?"
[3] "And he went [pause] well I can't feel any. "
[4] "I goes well they'll improve the grant to start off with"
The problem I have is that a negative lookbehind, intended to match the strings where 'well' is not a quote marker, does not work. For example, this matches everything:
grep("(?<!said|goes|went).*well", w, value = T, perl = T)
[1] "so i said er well it would n't surprise me if it could bloody talk" # not match
[2] "we got fifteen, well thirteen minutes" # match
[3] "well she brought a pie and she brought some er punch round" # match
[4] "so your dad said well have n't i been soft ?" # not match
[5] "And he went [pause] well I can't feel any. " # not match
[6] "I goes well they'll improve the grant to start off with" # not match
[7] "so with the chips as well this is about one sixty ." # match
[8] "well we 're not all the same are we , but" # match
Why does it not match correctly, and how must it be changed to match correctly?
Thanks in advance!
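This happens because (?<!said|goes|went) matches a position in the string that is not immediately preceded by one of the listed strings. .* then matches any 0+ characters other than line break characters, as many as possible, and then well is matched. There are many such valid positions: the very start of each string, for instance, is never preceded by said, goes or went, so the lookbehind succeeds there and .* consumes everything up to well.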
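To make this concrete, here is a minimal sketch using base R's regexpr() on two of the sentences from the question (the vector name w2 is only illustrative). Assuming PCRE behaves as described above, the match should begin at the first character of each string, whether or not said occurs later:

```r
# Two of the sentences from the question; w2 is just an illustrative name.
w2 <- c("so your dad said well have n't i been soft ?",
        "we got fifteen, well thirteen minutes")

# regexpr() reports where the first match starts in each string.
m <- regexpr("(?<!said|goes|went).*well", w2, perl = TRUE)

starts <- as.integer(m)             # 1-based start positions
lens   <- attr(m, "match.length")   # number of characters consumed

print(starts)
print(substring(w2, starts, starts + lens - 1))
```

The lookbehind is only tested at the position where the match starts, so a said sitting in the middle of the matched text is simply consumed by .* and never checked.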
The easiest approach is to match the strings where said, goes or went appears before well and skip them, and then match well in all other contexts:
\b(?:said|goes|went)\b.*\bwell\b(*SKIP)(*F)|\bwell\b
See the regex demo.
NOTE: If you use a solution like ^(?!.*\b(?:said|goes|went)\b).*\bwell\b, you may get false negatives when said, goes or went appears after well.
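The difference between the two patterns can be checked with a small sketch. The sentence s below is hypothetical (it is not part of the question's data) and was chosen so that said follows well. Note that in an R string literal each regex backslash has to be doubled:

```r
# Hypothetical sentence (not from the question's data): "said" follows "well".
s <- "well , that is what she said"

lookahead_pat <- "^(?!.*\\b(?:said|goes|went)\\b).*\\bwell\\b"
skipfail_pat  <- "\\b(?:said|goes|went)\\b.*\\bwell\\b(*SKIP)(*F)|\\bwell\\b"

# The lookahead rejects the whole string as soon as "said" occurs anywhere.
rejected_by_lookahead <- !grepl(lookahead_pat, s, perl = TRUE)
# The SKIP/FAIL pattern only skips "well" when it comes after said/goes/went.
kept_by_skipfail      <- grepl(skipfail_pat, s, perl = TRUE)

print(c(rejected_by_lookahead = rejected_by_lookahead,
        kept_by_skipfail = kept_by_skipfail))
```

Here well comes before said, so it is not a quote marker in the sense of the question, yet the lookahead-based pattern drops the sentence anyway.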
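Pattern details
\b(?:said|goes|went)\b.*\bwell\b(*SKIP)(*F) - a whole word said, goes or went, then any 0 or more characters other than line break characters, as many as possible, then a whole word well. Once this match is found, it is discarded and the regex engine resumes searching from the position where the failure occurred, i.e. right after that well
| - or
\bwell\b - a whole word well.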
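As a rough illustration of the two alternatives, the sketch below asks regexpr() where the pattern matches in two of the question's sentences (the names pat, x and pos are illustrative only). A result of -1 means no match was found:

```r
# The pattern from above, with backslashes doubled for an R string literal.
pat <- "\\b(?:said|goes|went)\\b.*\\bwell\\b(*SKIP)(*F)|\\bwell\\b"

x <- c("so your dad said well have n't i been soft ?",  # well follows said
       "we got fifteen, well thirteen minutes")         # no said/goes/went

# Start position of the first match in each string, or -1 if there is none.
pos <- as.integer(regexpr(pat, x, perl = TRUE))
print(pos)
```

In the first sentence the only well is consumed by the skipped alternative, so nothing is left to match. In the second, the plain \bwell\b alternative matches directly.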
See an R demo (in an R string literal each backslash must be doubled, hence \\b):
grep("\\b(?:said|goes|went)\\b.*\\bwell\\b(*SKIP)(*F)|\\bwell\\b", w, value = TRUE, perl = TRUE)
# [1] "we got fifteen, well thirteen minutes"
# [2] "well she brought a pie and she brought some er punch round"
# [3] "so with the chips as well this is about one sixty ."
# [4] "well we 're not all the same are we , but"