stringr, str_extract: how to do positive lookbehind?

Question

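A very simple question. I just need to capture some strings using a regex positive lookbehind, but I can't find a way to do it.

Here is an example. Suppose I have some strings: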

library(stringr)
myStrings <- c("MFG: acme", "something else", "MFG: initech")

I want to extract the word that follows the prefix "MFG:":

> result_1  <- str_extract(myStrings,"MFG\\s*:\\s*\\w+")
>
> result_1
[1] "MFG: acme"    NA             "MFG: initech"

Almost there, but I don't want to include the "MFG:" part, and that is what a "positive lookbehind" is for:

> result_2  <- str_extract(myStrings,"(?<=MFG\\s*:\\s*)\\w+")
Error in stri_extract_first_regex(string, pattern, opts_regex = attr(pattern,  : 
  Look-Behind pattern matches must have a bounded maximum length. (U_REGEX_LOOK_BEHIND_LIMIT)
>

It complains about needing a "bounded maximum length", but I can't see where to specify one. How do I make the positive lookbehind work? Where can I specify this "bounded maximum length"?

Answer 1

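We can use a regex lookaround. The lookbehind has to be a bounded-length match, so this pattern expects exactly one whitespace character after "MFG:".

str_extract(myStrings, "(?<=MFG:\\s)\\w+")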
#[1] "acme"    NA        "initech"

Answer 2

You need to use str_match, since the "lookbehind" pattern is literal text and the only thing you don't know is the number of whitespace characters:

> result_1  <- str_match(myStrings,"MFG\\s*:\\s*(\\w+)")
> result_1[,2]
##[1] "acme"    NA        "initech"

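The result you need is in the second column.

Note that str_extract cannot be used here, because that function drops the captured values.

As a bonus: lookbehind in ICU regex is not infinite-width but constrained-width, so it accepts bounded quantifiers such as {0,100}. So this works too:

> result_1  <- str_extract(myStrings,"(?<=MFG\\s{0,100}:\\s{0,100})\\w+")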
> result_1
[1] "acme"    NA        "initech"

Answer 3

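I wrote this in Python using a lookbehind. When the parser finds MFG: it picks up the next word:

import re

txt = "MFG: acme, something else, MFG: initech"
# The lookbehind anchors the match right after "MFG:", so the
# whitespace before each word is part of the match.
pattern = r"(?<=MFG\:)\s+\w+"
matches = re.findall(pattern, txt)
for match in matches:
    print(match)

Output: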

 acme
 initech
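
Note the leading space in each result above. As a rough sketch (not part of the original answer, and with a made-up input that has extra spaces after the second "MFG:"), the snippet below compares that lookbehind with a capture-group pattern, the same idea as the str_match approach in Answer 2. It also shows that Python's re module, unlike ICU, rejects a variable-width lookbehind outright.

```python
import re

# Hypothetical input: extra spaces after the second "MFG:" on purpose.
txt = "MFG: acme, something else, MFG:   initech"

# Lookbehind as in the answer above: the match starts right after "MFG:",
# so the leading whitespace is part of each result.
with_space = re.findall(r"(?<=MFG:)\s+\w+", txt)

# Capture-group alternative: when the pattern has exactly one group,
# findall returns only that group, so prefix and spaces are dropped.
words = re.findall(r"MFG\s*:\s*(\w+)", txt)

print(with_space)
print(words)

# Python's re requires a fixed-width lookbehind, so a variable-width
# one fails when the pattern is compiled.
try:
    re.compile(r"(?<=MFG\s*:\s*)\w+")
    lookbehind_ok = True
except re.error as exc:
    lookbehind_ok = False
    print("rejected:", exc)
```

If only the word is wanted, the capture-group form is probably the simpler choice in Python, since it tolerates any amount of whitespace around the colon.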

Tags: regex, r, lookbehind, stringr