Optional pattern part in regex lookbehind

Question

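In the example below, I am trying to extract the text between 'Supreme Court' or 'Supreme Court of the United States' and the next date (including the date). The result below is not what I want, because result 2 includes 'of the United States'.

I assume the error is caused by the .*? part, since . can also match 'of the United States'. Any idea how to exclude it? More generally, I guess the question is how to include an optional 'element' in a lookbehind (which seems to be impossible, since ? makes it non-fixed-length input). Many thanks!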

library(tidyverse)
txt <- c("The US Supreme Court decided on 2 April 2020 The Supreme Court of the United States decided on 5 March 2011 also.")

str_extract_all(txt, regex("(?<=Supreme Court)(\\sof the United States)?.*?\\d{1,2}\\s\\w+\\s\\d{2,4}"))
#> [[1]]
#> [1] " decided on 2 April 2020"                     
#> [2] " of the United States decided on 5 March 2011"

Created on 2021-12-09 by the reprex package (v2.0.1)

I also tried

   str_extract_all(txt, regex("(?<=(Supreme Court)|(Supreme Court of the United States)).*?\\d{1,2}\\s\\w+\\s\\d{2,4}"))

but the result is the same.

Answer 1

You can do this with str_match_all and a capture group:

str_match_all(txt, regex("Supreme Court(?:\\sof the United States)?(.*?\\d{1,2}\\s\\w+\\s\\d{2,4})")) %>% 
  .[[1]] %>% .[, 2]

[1] " decided on 2 April 2020" " decided on 5 March 2011"
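
To make the capture-group idea a bit more concrete, here is a minimal sketch of the same call split into steps. The variable names (pattern, m, full_match, captured) are only illustrative, and the str_trim() call is an optional addition that drops the leading space, not part of the answer above.

```r
library(stringr)

txt <- "The US Supreme Court decided on 2 April 2020 The Supreme Court of the United States decided on 5 March 2011 also."

# The non-capturing group (?: ... )? consumes the optional suffix,
# while the capturing group ( ... ) holds only the text to keep.
pattern <- "Supreme Court(?:\\sof the United States)?(.*?\\d{1,2}\\s\\w+\\s\\d{2,4})"

# str_match_all() returns a list with one character matrix per input string.
m <- str_match_all(txt, pattern)[[1]]

# Column 1 is the full match, column 2 is the first capture group.
full_match <- m[, 1]
captured   <- str_trim(m[, 2])  # optional: remove the leading space
```

Because the optional part is matched outside the capture group, it is consumed by the regex but never ends up in the extracted column, so no lookbehind is needed.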

Answer 2

In this case, I would rather use the Perl-compatible engine available in base R (perl = TRUE) than the ICU library engine used by stringr/stringi.

pattern <- "Supreme Court (of the United States ?)?\\K.*?\\d{1,2}\\s\\w+\\s\\d{2,4}"
regmatches(txt, gregexpr(pattern, txt, perl = TRUE))

[[1]]
[1] "decided on 2 April 2020" "decided on 5 March 2011"
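
As a rough illustration of what \K does in this pattern, the sketch below runs the same regex with and without it. The helper extract_all() and the variables prefix and date_part are hypothetical names introduced only for this comparison.

```r
txt <- "The US Supreme Court decided on 2 April 2020 The Supreme Court of the United States decided on 5 March 2011 also."

# Hypothetical helper: return all matches of a PCRE pattern in one string.
extract_all <- function(pattern, x) {
  regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
}

prefix    <- "Supreme Court (of the United States ?)?"
date_part <- "\\d{1,2}\\s\\w+\\s\\d{2,4}"

# Without \K the prefix is part of the reported match.
without_k <- extract_all(paste0(prefix, ".*?", date_part), txt)

# With \K everything matched before it is dropped from the reported match,
# so the optional prefix can have variable length.
with_k <- extract_all(paste0(prefix, "\\K.*?", date_part), txt)
```

Here \K plays the role the lookbehind was meant to play: the prefix must still match, but only the text after \K is returned.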
