Regex to edit text depending on the number of occurrences of a keyword

Question

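I am struggling to find a regex for the following problem. Say I have a vector of strings that all contain multiple occurrences of the keyword Appendix or appendix, like this: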

text <- c("Appendix abc Appendix def appendix final",
          "blah blah Appendix abc Appendix finalissimo")

I want to remove everything after the last occurrence of the keyword, including the keyword itself, to get the following desired output:

1 Appendix abc Appendix def
2 blah blah Appendix abc

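I know that tidyverse solutions are possible, but I am interested in a regex solution. I have tried many regexes along these lines, but none seems to work. The one I thought most promising involves a negative lookahead and a backreference, but it does not produce the desired result either: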

library(stringr)
str_extract(text, "(?i).*(?!(appendix).*\\1)")

I would appreciate advice on why this attempt does not work, as well as a regex solution that does.

Answer 1

I would use a regex with lookahead logic here:

text <- c("Appendix abc Appendix def appendix final",
          "blah blah Appendix abc Appendix finalissimo")
# Match the keyword only if no further "appendix" follows it, then drop it and the rest
output <- sub("(?i)\\s+appendix(?!.*\\bappendix\\b).*", "", text, perl=TRUE)
output

[1] "Appendix abc Appendix def" "blah blah Appendix abc"
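As for why the attempt in the question leaves the input unchanged, here is a minimal sketch in base R. It uses regmatches() and regexpr() with perl = TRUE in place of stringr::str_extract(), which is my substitution so that the snippet needs no packages. The greedy .* first consumes the whole string. At the end of the string nothing can match appendix, so the negative lookahead succeeds immediately and the full string is returned as the match.

```r
text <- c("Appendix abc Appendix def appendix final",
          "blah blah Appendix abc Appendix finalissimo")

# The pattern from the question, with the backreference escaped for an R string.
attempt <- "(?i).*(?!(appendix).*\\1)"

# Base-R stand-in for str_extract(): the first match in each string.
extracted <- regmatches(text, regexpr(attempt, text, perl = TRUE))

# The greedy .* runs to the end of each string, where the negative
# lookahead has nothing to reject, so no text is removed.
identical(extracted, text)
```

A negative lookahead only constrains what follows the current position. It does not force the preceding .* to stop early, which is why the pattern above anchors on the keyword itself instead.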

Answer 2

You can use sub. The first .* is greedy and will consume everything up to the last match of Appendix.*.

# The fourth positional argument of sub() is ignore.case
sub("(.*)Appendix.*", "\\1", text, TRUE)
#[1] "Appendix abc Appendix def " "blah blah Appendix abc "
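The output above keeps the space that preceded the keyword. If that matters, one option (a sketch, not part of the original answer) is to wrap the call in trimws(), which is available in base R from version 3.2.0:

```r
text <- c("Appendix abc Appendix def appendix final",
          "blah blah Appendix abc Appendix finalissimo")

# Greedy capture up to the last keyword, then strip trailing whitespace.
trimmed <- trimws(sub("(.*)appendix.*", "\\1", text, ignore.case = TRUE),
                  which = "right")
trimmed
```

Trimming afterwards, rather than adding \\s+ to the pattern, also covers a string in which the only occurrence of the keyword sits at the very start.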
