Regex - remove sentences starting with certain words only if it is the last sentence

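As the title says, I am trying to clean up a large number of short texts by removing sentences that start with certain words, but only if that sentence is the last one in a text that has more than one sentence.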

Say I want to remove the last sentence if it starts with 'Jack is ...'. Here is an example covering the different cases:

test_strings <- c("Jack is the tallest person.", 
                  "and Jack is the one who said, let there be fries.", 
                  "There are mirrors. And Jack is there to be suave.", 
                  "There are dogs. And jack is there to pat them. Very cool.", 
                  "Jack is your lumberjack. Jack, is super awesome.",
                  "Whereas Jack is, for the whole summer, sound asleep. Zzzz", 
                  "'Jack is so cool!' Jack is cool. Jack is also cold."
                  )

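This is the regex I have so far (written as an R string): "(?![A-Z'].+[\\.|'] )[Jj]ack,? is.+\\.$"

library(purrr); library(stringr)  # for map_chr() and str_replace()
map_chr(test_strings, ~str_replace(.x, "(?![A-Z'].+[\\.|'] )[Jj]ack,? is.+\\.$", "[TRIM]"))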

This produces the following results:

[1] "[TRIM]"                                                   
[2] "and [TRIM]"                                               
[3] "There are mirrors. And [TRIM]"                            
[4] "There are dogs. And [TRIM]"                               
[5] "Jack is your lumberjack. [TRIM]"                          
[6] "Whereas Jack is, for the whole summer, sound asleep. Zzzz"
[7] "'Jack is so cool!' Jack is cool. [TRIM]"  


## Basically my current regex is still too greedy. 
## No trimming should happen for the first 4 examples. 
## 5 - 7th examples are correct. 

## Explanations:
# (1) Wrong. Only one sentence; do not trim, but current regex trims it. 
# (2) Wrong. It is a sentence but does not start with 'Jack is'.
# (3) Wrong. Same situation as (2) -- the sentence starts with 'And' instead of 'Jack is'
# (4) Wrong. Same as (2) (3), but this time test with lowercase `jack`
# (5) Correct. Trim the second sentence as it is the last. Optional ',' removal is tested here.
# (6) Correct.
# (7) Correct. Sometimes texts do not begin with a letter. 

Thanks for your help!

gsub("^(.*\\.)\\s*Jack,? is[^.]*\\.?$", "\\1 [TRIM]", test_strings, ignore.case = TRUE)
# [1] "Jack is the tallest person."                              
# [2] "and Jack is the one who said, let there be fries."        
# [3] "There are mirrors. And Jack is there to be suave."        
# [4] "There are dogs. And jack is there to pat them. Very cool."
# [5] "Jack is your lumberjack. [TRIM]"                          
# [6] "Whereas Jack is, for the whole summer, sound asleep. Zzzz"
# [7] "'Jack is so cool!' Jack is cool. [TRIM]"                  

Breakdown:

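  • ^(.*\.)\s*: there has to be at least one sentence before the one we trim, so we need to find a preceding dot \.;
  • Jack,? is: taken from your regex;
  • [^.]*\.?$: zero or more non-dot characters, followed by an optional dot and the end of the string. If you want to allow trailing whitespace after the last sentence, change it to [^.]*\.?\s*$, though that does not seem necessary for your examples.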
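
The trailing-whitespace variant mentioned in the last bullet can be sketched as follows. The `padded` vector is hypothetical input made up for illustration, not part of the original data; the only change to the pattern is the added \s* before the end-of-string anchor.

```r
# Hypothetical inputs: the text ends with trailing spaces.
padded <- c("Jack is your lumberjack. Jack, is super awesome.  ",
            "Jack is the tallest person. ")

# Same pattern as above, with \\s* added before the end-of-string anchor.
trimmed <- sub("^(.*\\.)\\s*Jack,? is[^.]*\\.?\\s*$", "\\1 [TRIM]",
               padded, ignore.case = TRUE)

print(trimmed)
# [1] "Jack is your lumberjack. [TRIM]" "Jack is the tallest person. "
```

The second string is left alone because it contains only one sentence, so there is no earlier dot for the capture group to anchor on.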

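You can match a dot (or use a character class such as [.!?] to match more characters), and then match a last sentence that starts with Jack is and ends with a dot (or, again, a character class to match more characters):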

\.\K\h*[Jj]ack,? is[^.\n]*\.$

The pattern matches:

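  • \.\K matches a . and then forgets what has been matched so far
  • \h*[Jj]ack,? is matches optional horizontal whitespace characters, then Jack or jack, an optional comma, and is
  • [^.\n]*\. optionally matches any characters other than . or a newline, then matches a .
  • $ end of string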
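
As a rough sketch of the character-class variant mentioned above, the dots can be swapped for [.!?] in all three places. The `mixed` vector is hypothetical input invented for this illustration, and the widened pattern is an assumption about what "matching more characters" would look like, not something taken from the question.

```r
# Hypothetical inputs using ?, ! and . as sentence terminators.
mixed <- c("Is Jack there? Jack is late!",
           "Jack is here. Jack is happy?",
           "Jack is alone!")

# [.!?] replaces the literal dot before \\K, inside the negated class,
# and at the end of the last sentence. \\K and \\h require perl = TRUE.
out <- sub("[.!?]\\K\\h*[Jj]ack,? is[^.!?\\n]*[.!?]$", " [TRIM]",
           mixed, perl = TRUE)

print(out)
# [1] "Is Jack there? [TRIM]" "Jack is here. [TRIM]"  "Jack is alone!"
```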

Example code:

test_strings <- c("Jack is the tallest person.", 
                  "and Jack is the one who said, let there be fries.", 
                  "There are mirrors. And Jack is there to be suave.", 
                  "There are dogs. And jack is there to pat them. Very cool.", 
                  "Jack is your lumberjack. Jack, is super awesome.",
                  "Whereas Jack is, for the whole summer, sound asleep. Zzzz", 
                  "'Jack is so cool!' Jack is cool. Jack is also cold."
                  )

sub("\\.\\K\\h*[Jj]ack,? is[^.\\n]*\\.$", " [TRIM]", test_strings, perl = TRUE)

Output

[1] "Jack is the tallest person."                              
[2] "and Jack is the one who said, let there be fries."        
[3] "There are mirrors. And Jack is there to be suave."        
[4] "There are dogs. And jack is there to pat them. Very cool."
[5] "Jack is your lumberjack. [TRIM]"                          
[6] "Whereas Jack is, for the whole summer, sound asleep. Zzzz"
[7] "'Jack is so cool!' Jack is cool. [TRIM]"
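
The two patterns in this thread give the same result on the sample data, but they are not equivalent in general. A small sketch of where they diverge, using a hypothetical input `x` that is not from the question: the gsub() pattern is case-insensitive and treats the final dot as optional, while the \K pattern only accepts Jack or jack and requires the final dot.

```r
# Hypothetical input: upper-case name and no final period.
x <- "It rained. JACK IS shouting"

# Pattern using a capture group: case-insensitive, final dot optional.
first <- gsub("^(.*\\.)\\s*Jack,? is[^.]*\\.?$", "\\1 [TRIM]",
              x, ignore.case = TRUE)

# Pattern using \\K: only Jack/jack, final dot required.
second <- sub("\\.\\K\\h*[Jj]ack,? is[^.\\n]*\\.$", " [TRIM]",
              x, perl = TRUE)

print(first)   # "It rained. [TRIM]"
print(second)  # "It rained. JACK IS shouting" (unchanged)
```

Which behavior is preferable depends on how strictly a "sentence" should be defined for the texts being cleaned.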