Extract first sentence in string

Question

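I want to extract the first sentence from the string below with a regex. The rule I want to implement (which I know is not a universal solution) is to extract from the string start ^ up to and including the first period/exclamation/question mark that is preceded by a lowercase letter or a digit.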

require(stringr)

x = "Bali bombings: U.S. President George W. Bush amongst many others has condemned the perpetrators of the Bali car bombing of October 11. The death toll has now risen to at least 187."

My best guess so far was a non-greedy string-before-match approach, which fails in this case:

str_extract(x, '.+?(?=[a-z0-9][.?!] )')
[1] NA

Any hints are highly appreciated.

Answer 1

You put [a-z0-9][.?!] into a non-consuming lookahead pattern. If you plan to use str_extract, you need to make it consuming:

> str_extract(x, '.*?[a-z0-9][.?!](?= )')
[1] "Bali bombings: U.S. President George W. Bush amongst many others has condemned the perpetrators of the Bali car bombing of October 11."

See this regex demo.

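Details

.*? - any 0+ characters other than line break characters, as few as possible
[a-z0-9] - an ASCII lowercase letter or digit
[.?!] - a ., ? or !
(?= ) - followed by a literal space.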
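
To make the consuming versus non-consuming distinction concrete, here is a minimal base R sketch. It assumes that perl = TRUE treats these two patterns the same way stringr's engine does, and first_match is a hypothetical helper, not part of stringr or base R.

```r
# Hypothetical helper: first match of `pattern` in `text`, or NA if none.
first_match <- function(pattern, text) {
  m <- regexpr(pattern, text, perl = TRUE)  # perl = TRUE enables lookaheads
  if (m[1] == -1) NA_character_ else regmatches(text, m)
}

s <- "It was day 3. Then it rained."

# Terminator inside the lookahead: it is asserted but not part of the match.
lookahead_only <- first_match(".+?(?=[a-z0-9][.?!] )", s)
print(lookahead_only)  # "It was day "

# Terminator consumed; only the trailing space stays in the lookahead.
consumed <- first_match(".*?[a-z0-9][.?!](?= )", s)
print(consumed)        # "It was day 3."
```

Anything placed inside (?= ) is checked but left out of the result, so the sentence terminator has to sit outside the lookahead if it should be returned.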

Alternatively, you can use sub:

sub("([a-z0-9][?!.])\\s.*", "\\1", x)

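See this regex demo.

Details

([a-z0-9][?!.]) - Group 1 (referred to with \1 from the replacement pattern): an ASCII lowercase letter or digit, and then a ?, ! or .
\s - a whitespace
.* - any 0+ characters, as many as possible (up to the end of the string).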
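
One practical difference from the str_extract approach is worth a quick sketch: when nothing matches, sub returns its input unchanged, whereas str_extract returns NA. The wrapper name first_sentence_sub below is hypothetical and only used for illustration.

```r
# Hypothetical wrapper around the sub() call; works on character vectors.
first_sentence_sub <- function(text) {
  # Keep group 1 (letter/digit plus terminator), drop the whitespace and the rest.
  sub("([a-z0-9][?!.])\\s.*", "\\1", text)
}

inputs <- c("It was day 3. Then it rained.",
            "Really? Yes.",
            "no terminator here")

result <- first_sentence_sub(inputs)
print(result)
# "It was day 3."  "Really?"  "no terminator here"
```

If unmatched inputs should be flagged rather than passed through, a check such as grepl with the same pattern before calling sub would be one option.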

Answer 2

corpus has special handling for abbreviations when determining sentence boundaries:

library(corpus)       
text_split(x, "sentences")
#>   parent index text                                                                                                                           
#> 1 1          1 Bali bombings: U.S. President George W. Bush amongst many others has condemned the perpetrators of the Bali car bombing of Oct…
#> 2 1          2 The death toll has now risen to at least 187.

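There is also a useful dataset with common abbreviations for many languages, including English. See corpus::abbreviations_en, which can be used for disambiguating sentence boundaries.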
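
To illustrate the general idea of abbreviation-based disambiguation, here is a rough base R sketch. It is not how corpus implements sentence splitting; demo_abbrevs is a tiny made-up list standing in for corpus::abbreviations_en, and first_sentence_abbrev is a hypothetical helper.

```r
# Tiny illustrative list (assumption); a real abbreviation list is far larger.
demo_abbrevs <- c("U.S.", "W.", "Oct.", "Mr.", "Dr.")

# Hypothetical helper: first sentence, skipping terminators that end an abbreviation.
first_sentence_abbrev <- function(text, abbreviations = demo_abbrevs) {
  # Candidate boundaries: ., ? or ! followed by whitespace or the end of the string.
  ends <- gregexpr("[.?!](?=\\s|$)", text, perl = TRUE)[[1]]
  if (ends[1] == -1) return(text)
  for (end in ends) {
    candidate <- substr(text, 1, end)
    # Last whitespace-delimited token, i.e. the one carrying the terminator.
    last_token <- sub(".*\\s", "", candidate)
    if (!(last_token %in% abbreviations)) return(candidate)
  }
  text  # every candidate was an abbreviation: return the input as is
}

demo <- "He met Dr. Smith in Oct. 2002. It rained."
print(first_sentence_abbrev(demo))  # "He met Dr. Smith in Oct. 2002."
```

Unlike the lowercase-or-digit rule from the question, this variant also accepts a sentence that ends in an uppercase letter, at the cost of depending on how complete the abbreviation list is.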

regex

r

stringr