Separate sentences ending with a scientific reference number in R

Question

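I'm working on a project in which one step is splitting the text of scientific articles into sentences. For this I'm using textrank, which, as far as I understand, looks for . or ? or ! and so on to identify the end of a sentence when tokenizing.

The problem I'm running into is sentences that end with a period immediately followed by a reference number (which may also be in parentheses or brackets). The examples below represent the patterns I have identified and collected so far.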


xx <- c("hello.1 World", "hello.1,2 World", "hello.(1) world", "hello.(1,2) World", "hello.[1,2] World", "hello.[1] World")

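I did some searching, and it looks like "sentence boundary detection" is a complex, domain-specific science in its own right.

The only way I can think of to solve this (at least in my case) is to write a regular expression that adds a space after the period, so that textrank can recognize the sentence end using its usual patterns.

Any suggestions on how to do this with regular expressions in R? I did my best to search online but couldn't find an answer.

This question explains how to add a space between a lowercase and an uppercase letter: Add space between two letters in a string in R. In my case, I believe I need to add a space between a letter followed by a period and a digit or bracket.

My expected output is this: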

("hello. 1 World", "hello. 1,2 World",  "hello. (1) world", "hello. (1,2) World", "hello. [1,2] World", "hello. [1] World")

Thanks

Answer 1

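For the exact sample input you gave us, you can do a regex search on the following pattern, which matches a period only when it is directly followed by a bare number, a parenthesized number list, or a bracketed number list: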

\.(?=\d+|\(\d+(?:,\d+)*\)|\[\d+(?:,\d+)*\])

Then replace the match with a dot followed by a space. Sample script:

xx <- c("hello.1 World", "hello.1,2 World", "hello.(1) world", "hello.(1,2) World",
        "hello.[1,2] World", "hello.[1] World")
# Backslashes are doubled inside an R string literal; perl=TRUE enables the lookahead (?=...)
output <- gsub("\\.(?=\\d+|\\(\\d+(?:,\\d+)*\\)|\\[\\d+(?:,\\d+)*\\])", ". ", xx, perl=TRUE)
output

[1] "hello. 1 World"     "hello. 1,2 World"   "hello. (1) world"
[4] "hello. (1,2) World" "hello. [1,2] World" "hello. [1] World"
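One caveat: on real article text, the pattern above would also split decimals such as 7.4, because the period there is followed by a digit too. A possible variant, sketched here under the assumption that a reference number always follows a word ending in an ASCII letter, adds a lookbehind so the period must come right after a letter. The function name `add_space_before_ref` is made up for illustration and is not part of textrank or base R.

```r
# Hypothetical helper (illustrative name, not from any package).
add_space_before_ref <- function(x) {
  # (?<=[A-Za-z]) : the period must directly follow an ASCII letter (assumption)
  # (?=...)       : and be followed by a bare, parenthesized, or bracketed number list
  pattern <- "(?<=[A-Za-z])\\.(?=\\d|\\(\\d+(?:,\\d+)*\\)|\\[\\d+(?:,\\d+)*\\])"
  gsub(pattern, ". ", x, perl = TRUE)
}

xx <- c("hello.1 World", "hello.(1,2) World", "hello.[1] World",
        "pH was 7.4 overall")
result <- add_space_before_ref(xx)
print(result)
```

With this variant the decimal in the last element is left untouched, while the three reference patterns still get a space inserted. It will still mis-split abbreviations written without a space, such as "Fig.1", so it is worth checking against a sample of your own text.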


regex

r

tokenize

sentence