Tokenizing sentences with unnest_tokens(), ignoring abbreviations
I am using the excellent tidytext package to tokenize sentences in several paragraphs. For example, I want to take the following paragraph:
"I am perfectly convinced by it that Mr. Darcy has no defect. He owns it himself without disguise."
and tokenize it into two sentences:
- "I am perfectly convinced by it that Mr. Darcy has no defect."
- "He owns it himself without disguise."
However, when I use tidytext's default sentence tokenizer, I get three sentences.
Code:
library(dplyr)
library(tidytext)

df <- data_frame(Example_Text = c("I am perfectly convinced by it that Mr. Darcy has no defect. He owns it himself without disguise."))
unnest_tokens(df, input = "Example_Text", output = "Sentence", token = "sentences")
Result:
# A tibble: 3 x 1
Sentence
<chr>
1 i am perfectly convinced by it that mr.
2 darcy has no defect.
3 he owns it himself without disguise.
Is there an easy way to use tidytext to tokenize sentences without running into the problem of common abbreviations such as "Mr." or "Dr." being interpreted as sentence endings?
You can use a regular expression as the splitting condition, but there is no guarantee that it covers every common honorific:
# Split on a period unless it is immediately preceded by a two-letter
# word ending in "r" (Mr, Dr, Sr, Jr); note the backslashes must be
# doubled inside an R string
unnest_tokens(df, input = "Example_Text", output = "Sentence", token = "regex",
              pattern = "(?<!\\b\\p{L}r)\\.")
Result:
# A tibble: 2 x 1
Sentence
<chr>
1 i am perfectly convinced by it that mr. darcy has no defect
2 he owns it himself without disguise
You can of course always create your own list of common titles and build a regular expression from it:
titles = c("Mr", "Dr", "Mrs", "Ms", "Sr", "Jr")
regex = paste0("(?<!\\b(", paste(titles, collapse = "|"), "))\\.")
# > regex
# [1] "(?<!\\b(Mr|Dr|Mrs|Ms|Sr|Jr))\\."
unnest_tokens(df, input = "Example_Text", output = "Sentence", token = "regex",
pattern = regex)
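Expected result (the same two-sentence split as above, since every title in the example is on the list):
# A tibble: 2 x 1
Sentence
<chr>
1 i am perfectly convinced by it that mr. darcy has no defect
2 he owns it himself without disguise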
Both corpus and quanteda give abbreviations special treatment when determining sentence boundaries. Here is how to split the sentences with corpus:
library(dplyr)
library(corpus)
df <- data_frame(Example_Text = c("I am perfectly convinced by it that Mr. Darcy has no defect. He owns it himself without disguise."))
text_split(df$Example_Text, "sentences")
## parent index text
## 1 1 1 I am perfectly convinced by it that Mr. Darcy has no defect.
## 2 1 2 He owns it himself without disguise.
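quanteda can be used in much the same way; a minimal sketch, assuming quanteda's tokens() with what = "sentence" (which likewise does not treat common abbreviations as sentence boundaries):
library(quanteda)
# Sentence-level tokens; "Mr." should stay inside its sentence
tokens(df$Example_Text, what = "sentence")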
If you want to stick with unnest_tokens but would like a more exhaustive list of English abbreviations, you can follow @useR's suggestion above but use the corpus abbreviation list instead (most of which was taken from the Common Locale Data Repository), as sketched after the list below:
abbreviations_en
## [1] "A." "A.D." "a.m." "A.M." "A.S." "AA."
## [7] "AB." "Abs." "AD." "Adj." "Adv." "Alt."
## [13] "Approx." "Apr." "Aug." "B." "B.V." "C."
## [19] "C.F." "C.O.D." "Capt." "Card." "cf." "Col."
## [25] "Comm." "Conn." "Cont." "D." "D.A." "D.C."
## (etc., 155 total)
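For example, a rough sketch of building the same kind of lookbehind from that list (hypothetical glue code; it assumes corpus's abbreviations_en and relies on the fact that an ICU lookbehind may contain an alternation of fixed strings):
library(corpus)
# Drop each abbreviation's trailing period, then escape the internal ones
abbr <- sub("\\.$", "", abbreviations_en)
abbr <- gsub(".", "\\.", abbr, fixed = TRUE)
abbr_regex <- paste0("(?<!\\b(", paste(abbr, collapse = "|"), "))\\.")
unnest_tokens(df, input = "Example_Text", output = "Sentence", token = "regex",
              pattern = abbr_regex)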