Tokenizing sentences with unnest_tokens(), ignoring abbreviations
I am using the excellent tidytext package to tokenize sentences in several paragraphs. For example, I want to take the following paragraph:
"I am perfectly convinced by it that Mr. Darcy has no defect. He owns it himself without disguise."
and tokenize it into two sentences:
- "I am perfectly convinced by it that Mr. Darcy has no defect."
- "He owns it himself without disguise."
However, when I use tidytext's default sentence tokenizer, I get three sentences.
Code:
library(dplyr)
library(tidytext)

df <- data_frame(Example_Text = c("I am perfectly convinced by it that Mr. Darcy has no defect. He owns it himself without disguise."))
unnest_tokens(df, input = "Example_Text", output = "Sentence", token = "sentences")
Result:
# A tibble: 3 x 1
Sentence
<chr>
1 i am perfectly convinced by it that mr.
2 darcy has no defect.
3 he owns it himself without disguise.
Is there an easy way to use tidytext to tokenize sentences without running into the problem of common abbreviations such as "Mr." or "Dr." being interpreted as sentence endings?
You can use a regular expression as the splitting condition, but there is no guarantee that it covers every common honorific:
# Split on a period unless it is immediately preceded by a two-letter
# word ending in "r" (Mr, Dr, Sr, Jr); note the backslashes must be
# doubled inside an R string
unnest_tokens(df, input = "Example_Text", output = "Sentence", token = "regex",
              pattern = "(?<!\\b\\p{L}r)\\.")
Result:
# A tibble: 2 x 1
Sentence
<chr>
1 i am perfectly convinced by it that mr. darcy has no defect
2 he owns it himself without disguise
You can of course always create your own list of common titles and build a regular expression from it:
titles = c("Mr", "Dr", "Mrs", "Ms", "Sr", "Jr")
regex = paste0("(?<!\\b(", paste(titles, collapse = "|"), "))\\.")
# > regex
# [1] "(?<!\\b(Mr|Dr|Mrs|Ms|Sr|Jr))\\."
unnest_tokens(df, input = "Example_Text", output = "Sentence", token = "regex",
pattern = regex)
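Expected result (the same two-sentence split as above, since every title in the example is on the list):
# A tibble: 2 x 1
Sentence
<chr>
1 i am perfectly convinced by it that mr. darcy has no defect
2 he owns it himself without disguise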
Both corpus and quanteda give abbreviations special treatment when determining sentence boundaries. Here is how to split the sentences with corpus:
library(dplyr)
library(corpus)
df <- data_frame(Example_Text = c("I am perfectly convinced by it that Mr. Darcy has no defect. He owns it himself without disguise."))
text_split(df$Example_Text, "sentences")
## parent index text
## 1 1 1 I am perfectly convinced by it that Mr. Darcy has no defect.
## 2 1 2 He owns it himself without disguise.
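quanteda can be used in much the same way; a minimal sketch, assuming quanteda's tokens() with what = "sentence" (which likewise does not treat common abbreviations as sentence boundaries):
library(quanteda)
# Sentence-level tokens; "Mr." should stay inside its sentence
tokens(df$Example_Text, what = "sentence")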
If you want to stick with unnest_tokens but would like a more exhaustive list of English abbreviations, you can follow @useR's suggestion above but use the corpus abbreviation list instead (most of which was taken from the Common Locale Data Repository), as sketched after the list below:
abbreviations_en
## [1] "A." "A.D." "a.m." "A.M." "A.S." "AA."
## [7] "AB." "Abs." "AD." "Adj." "Adv." "Alt."
## [13] "Approx." "Apr." "Aug." "B." "B.V." "C."
## [19] "C.F." "C.O.D." "Capt." "Card." "cf." "Col."
## [25] "Comm." "Conn." "Cont." "D." "D.A." "D.C."
## (etc., 155 total)
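For example, a rough sketch of building the same kind of lookbehind from that list (hypothetical glue code; it assumes corpus's abbreviations_en and relies on the fact that an ICU lookbehind may contain an alternation of fixed strings):
library(corpus)
# Drop each abbreviation's trailing period, then escape the internal ones
abbr <- sub("\\.$", "", abbreviations_en)
abbr <- gsub(".", "\\.", abbr, fixed = TRUE)
abbr_regex <- paste0("(?<!\\b(", paste(abbr, collapse = "|"), "))\\.")
unnest_tokens(df, input = "Example_Text", output = "Sentence", token = "regex",
              pattern = abbr_regex)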