How to clean abbreviations containing a "period-punctuation" ("e.g.", "st.", "rd.") but leave the "." at the end of a sentence?

Question

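I am developing a sentence-level LDA in R and am currently trying to split my text data into individual sentences with the sent_detect() function from the openNLP package.

However, my text data contains a lot of abbreviations that have a "period symbol" but do not mark the end of a sentence. Here are some examples: "st. patricks day", "oxford st.", "blue rd.", "e.g."

Is there a way to write a gsub() call that accounts for such 2-character abbreviations and removes their "." symbol, so that sent_detect() does not wrongly treat it as a sentence boundary? Unfortunately, these abbreviations do not always sit between two words; sometimes they really do mark the end of a sentence:

Example: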

"I really liked Oxford st." - the "st." marks the end of a sentence and the "." should remain.

vs.

"Oxford st. was very busy." - the "st." does not stand at the end of a sentence, thus, the "."-symbol should be replaced.

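I am not sure whether there is a solution for this, but maybe someone who is more familiar with sentence-level analysis knows a way to deal with this kind of problem. Thank you!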

Answer 1

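Looking at your previous questions, I would suggest looking into the textclean package. A lot of what you want is already included in that package. Any missing functionality can be appropriated, reused, or extended.

Just replacing "st." with something will lead to problems, because it could mean street or saint, but "st. patricks day" is easy to find. The problem you will run into is making a list of the occurrences that can appear and finding replacements for them. The easiest tool for this is a translation table. Below I create a table for a few abbreviations and their expected long names. It is now up to you (or your client) to specify what you want as the end result. The best way is to create the table in Excel or a database, load it into a data.frame, and store it somewhere for easy access. Depending on your text this might be a lot of work, but it will improve the quality of your results.

Example: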

library(textclean)

text <- c("I really liked Oxford st.", "Oxford st. was very busy.",
          "e.g. st. Patricks day was on oxford st. and blue rd.")


# Create the abbreviations table, making sure that we look for "rd." and not just "rd"
# (the period is escaped as "\\."). Also: should it be road, or could it mean something else?

abbreviations <- data.frame(abbreviation = c("st. patricks day", "oxford st.", "rd\\.", "e.g."),
                            replacement = c("saint patricks day", "oxford street", "road", "eg"),
                            stringsAsFactors = FALSE)


# I use the replace_contraction function since you can replace the default contraction table with your own table.

text <- replace_contraction(text, abbreviations)

text
# [1] "I really liked oxford street"                             "oxford street was very busy."
# [3] "eg saint patricks day was on oxford street and blue road"

# As the result above shows, some end marks are now missing, so we use the following function to add them again.

text <- add_missing_endmark(text, ".")

text
# [1] "I really liked oxford street."                             "oxford street was very busy."
# [3] "eg saint patricks day was on oxford street and blue road."
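
The table above is built inline. Keeping it in a file, as suggested earlier, could look roughly like the following. This is a minimal sketch in base R only: the file name `abbreviations.csv`, its contents, and the helpers `apply_table()` and `add_endmark()` are assumptions made up for illustration, standing in for textclean's replacement and end-mark functions rather than reproducing them.

```r
# Hypothetical file, written here only so the sketch is self-contained.
# In practice this would be exported from Excel or a database.
csv_path <- file.path(tempdir(), "abbreviations.csv")
writeLines(c(
  "abbreviation,replacement",
  "st. patricks day,saint patricks day",
  "oxford st.,oxford street",
  "blue rd.,blue road",
  "e.g.,eg"
), csv_path)

# Read the table as plain character columns (not factors).
abbreviations <- read.csv(csv_path, stringsAsFactors = FALSE)

# Hypothetical helper: apply every row as a literal (fixed = TRUE) replacement,
# longest pattern first so multi-word entries are handled before shorter ones.
apply_table <- function(x, tbl) {
  tbl <- tbl[order(-nchar(tbl$abbreviation)), ]
  x <- tolower(x)  # the table is lowercase, so lowercase the text as well
  for (i in seq_len(nrow(tbl))) {
    x <- gsub(tbl$abbreviation[i], tbl$replacement[i], x, fixed = TRUE)
  }
  x
}

# Hypothetical helper: put back a final period if a replacement consumed it.
add_endmark <- function(x) {
  ifelse(grepl("[.!?]$", x), x, paste0(x, "."))
}

text <- c("I really liked Oxford st.", "Oxford st. was very busy.")
cleaned <- add_endmark(apply_table(text, abbreviations))
cleaned
```

Because the patterns are matched literally here, no escaping of the periods is needed in the file, which keeps the table easy to maintain for someone who does not know regular expressions.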

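textclean has a range of replace_zzz functions, most of which are based on the mgsub function in the package. Check the documentation of these functions to get an idea of what they do.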
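
If you would rather stay with a plain gsub(), as the question asks, one possible approach is a lookahead that drops the period only when more text follows in the same string. This is a different technique from the translation table above, not part of textclean, and only a sketch: the abbreviation list `abbrevs` and the helper `strip_inner_periods()` are hypothetical names, and it assumes each string holds a single sentence.

```r
# Hypothetical list of abbreviations to treat; extend as needed.
abbrevs <- c("st", "rd")

# Match an abbreviation plus its period only when whitespace and further text follow.
# (?=\s+\S) is a lookahead, so the following text is checked but not consumed.
pattern <- paste0("\\b(", paste(abbrevs, collapse = "|"), ")\\.(?=\\s+\\S)")

strip_inner_periods <- function(x) {
  # "e.g." has an internal period, so it is handled as a literal replacement.
  x <- gsub("e.g.", "eg", x, fixed = TRUE)
  # Lookaheads need perl = TRUE; keep the abbreviation (\\1), drop its period.
  gsub(pattern, "\\1", x, perl = TRUE, ignore.case = TRUE)
}

text <- c("I really liked Oxford st.",
          "Oxford st. was very busy.",
          "e.g. st. Patricks day was on oxford st. and blue rd.")
strip_inner_periods(text)
# [1] "I really liked Oxford st."
# [2] "Oxford st was very busy."
# [3] "eg st Patricks day was on oxford st and blue rd."
```

The limitation is that a regular expression cannot tell whether "st." in the middle of a longer string ends a sentence: in "I liked Oxford st. It was busy." the period would be removed as well. That ambiguity is why a translation table with explicit entries tends to give cleaner results.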


regex

r

text-mining

topic-modeling