Remove punctuation in R but keep the "sentence markers" "!", ".", "?" at the end of a sentence

Question

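I am currently trying to build a sentence-based LDA on a text corpus I am working with. To detect sentences and split them, I use the sent_detect() function from the openNLP package.

However, the dataset I am working with is very unclean and contains a lot of other "punctuation" that I would like to remove before applying sent_detect().

Normally, I would use the following code (from the tm package) on the text corpus to remove punctuation: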
text.corpus <- tm_map(text.corpus, removePunctuation)

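However, this function removes every kind of punctuation, including ".", "?", "!" and "|", which sent_detect() relies on to detect sentences. Using it would therefore defeat my goal of splitting the text into individual sentences.

Is there a way to remove punctuation with the tm_map() function above while excluding specific "sentence indicators" (".", "?", "!", "|")?

Here is an example text: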

not funny; - i did not like the movie / film at all (since the actors were terrible). however, i really enjoyed the scenery!

Normally, removePunctuation strips all punctuation from the text above, leaving the following:

not funny i did not like the movie film at all since the actors were terrible however i really enjoyed the scenery

What I would like to end up with, however, is:

not funny i did not like the movie film at all since the actors were terrible. however i really enjoyed the scenery!

Thanks!

PS: Using the openNLP package is not a must; I am open to any other solution as well!

Answer 1

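You can use gsub: define all the characters you want to remove as a single pattern, join them with the alternation operator |, make sure ( and ) are escaped with a double backslash (\\) inside the R string, and replace every match with "", that is, with an empty replacement:

gsub(";|- |/ |,|\\(|\\)", "", s)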
[1] "not funny i did not like the movie film at all since the actors were terrible. however i really enjoyed the scenery!"

Data:

s <- "not funny; - i did not like the movie / film at all (since the actors were terrible). however, i really enjoyed the scenery!"
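
If listing every unwanted character by hand becomes tedious, a more general variant can be sketched with a negative lookahead in base R. This is only a sketch, not part of the answer above: the helper name remove_punct_keep_markers is hypothetical, and the set of markers to keep (".", "?", "!", "|") is taken from the question.

```r
# Example input taken from the question.
s <- "not funny; - i did not like the movie / film at all (since the actors were terrible). however, i really enjoyed the scenery!"

# Hypothetical helper (not from any package): drop every punctuation
# character except the sentence markers ., ?, ! and |.
remove_punct_keep_markers <- function(x) {
  # (?![.?!|]) is a negative lookahead, so [[:punct:]] only matches
  # punctuation that is not one of the four markers. Requires perl = TRUE.
  x <- gsub("(?![.?!|])[[:punct:]]", "", x, perl = TRUE)
  # Removing sequences such as "; -" leaves runs of spaces; collapse them.
  x <- gsub("\\s+", " ", x)
  trimws(x)
}

cleaned <- remove_punct_keep_markers(s)
print(cleaned)
```

Unlike the explicit alternation, this also removes punctuation that does not occur in the example (quotes, colons, apostrophes and so on), which may or may not be what you want for your corpus.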

Answer 2

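Using stringr and a double-negated ("not-not") character class (thanks to Chris Ruehlemann for the comment). The outer negated class excludes everything that is not punctuation, plus "!", ".", "?" and the literal "|", so only the remaining punctuation is removed:

library(stringr)

s <- "not funny; - i did not like the movie / film at all (since the actors were terrible). however, i really enjoyed the scenery!"

str_remove_all(s, "[^[^[[:punct:]]]!|.|?]")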
[1] "not funny  i did not like the movie  film at all since the actors were terrible. however i really enjoyed the scenery!"
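
Since the question asks about tm_map() specifically, here is a rough sketch of how a custom replacement could be plugged into a tm pipeline through content_transformer(). It assumes the tm package is installed; the one-document corpus and the name keep_markers are made up for illustration, and the pattern is a base R whitelist (keep letters, digits, whitespace and the four markers) rather than the stringr pattern above.

```r
library(tm)

# Example input taken from the question, wrapped in a one-document corpus
# (an assumption made here just to have something to run tm_map() on).
s <- "not funny; - i did not like the movie / film at all (since the actors were terrible). however, i really enjoyed the scenery!"
text.corpus <- VCorpus(VectorSource(s))

# Hypothetical transformer: remove everything that is not alphanumeric,
# whitespace, or one of the sentence markers ., ?, ! and |.
keep_markers <- content_transformer(function(x) {
  gsub("[^[:alnum:][:space:].?!|]", "", x)
})

text.corpus <- tm_map(text.corpus, keep_markers)
# stripWhitespace collapses the double spaces left behind by the removal.
text.corpus <- tm_map(text.corpus, stripWhitespace)

cleaned <- content(text.corpus[[1]])
print(cleaned)
```

The cleaned documents can then be passed on to whichever sentence splitter you use, since the sentence markers are still in place.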

Tags: r, gsub, lda, topic-modeling, data-cleaning