Remove punctuation in R but keep the sentence markers "!", ".", "?" at the end of a sentence

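I am trying to build a sentence-based LDA on a text corpus. To detect sentences and split them, I use the sent_detect() function from the openNLP package.

However, the dataset I am working with is very messy and contains a lot of other punctuation, which I would like to remove before calling sent_detect().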

Normally I would remove punctuation from a text corpus with the following code (from the tm package):
text.corpus <- tm_map(text.corpus, removePunctuation)

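However, this function removes every kind of punctuation, including ".", "?", "!" and "|", which sent_detect() relies on to detect sentences. Removing them therefore defeats my goal of splitting the text into individual sentences.

Is there a way to remove punctuation with the tm_map() function above while excluding specific sentence indicators (".", "?", "!", "|")?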

Here is a sample text:

not funny; - i did not like the movie / film at all (since the actors were terrible). however, i really enjoyed the scenery!

Normally, removePunctuation strips all punctuation from the text above, leaving the following:

not funny i did not like the movie film at all since the actors were terrible however i really enjoyed the scenery

What I would like to end up with, however, is:

not funny i did not like the movie film at all since the actors were terrible. however i really enjoyed the scenery!

Thanks!

PS: Using the openNLP package is not a requirement; I am open to any other solution as well!

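You can use gsub: define all the characters you want to remove as one pattern, join them with the alternation operator |, make sure ( and ) are escaped with a double backslash (\\), and replace every match with "", that is, with an empty replacement string:

gsub(";|- |/ |,|\\(|\\)", "", s)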
[1] "not funny i did not like the movie film at all since the actors were terrible. however i really enjoyed the scenery!"

Data:

s <- "not funny; - i did not like the movie / film at all (since the actors were terrible). however, i really enjoyed the scenery!"
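The alternation above has to list every unwanted character by hand. As a hedged alternative sketch (not part of the original answer), base R can remove the whole [[:punct:]] class while skipping the sentence markers via a negative lookahead, which requires perl = TRUE. The variable names `no_punct` and `cleaned` are illustrative only, and the second gsub call, which collapses the leftover double spaces, is an added assumption about the desired output:

```r
s <- "not funny; - i did not like the movie / film at all (since the actors were terrible). however, i really enjoyed the scenery!"

# Remove any punctuation character unless it is one of . ? ! |
# (?![.?!|]) is a negative lookahead, so perl = TRUE is needed.
no_punct <- gsub("(?![.?!|])[[:punct:]]", "", s, perl = TRUE)

# Removing ";" and "-" leaves runs of spaces; collapse them to one space.
cleaned <- gsub("\\s+", " ", no_punct)
cleaned
```

For the sample text this should give the sentence the question asks for, with "." and "!" preserved.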

Using stringr and a double-negated ("not-not") character class (thanks to Chris Ruehlemann's comment):

s <- "not funny; - i did not like the movie / film at all (since the actors were terrible). however, i really enjoyed the scenery!"

library(stringr)
str_remove_all(s, "[^[^[[:punct:]]]!|.|?]")
[1] "not funny  i did not like the movie  film at all since the actors were terrible. however i really enjoyed the scenery!"
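Neither answer shows how to plug this into tm_map(), which is what the question asks about. One possible sketch, assuming the tm package is installed, wraps a plain cleaning function in tm's content_transformer(). The helper name `keep_sentence_marks` is hypothetical, and the base R lookahead pattern inside it is a swapped-in technique rather than either answerer's exact regex:

```r
# Hypothetical helper: drop punctuation except . ? ! | and squeeze whitespace.
keep_sentence_marks <- function(x) {
  x <- gsub("(?![.?!|])[[:punct:]]", "", x, perl = TRUE)
  gsub("\\s+", " ", x)
}

# Works on plain character vectors without tm.
keep_sentence_marks("not funny; - i did not like it (at all). however, nice scenery!")

# With tm available, content_transformer() lets tm_map() apply the helper
# to each document in the corpus.
if (requireNamespace("tm", quietly = TRUE)) {
  library(tm)
  text.corpus <- VCorpus(VectorSource("not funny; - i did not like it (at all). however, nice scenery!"))
  text.corpus <- tm_map(text.corpus, content_transformer(keep_sentence_marks))
  print(content(text.corpus[[1]]))
}
```

The cleaned corpus can then be passed on to sentence detection, since the sentence-ending markers are still in place.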