Package tm: how do I stop removeWords from removing certain "english" stopwords (negations specifically)?

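I'd like to use removeWords with stopwords("english") like this: corpus <- tm_map(corpus, removeWords, stopwords("english")). However, there are some words, such as "not" and other negations, that I want to keep.

Is it possible to use removeWords with stopwords("english") but exclude certain words from that list when specified?

For example, how could I prevent "not" from being removed?

(Secondary) Is it possible to set up this kind of exclusion list so that it covers all negations?

I would rather not resort to building my own custom list from the stoplist words I'm interested in.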

You can create a custom stopword list by taking the set difference between stopwords("en") and the list of words you want to keep:

exceptions   <- c("not")
my_stopwords <- setdiff(stopwords("en"), exceptions)
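To tie this back to the tm_map call in the question, here is a minimal sketch of passing the custom list in place of stopwords("english"). The two-sentence corpus and the `cleaned` variable are made up for illustration; everything else uses standard tm functions.

```r
library(tm)

# Hypothetical two-document corpus, only for illustration
docs   <- c("this is not a good idea", "that was the best day")
corpus <- VCorpus(VectorSource(docs))

exceptions   <- c("not")
my_stopwords <- setdiff(stopwords("en"), exceptions)

# Remove every English stopword except the ones listed in `exceptions`
corpus <- tm_map(corpus, removeWords, my_stopwords)
# removeWords leaves gaps behind, so collapse the extra whitespace
corpus <- tm_map(corpus, stripWhitespace)

# Pull the cleaned text back out as a plain character vector
cleaned <- trimws(unname(sapply(corpus, as.character)))
print(cleaned)
```

With this, "not" should survive in the first document while ordinary stopwords such as "this" and "is" are dropped.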

If you need to keep all negations (that is, drop every negation from the stopword list), you can grep them out of the stopwords() list:

exceptions <- grep(pattern = "not|n't", x = stopwords(), value = TRUE)
# [1] "isn't"     "aren't"    "wasn't"    "weren't"   "hasn't"    "haven't"   "hadn't"    "doesn't"   "don't"     "didn't"   
# [11] "won't"     "wouldn't"  "shan't"    "shouldn't" "can't"     "cannot"    "couldn't"  "mustn't"   "not"
my_stopwords <- setdiff(stopwords("en"), exceptions)
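One thing to watch: the pattern "not|n't" does not match "no" or "nor", which, as far as I know, are also in the English list. If you count those as negations, a possible extension (my assumption, not part of the approach above) is to add them by hand. The sample sentence is made up for illustration, and removeWords is applied directly to a character vector here just to keep the check short.

```r
library(tm)

# Pattern-based negations, plus "no" and "nor" added by hand (an assumption
# about what you want to treat as a negation)
exceptions   <- union(grep("not|n't", stopwords("en"), value = TRUE),
                      c("no", "nor"))
my_stopwords <- setdiff(stopwords("en"), exceptions)

# Hypothetical sample sentence; removeWords also accepts a character vector
text    <- "i don't like this and it is no good, not the best"
cleaned <- removeWords(text, my_stopwords)

# Collapse the gaps left behind by the removed words
cleaned <- trimws(gsub("\\s+", " ", cleaned))
print(cleaned)
```

union() and setdiff() are harmless here even if a word turns out not to be in the list, so you can add any other words you want to keep in the same way.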