Package tm: how do I keep removeWords from removing certain "english" stopwords (negations specifically)?
I want to use removeWords with stopwords("english"), like this:
corpus <- tm_map(corpus, removeWords, stopwords("english"))
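However, I want to keep some words, such as "not", along with the other negations.
Is it possible to use removeWords with stopwords("english") while excluding specified words from that list?
For example, how can I prevent "not" from being removed?
(Secondary) Is it possible to set up this kind of control list for all negations?
I would rather not resort to building my own custom list containing only the words from the stop list that I am interested in.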
You can create a custom stopword list by taking the difference between stopwords("en") and the list of words you want to exclude:
exceptions <- c("not")
my_stopwords <- setdiff(stopwords("en"), exceptions)
If you need to take all negations out of the stopword list (so that they stay in your text), you can grep them from the stopwords() list:
exceptions <- grep(pattern = "not|n't", x = stopwords(), value = TRUE)
# [1] "isn't" "aren't" "wasn't" "weren't" "hasn't" "haven't" "hadn't" "doesn't" "don't" "didn't"
# [11] "won't" "wouldn't" "shan't" "shouldn't" "can't" "cannot" "couldn't" "mustn't" "not"
my_stopwords <- setdiff(stopwords("en"), exceptions)
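To tie this back to the original tm_map call, here is a minimal sketch of passing the filtered list to removeWords. The two sample sentences and the variable names docs and cleaned are made up for illustration; it also assumes a VCorpus built from a character vector and lower-case input, since removeWords matches case-sensitively.

```r
library(tm)

# Hypothetical toy documents, for illustration only
docs <- c("this is not a good idea", "they don't like it at all")
corpus <- VCorpus(VectorSource(docs))

# Negations we want to keep in the text
exceptions <- grep(pattern = "not|n't", x = stopwords("en"), value = TRUE)

# Standard English stopwords minus the negations
my_stopwords <- setdiff(stopwords("en"), exceptions)

# Same call as in the question, with the custom list swapped in
corpus <- tm_map(corpus, removeWords, my_stopwords)

# Pull the cleaned text back out as a plain character vector
cleaned <- sapply(corpus, as.character)
print(cleaned)
```

removeWords leaves the surrounding whitespace behind, so you may want to follow it with tm_map(corpus, stripWhitespace).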