How can I modify the NLTK stop word list in Python?

Question

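I'm new to the Python/programming community, so please excuse a relatively simple question: I want to filter out stop words before lemmatizing a CSV file, but I need the stop words "this"/"these" to be included in the final set.

After importing the NLTK stop words in Python and defining them as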

stopwords = set(stopwords.words('english'))

... how would I modify this set so that "this"/"these" are kept?

I know I could manually list every word except these two, but I'm looking for a more elegant solution.

Answer 1

If you want those stop words to be included in your final set, just remove them from the default stop word list:

from nltk.corpus import stopwords
new_stopwords = set(stopwords.words('english')) - {'this', 'these'}

Or,

to_remove = ['this', 'these']
new_stopwords = set(stopwords.words('english')).difference(to_remove)

set.difference accepts any iterable, so to_remove can be a list, tuple, or set.
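As a rough sketch of how the resulting set might be used before lemmatization, the snippet below filters a token list against a customized stop word set. The helper names build_stopwords and remove_stopwords are hypothetical, and the short base list is only a stand-in for stopwords.words('english') so the example runs without the NLTK corpus installed.

```python
def build_stopwords(base_words, keep):
    """Return base_words as a set, minus the words that should stay in the text."""
    # set.difference accepts any iterable, so `keep` can be a plain list
    return set(base_words).difference(keep)


def remove_stopwords(tokens, stop_set):
    """Drop every token whose lowercase form is in stop_set."""
    return [t for t in tokens if t.lower() not in stop_set]


# Assumption: a tiny stand-in for stopwords.words('english')
base = ['this', 'these', 'is', 'a', 'the', 'are']
custom = build_stopwords(base, keep=['this', 'these'])

tokens = ['This', 'is', 'a', 'test', 'these', 'are', 'the', 'words']
filtered = remove_stopwords(tokens, custom)
print(filtered)  # ['This', 'test', 'these', 'words']
```

With the real corpus, base would presumably be replaced by stopwords.words('english'); the filtering step stays the same.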
