Stopwords coming up in most influential words

Question

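I'm writing some NLP code to find the most influential words (positive or negative) in a survey. My problem is that, although I successfully added some extra stopwords to the NLTK stopword list, they keep coming up as influential words later on.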

So, I have a dataframe where the first column contains the scores and the second column contains the comments.

I add the extra stopwords:

stopwords = stopwords.words('english')
extra = ['Cat', 'Dog']
stopwords.extend(extra)

I checked that they were added by calling len on the list before and after.

I created this function to remove punctuation and stopwords from my comments:

def text_process(comment):
   nopunc = [char for char in comment if char not in string.punctuation]
   nopunc = ''.join(nopunc)
   return [word for word in nopunc.split() if word.lower() not in stopwords]

I run the model (I won't include the whole code, since it makes no difference):

corpus = df['Comment']
y = df['Label']
vectorizer = CountVectorizer(analyzer=text_process)
x = vectorizer.fit_transform(corpus)

...

Then I get the most influential words:

feature_to_coef = {word: coef for word, coef in zip(vectorizer.get_feature_names(), nb.coef_[0])}


for best_positive in sorted(
    feature_to_coef.items(), 
    key=lambda x: x[1], 
    reverse=True)[:20]:
    print (best_positive)

However, Cat and Dog are in the results.

What am I doing wrong? Any ideas?

Thanks a lot!

Answer 1

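It looks like this is because you capitalized the words 'Cat' and 'Dog'.

In your text_process function you have if word.lower() not in stopwords, which only works if the stopwords themselves are lowercase. Every word is lowercased before the lookup, so 'cat' is compared against 'Cat', never matches, and is kept as a feature.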
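
A minimal sketch of the fix, under a few assumptions: base_stopwords is a tiny hypothetical stand-in for stopwords.words('english') so the example runs without downloading the NLTK corpus, and stop_set is a name introduced here for illustration. The idea is simply to lowercase the extra words before they go into the stopword collection:

```python
import string

# Hypothetical stand-in for nltk's stopwords.words('english'),
# used only so this sketch runs without the NLTK corpus.
base_stopwords = ['the', 'a', 'is', 'and', 'my']

extra = ['Cat', 'Dog']

# The original problem: the lookup uses the lowercased word,
# but the list holds the capitalized form.
print('cat' in extra)  # False

# Lowercase every entry so it can match word.lower() below.
# A set also makes each membership test cheaper than a list.
stop_set = {w.lower() for w in base_stopwords + extra}


def text_process(comment):
    # Drop punctuation characters, then split on whitespace.
    nopunc = ''.join(ch for ch in comment if ch not in string.punctuation)
    # Keep only words whose lowercase form is not a stopword.
    return [word for word in nopunc.split() if word.lower() not in stop_set]


print(text_process("My Cat is lovely, and the dog barks!"))
# ['lovely', 'barks']
```

With the real NLTK list, the equivalent change would be extending it with [w.lower() for w in extra] instead of extra.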

Tags: python, nlp, nltk, sentiment-analysis