Why did NLTK NaiveBayes classifier misclassify one record?
This is my first time building a sentiment-analysis machine learning model using the nltk NaiveBayesClassifier in Python. I know the model is far too simple, but it's just a first step for me; I will try tokenized sentences next time.
The real problem with my current model is this: I have explicitly labeled the word 'bad' as negative in the training dataset (as you can see from the 'negative_vocab' variable). However, when I ran the NaiveBayesClassifier on each (lowercased) sentence in the list ['awesome movie', 'i like it', 'it is so bad'], the classifier mislabeled 'it is so bad' as positive.
Input:
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import names

positive_vocab = [ 'awesome', 'outstanding', 'fantastic', 'terrific', 'good', 'nice', 'great', ':)' ]
negative_vocab = [ 'bad', 'terrible','useless', 'hate', ':(' ]
neutral_vocab = [ 'movie','the','sound','was','is','actors','did','know','words','not','it','so','really' ]

def word_feats(words):
    return dict([(word, True) for word in words])

positive_features_1 = [(word_feats(positive_vocab), 'pos')]
negative_features_1 = [(word_feats(negative_vocab), 'neg')]
neutral_features_1 = [(word_feats(neutral_vocab), 'neu')]

train_set = negative_features_1 + positive_features_1 + neutral_features_1

classifier = NaiveBayesClassifier.train(train_set)

# Predict
neg = 0
pos = 0
sentence = "Awesome movie. I like it. It is so bad"
sentence = sentence.lower()
words = sentence.split('.')

def word_feat(word):
    return dict([(word, True)])

# NOTE THAT THE FUNCTION 'word_feat(word)' I WROTE HERE IS DIFFERENT FROM THE 'word_feats(words)' FUNCTION I DEFINED EARLIER. THIS FUNCTION IS USED TO ITERATE OVER EACH OF THE THREE ELEMENTS IN THE LIST ['awesome movie', ' i like it', ' it is so bad'].
for word in words:
    classResult = classifier.classify(word_feat(word))
    if classResult == 'neg':
        neg = neg + 1
    if classResult == 'pos':
        pos = pos + 1
    print(str(word) + ' is ' + str(classResult))
    print()
Output:
awesome movie is pos
i like it is pos
it is so bad is pos
To make sure the function 'word_feat(word)' iterates over each sentence rather than over each word or letter, I wrote some diagnostic code to see what each element in 'word_feat(word)' is:
for word in words:
    print(word_feat(word))
And it prints:
{'awesome movie': True}
{' i like it': True}
{' it is so bad': True}
So the function 'word_feat(word)' seems to be correct?
Does anyone know why the classifier classified 'It is so bad' as positive? As mentioned before, I explicitly labeled the word 'bad' as negative in my training data.
This particular failure happens because your word_feats() function expects a list of words (a tokenized sentence), but you pass each word to it individually... so word_feats() iterates over its letters. You've built a classifier that classifies strings as positive or negative based on the letters they contain.
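The letter-splitting pitfall is easy to reproduce in isolation, using the word_feats() definition from the question:

```python
def word_feats(words):
    return dict([(word, True) for word in words])

# Passing a bare string instead of a token list iterates over its characters:
print(word_feats('bad'))    # {'b': True, 'a': True, 'd': True}

# Passing a list of tokens gives word-level features, as intended:
print(word_feats(['bad']))  # {'bad': True}
```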
You are probably in this predicament because you paid no attention to naming your variables. In your main loop, none of the variables sentence, words, or word contain what their names claim. To understand and improve your program, start by naming things correctly.
Bugs aside, this is not how you build a sentiment classifier. The training data should be a list of tokenized sentences (each labeled with its sentiment), not a list of individual words. Similarly, you classify tokenized sentences.
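A sketch of the intended training-data shape (the sentences and labels here are made up for illustration):

```python
def word_feats(words):
    return dict([(word, True) for word in words])

# Hypothetical labeled, tokenized sentences -- one (tokens, sentiment) pair each
labeled_sentences = [
    (['awesome', 'movie'], 'pos'),
    (['i', 'like', 'it'], 'pos'),
    (['it', 'is', 'so', 'bad'], 'neg'),
]

# Each training item pairs a feature dict with its label
train_set = [(word_feats(tokens), label) for tokens, label in labeled_sentences]
print(train_set[2])  # ({'it': True, 'is': True, 'so': True, 'bad': True}, 'neg')
```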
Let me show you a rewrite of your code. All I change near the top is adding import re, since it is easier to tokenize with regexps. Everything else up to the definition of classifier is the same as your code.
I added one more test case (something really negative), but more importantly I used proper variable names, so that it is harder to get confused about what is going on:
test_data = "Awesome movie. I like it. It is so bad. I hate this terrible useless movie."
sentences = test_data.lower().split('.')
So sentences now contains 4 strings, one sentence each.
I did not change your word_feat() function.
To use the classifier, I did a fairly major rewrite:
for sentence in sentences:
    if(len(sentence) == 0):continue
    neg = 0
    pos = 0
    for word in re.findall(r"[\w']+", sentence):
        classResult = classifier.classify(word_feat(word))
        print(word, classResult)
        if classResult == 'neg':
            neg = neg + 1
        if classResult == 'pos':
            pos = pos + 1
    print("\n%s: %d vs -%d\n"%(sentence,pos,neg))
The outer loop is again descriptive, so that sentence contains one sentence.
Then I have an inner loop where we classify each word in the sentence; I use a regexp to split the sentence on whitespace and punctuation:
for word in re.findall(r"[\w']+", sentence):
    classResult = classifier.classify(word_feat(word))
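What the regexp does, shown on one of the test sentences:

```python
import re

# [\w']+ matches runs of word characters (and apostrophes), skipping
# whitespace and punctuation, so the trailing period is dropped:
print(re.findall(r"[\w']+", " it is so bad."))  # ['it', 'is', 'so', 'bad']
```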
The rest is just basic adding up and reporting. I get this output:
awesome pos
movie neu
awesome movie: 1 vs -0
i pos
like pos
it pos
i like it: 3 vs -0
it pos
is neu
so pos
bad neg
it is so bad: 2 vs -1
i pos
hate neg
this pos
terrible neg
useless neg
movie neu
i hate this terrible useless movie: 2 vs -3
I'm still where you were: "it is so bad" is considered positive. With the extra debug lines, we can see that it is because "it" and "so" are considered positive words, and "bad" is the only negative word, so overall it comes out positive.
I suspect this is because it hadn't seen those words in the training data.
...yes, if I add "it" and "so" to the list of neutral words, I get "it is so bad: 0 vs -1".
As things to try next, I'd recommend:
- Try with more training data; toy examples like this carry the risk that the noise will swamp the signal.
- Look into removing stop words.
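A minimal sketch of stop-word filtering; the stop-word set below is hand-picked for this example (NLTK's nltk.corpus.stopwords.words('english') provides a fuller list):

```python
# Hand-picked stop words, an assumption for this sketch
STOPWORDS = {'i', 'it', 'is', 'so', 'the', 'this'}

words = "it is so bad".lower().split()
content_words = [w for w in words if w not in STOPWORDS]
print(content_words)  # ['bad'] -- only the sentiment-bearing word remains
```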
Here is the modified code for you:
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import names
from nltk.corpus import stopwords

positive_vocab = [ 'awesome', 'outstanding', 'fantastic', 'terrific', 'good', 'nice', 'great', ':)' ]
negative_vocab = [ 'bad', 'terrible','useless', 'hate', ':(' ]
neutral_vocab = [ 'movie','the','sound','was','is','actors','did','know','words','not','it','so','really' ]

def word_feats(words):
    return dict([(word, True) for word in words])

positive_features_1 = [(word_feats(positive_vocab), 'pos')]
negative_features_1 = [(word_feats(negative_vocab), 'neg')]
neutral_features_1 = [(word_feats(neutral_vocab), 'neu')]

train_set = negative_features_1 + positive_features_1 + neutral_features_1

classifier = NaiveBayesClassifier.train(train_set)

# Predict
neg = 0
pos = 0
sentence = "Awesome movie. I like it. It is so bad."
sentence = sentence.lower()
sentences = sentence.split('.')  # this is actually a list of sentences

for sent in sentences:
    if sent != "":
        words = [word for word in sent.split(" ") if word not in stopwords.words('english')]
        classResult = classifier.classify(word_feats(words))
        if classResult == 'neg':
            neg = neg + 1
        if classResult == 'pos':
            pos = pos + 1
        print(str(sent) + ' --> ' + str(classResult))
        print()
I modified the places where you treated a 'list of words' as input to the classifier. Instead you need to pass the sentences one by one, which means you need to pass a 'list of sentences'.
Moreover, for each sentence, you need to pass its 'words as features', which means you need to split the sentence on whitespace characters.
Also, if you want your classifier to work properly for sentiment analysis, you need to give less weight to "stop-words" such as "it", "they", "is", etc., because these words are not sufficient to decide whether a sentence is positive, negative or neutral.
The above code gives this output:
awesome movie --> pos
i like it --> pos
it is so bad --> neg
For any classifier, the input format should be the same for training and for prediction. While training you provided a list of words; try the same method to transform your test set as well.
You can try this code:
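The point about matching formats can be sketched without NLTK: whatever feature extractor you apply to the training vocabulary must also be applied to the test tokens, so that both sides produce the same feature dicts:

```python
def word_feats(words):
    return dict([(word, True) for word in words])

# Same extractor at training time and at prediction time
train_feats = word_feats(['awesome', 'movie'])      # training side
test_feats = word_feats('awesome movie'.split())    # prediction side
print(train_feats == test_feats)  # True
```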
from nltk.classify import NaiveBayesClassifier

def word_feats(words):
    return dict([(word, True) for word in words])

positive_vocab = [ 'awesome', 'outstanding', 'fantastic','terrific','good','nice','great', ':)','love' ]
negative_vocab = [ 'bad', 'terrible','useless','hate',':(','kill','steal']
neutral_vocab = [ 'movie','the','sound','was','is','actors','did','know','words','not' ]

positive_features = [(word_feats(pos), 'pos') for pos in positive_vocab]
negative_features = [(word_feats(neg), 'neg') for neg in negative_vocab]
neutral_features = [(word_feats(neu), 'neu') for neu in neutral_vocab]

train_set = negative_features + positive_features + neutral_features

classifier = NaiveBayesClassifier.train(train_set)

# Predict
neg = 0
pos = 0
sentence = " Awesome movie, I like it :)"
sentence = sentence.lower()
words = sentence.split(' ')

for word in words:
    classResult = classifier.classify(word_feats(word))
    if classResult == 'neg':
        neg = neg + 1
    if classResult == 'pos':
        pos = pos + 1

print('Positive: ' + str(float(pos)/len(words)))
print('Negative: ' + str(float(neg)/len(words)))
The result is:
Positive: 0.7142857142857143
Negative: 0.14285714285714285