Why did NLTK NaiveBayes classifier misclassify one record?
This is my first time building a sentiment-analysis machine learning model using the nltk NaiveBayesClassifier in Python. I know the model is far too simple, but it's just a first step for me; I will try tokenized sentences next time.
The real problem with my current model is this: I have explicitly labeled the word 'bad' as negative in the training dataset (as you can see from the 'negative_vocab' variable). However, when I ran the NaiveBayesClassifier on each (lowercased) sentence in the list ['awesome movie', 'i like it', 'it is so bad'], the classifier mislabeled 'it is so bad' as positive.
Input:
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import names

positive_vocab = [ 'awesome', 'outstanding', 'fantastic', 'terrific', 'good', 'nice', 'great', ':)' ]
negative_vocab = [ 'bad', 'terrible','useless', 'hate', ':(' ]
neutral_vocab = [ 'movie','the','sound','was','is','actors','did','know','words','not','it','so','really' ]

def word_feats(words):
    return dict([(word, True) for word in words])

positive_features_1 = [(word_feats(positive_vocab), 'pos')]
negative_features_1 = [(word_feats(negative_vocab), 'neg')]
neutral_features_1 = [(word_feats(neutral_vocab), 'neu')]

train_set = negative_features_1 + positive_features_1 + neutral_features_1

classifier = NaiveBayesClassifier.train(train_set)

# Predict
neg = 0
pos = 0
sentence = "Awesome movie. I like it. It is so bad"
sentence = sentence.lower()
words = sentence.split('.')

def word_feat(word):
    return dict([(word, True)])

# NOTE THAT THE FUNCTION 'word_feat(word)' I WROTE HERE IS DIFFERENT FROM THE 'word_feats(words)' FUNCTION I DEFINED EARLIER. THIS FUNCTION IS USED TO ITERATE OVER EACH OF THE THREE ELEMENTS IN THE LIST ['awesome movie', ' i like it', ' it is so bad'].
for word in words:
    classResult = classifier.classify(word_feat(word))
    if classResult == 'neg':
        neg = neg + 1
    if classResult == 'pos':
        pos = pos + 1
    print(str(word) + ' is ' + str(classResult))
    print()
Output:
awesome movie is pos
i like it is pos
it is so bad is pos
To make sure the function 'word_feat(word)' iterates over each sentence rather than over each word or letter, I wrote some diagnostic code to see what each element in 'word_feat(word)' is:
for word in words:
    print(word_feat(word))
And it prints:
{'awesome movie': True}
{' i like it': True}
{' it is so bad': True}
So the function 'word_feat(word)' seems to be correct?
Does anyone know why the classifier classified 'It is so bad' as positive? As mentioned before, I explicitly labeled the word 'bad' as negative in my training data.
This particular failure happens because your word_feats() function expects a list of words (a tokenized sentence), but you pass each word to it individually... so word_feats() iterates over its letters. You've built a classifier that classifies strings as positive or negative based on the letters they contain.
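The letter-splitting pitfall is easy to reproduce in isolation, using the word_feats() definition from the question:

```python
def word_feats(words):
    return dict([(word, True) for word in words])

# Passing a bare string instead of a token list iterates over its characters:
print(word_feats('bad'))    # {'b': True, 'a': True, 'd': True}

# Passing a list of tokens gives word-level features, as intended:
print(word_feats(['bad']))  # {'bad': True}
```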
You are probably in this predicament because you paid no attention to naming your variables. In your main loop, none of the variables sentence, words, or word contain what their names claim. To understand and improve your program, start by naming things correctly.
Bugs aside, this is not how you build a sentiment classifier. The training data should be a list of tokenized sentences (each labeled with its sentiment), not a list of individual words. Similarly, you classify tokenized sentences.
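A sketch of the intended training-data shape (the sentences and labels here are made up for illustration):

```python
def word_feats(words):
    return dict([(word, True) for word in words])

# Hypothetical labeled, tokenized sentences -- one (tokens, sentiment) pair each
labeled_sentences = [
    (['awesome', 'movie'], 'pos'),
    (['i', 'like', 'it'], 'pos'),
    (['it', 'is', 'so', 'bad'], 'neg'),
]

# Each training item pairs a feature dict with its label
train_set = [(word_feats(tokens), label) for tokens, label in labeled_sentences]
print(train_set[2])  # ({'it': True, 'is': True, 'so': True, 'bad': True}, 'neg')
```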
Let me show you a rewrite of your code. All I change near the top is adding import re, since it is easier to tokenize with regexps. Everything else up to the definition of classifier is the same as your code.
I added one more test case (something really negative), but more importantly I used proper variable names, so that it is harder to get confused about what is going on:
test_data = "Awesome movie. I like it. It is so bad. I hate this terrible useless movie."
sentences = test_data.lower().split('.')
So sentences now contains 4 strings, one sentence each.
I did not change your word_feat() function.
To use the classifier, I did a fairly major rewrite:
for sentence in sentences:
    if(len(sentence) == 0):continue
    neg = 0
    pos = 0
    for word in re.findall(r"[\w']+", sentence):
        classResult = classifier.classify(word_feat(word))
        print(word, classResult)
        if classResult == 'neg':
            neg = neg + 1
        if classResult == 'pos':
            pos = pos + 1
    print("\n%s: %d vs -%d\n"%(sentence,pos,neg))
The outer loop is again descriptive, so that sentence contains one sentence.
Then I have an inner loop where we classify each word in the sentence; I use a regexp to split the sentence on whitespace and punctuation:
for word in re.findall(r"[\w']+", sentence):
    classResult = classifier.classify(word_feat(word))
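What the regexp does, shown on one of the test sentences:

```python
import re

# [\w']+ matches runs of word characters (and apostrophes), skipping
# whitespace and punctuation, so the trailing period is dropped:
print(re.findall(r"[\w']+", " it is so bad."))  # ['it', 'is', 'so', 'bad']
```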
The rest is just basic adding up and reporting. I get this output:
awesome pos
movie neu
awesome movie: 1 vs -0
i pos
like pos
it pos
i like it: 3 vs -0
it pos
is neu
so pos
bad neg
it is so bad: 2 vs -1
i pos
hate neg
this pos
terrible neg
useless neg
movie neu
i hate this terrible useless movie: 2 vs -3
I'm still where you were: "it is so bad" is considered positive. With the extra debug lines, we can see that it is because "it" and "so" are considered positive words, and "bad" is the only negative word, so overall it comes out positive.
I suspect this is because it hadn't seen those words in the training data.
...yes, if I add "it" and "so" to the list of neutral words, I get "it is so bad: 0 vs -1".
As things to try next, I'd recommend:
- Try with more training data; toy examples like this carry the risk that the noise will swamp the signal.
- Look into removing stop words.
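A minimal sketch of stop-word filtering; the stop-word set below is hand-picked for this example (NLTK's nltk.corpus.stopwords.words('english') provides a fuller list):

```python
# Hand-picked stop words, an assumption for this sketch
STOPWORDS = {'i', 'it', 'is', 'so', 'the', 'this'}

words = "it is so bad".lower().split()
content_words = [w for w in words if w not in STOPWORDS]
print(content_words)  # ['bad'] -- only the sentiment-bearing word remains
```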
Here is the modified code for you:
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import names
from nltk.corpus import stopwords

positive_vocab = [ 'awesome', 'outstanding', 'fantastic', 'terrific', 'good', 'nice', 'great', ':)' ]
negative_vocab = [ 'bad', 'terrible','useless', 'hate', ':(' ]
neutral_vocab = [ 'movie','the','sound','was','is','actors','did','know','words','not','it','so','really' ]

def word_feats(words):
    return dict([(word, True) for word in words])

positive_features_1 = [(word_feats(positive_vocab), 'pos')]
negative_features_1 = [(word_feats(negative_vocab), 'neg')]
neutral_features_1 = [(word_feats(neutral_vocab), 'neu')]

train_set = negative_features_1 + positive_features_1 + neutral_features_1

classifier = NaiveBayesClassifier.train(train_set)

# Predict
neg = 0
pos = 0
sentence = "Awesome movie. I like it. It is so bad."
sentence = sentence.lower()
sentences = sentence.split('.')  # this is actually a list of sentences

for sent in sentences:
    if sent != "":
        words = [word for word in sent.split(" ") if word not in stopwords.words('english')]
        classResult = classifier.classify(word_feats(words))
        if classResult == 'neg':
            neg = neg + 1
        if classResult == 'pos':
            pos = pos + 1
        print(str(sent) + ' --> ' + str(classResult))
        print()
I modified the places where you treated a 'list of words' as input to the classifier. Instead you need to pass the sentences one by one, which means you need to pass a 'list of sentences'.
Moreover, for each sentence, you need to pass its 'words as features', which means you need to split the sentence on whitespace characters.
Also, if you want your classifier to work properly for sentiment analysis, you need to give less weight to "stop-words" such as "it", "they", "is", etc., because these words are not sufficient to decide whether a sentence is positive, negative or neutral.
The above code gives this output:
awesome movie --> pos
i like it --> pos
it is so bad --> neg
For any classifier, the input format should be the same for training and for prediction. While training you provided a list of words; try the same method to transform your test set as well.
You can try this code:
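The point about matching formats can be sketched without NLTK: whatever feature extractor you apply to the training vocabulary must also be applied to the test tokens, so that both sides produce the same feature dicts:

```python
def word_feats(words):
    return dict([(word, True) for word in words])

# Same extractor at training time and at prediction time
train_feats = word_feats(['awesome', 'movie'])      # training side
test_feats = word_feats('awesome movie'.split())    # prediction side
print(train_feats == test_feats)  # True
```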
from nltk.classify import NaiveBayesClassifier

def word_feats(words):
    return dict([(word, True) for word in words])

positive_vocab = [ 'awesome', 'outstanding', 'fantastic','terrific','good','nice','great', ':)','love' ]
negative_vocab = [ 'bad', 'terrible','useless','hate',':(','kill','steal']
neutral_vocab = [ 'movie','the','sound','was','is','actors','did','know','words','not' ]

positive_features = [(word_feats(pos), 'pos') for pos in positive_vocab]
negative_features = [(word_feats(neg), 'neg') for neg in negative_vocab]
neutral_features = [(word_feats(neu), 'neu') for neu in neutral_vocab]

train_set = negative_features + positive_features + neutral_features

classifier = NaiveBayesClassifier.train(train_set)

# Predict
neg = 0
pos = 0
sentence = " Awesome movie, I like it :)"
sentence = sentence.lower()
words = sentence.split(' ')

for word in words:
    classResult = classifier.classify(word_feats(word))
    if classResult == 'neg':
        neg = neg + 1
    if classResult == 'pos':
        pos = pos + 1

print('Positive: ' + str(float(pos)/len(words)))
print('Negative: ' + str(float(neg)/len(words)))
The result is:
Positive: 0.7142857142857143
Negative: 0.14285714285714285