
nltk naivebayes classifier for text classification

In the code below, I know my NaiveBayes classifier works correctly because it behaves as expected on trainset1, but why doesn't it work on trainset2? I even tried two classifiers, one from TextBlob and the other directly from nltk.

from textblob.classifiers import NaiveBayesClassifier
from textblob import TextBlob
from nltk.tokenize import word_tokenize
import nltk

trainset1 = [('I love this sandwich.', 'pos'),
('This is an amazing place!', 'pos'),
('I feel very good about these beers.', 'pos'),
('This is my best work.', 'pos'),
("What an awesome view", 'pos'),
('I do not like this restaurant', 'neg'),
('I am tired of this stuff.', 'neg'),
("I can't deal with this", 'neg'),
('He is my sworn enemy!', 'neg'),
('My boss is horrible.', 'neg')]

trainset2 = [('hide all brazil and everything plan limps to anniversary inflation plan initiallyis limping its first anniversary amid soaring prices', 'class1'),
         ('hello i was there and no one came', 'class2'),
         ('all negative terms like sad angry etc', 'class2')]

def nltk_naivebayes(trainset, test_sentence):
    # Build the vocabulary from all lowercased training tokens.
    all_words = set(word.lower() for passage in trainset for word in word_tokenize(passage[0]))
    # One boolean feature per vocabulary word; lowercase the training tokens
    # so they actually match the lowercased vocabulary.
    t = [({word: (word in set(w.lower() for w in word_tokenize(x[0])))
           for word in all_words}, x[1]) for x in trainset]
    classifier = nltk.NaiveBayesClassifier.train(t)
    # The test features must use the same keys as the training features.
    test_sent_features = {word: (word in word_tokenize(test_sentence.lower()))
                          for word in all_words}
    return classifier.classify(test_sent_features)

def textblob_naivebayes(trainset, test_sentence):
    cl = NaiveBayesClassifier(trainset)
    blob = TextBlob(test_sentence, classifier=cl)
    return blob.classify()

test_sentence1 = "he is my horrible enemy"
test_sentence2 = "inflation soaring limps to anniversary"

print(nltk_naivebayes(trainset1, test_sentence1))
print(nltk_naivebayes(trainset2, test_sentence2))
print(textblob_naivebayes(trainset1, test_sentence1))
print(textblob_naivebayes(trainset2, test_sentence2))

Output:

neg
class2
neg
class2

even though test_sentence2 clearly belongs to class1.

I assume you understand that you cannot expect a classifier to learn a good model from only 3 examples, and that your question is more about understanding why it behaves this way on this particular example.

The likely reason it does this is that the Naive Bayes classifier uses prior class probabilities: the probability of class2 vs. class1, independent of the text. In your case, 2/3 of the examples are class2, so the prior is 66% for class2 and 33% for class1. The class1 words in your single class1 instance are 'anniversary' and 'soaring', and they are unlikely to be enough to compensate for this prior class probability.
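You can verify this prior imbalance directly from the training data. The sketch below is plain Python (no nltk needed) and simply counts labels in trainset2, copied from the question:

```python
from collections import Counter

# trainset2 from the question: one class1 example, two class2 examples.
trainset2 = [
    ('hide all brazil and everything plan limps to anniversary inflation plan initiallyis limping its first anniversary amid soaring prices', 'class1'),
    ('hello i was there and no one came', 'class2'),
    ('all negative terms like sad angry etc', 'class2'),
]

# Empirical prior: fraction of training examples carrying each label.
label_counts = Counter(label for _, label in trainset2)
priors = {label: count / len(trainset2) for label, count in label_counts.items()}
# class2 starts with a 2/3 prior, class1 with only 1/3, before any
# word of the test sentence is even looked at.
print(priors)
```

So before the word likelihoods are applied, class2 already has twice the weight of class1.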

In particular, note that the computation of word probabilities involves various 'smoothing' functions (for instance, log10(term frequency + 1) in each class rather than log10(term frequency)) to prevent low-frequency words from having too large an impact on the classification result, to avoid division by zero, and so on. So the probabilities of "anniversary" and "soaring" are not 0.0 for class2 and not 1.0 for class1, unlike what you might expect.
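To make the smoothing point concrete, here is a minimal sketch of add-one (Laplace) smoothing for a word likelihood. This is an illustration, not nltk's exact internals (nltk's `NaiveBayesClassifier.train` defaults to expected likelihood estimation, i.e. add-0.5 smoothing); the counts below are made up for the example:

```python
def smoothed_word_prob(count_in_class, total_words_in_class, vocab_size, alpha=1.0):
    """Laplace-smoothed estimate P(word | class).

    alpha=1.0 is add-one smoothing; every word in the vocabulary gets a
    small nonzero probability even if it never occurred in the class.
    """
    return (count_in_class + alpha) / (total_words_in_class + alpha * vocab_size)

# Hypothetical numbers: 'anniversary' occurs 0 times in class2, which
# contains 11 tokens, over a 30-word vocabulary.
p = smoothed_word_prob(0, 11, 30)
# p is small but strictly positive, so an unseen word never forces the
# class probability to zero.
```

This is why a word that appears only in class1 still contributes a nonzero likelihood to class2, and the 2/3 prior can win.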