How to predict Sentiments after training and testing the model by using NLTK NaiveBayesClassifier in Python?

I am doing sentiment classification with the NLTK NaiveBayesClassifier. I trained and tested the model on labeled data, and now I want to predict the sentiment of unlabeled data. However, I run into an error. The line that gives the error is:

score_1 = analyzer.evaluate(list(zip(new_data['Articles'])))

The error is:

ValueError: not enough values to unpack (expected 2, got 1)

The code is as follows:

import random
import pandas as pd
data = pd.read_csv("label data for testing .csv", header=0)
sentiment_data = list(zip(data['Articles'], data['Sentiment']))
random.shuffle(sentiment_data)
new_data = pd.read_csv("Japan Data.csv", header=0)
train_x, train_y = zip(*sentiment_data[:350])
test_x, test_y = zip(*sentiment_data[350:])

from unidecode import unidecode
from nltk import word_tokenize
from nltk.classify import NaiveBayesClassifier
from nltk.sentiment import SentimentAnalyzer
from nltk.sentiment.util import extract_unigram_feats

TRAINING_COUNT = 350


def clean_text(text):
    text = text.replace("<br />", " ")

    return text


analyzer = SentimentAnalyzer()
vocabulary = analyzer.all_words([(word_tokenize(unidecode(clean_text(instance))))
                                 for instance in train_x[:TRAINING_COUNT]])
print("Vocabulary: ", len(vocabulary))

print("Computing Unigran Features ...")

unigram_features = analyzer.unigram_word_feats(vocabulary, min_freq=10)

print("Unigram Features: ", len(unigram_features))

analyzer.add_feat_extractor(extract_unigram_feats, unigrams=unigram_features)

# Build the training set
_train_X = analyzer.apply_features([(word_tokenize(unidecode(clean_text(instance))))
                                    for instance in train_x[:TRAINING_COUNT]], labeled=False)

# Build the test set
_test_X = analyzer.apply_features([(word_tokenize(unidecode(clean_text(instance))))
                                   for instance in test_x], labeled=False)

trainer = NaiveBayesClassifier.train
classifier = analyzer.train(trainer, zip(_train_X, train_y[:TRAINING_COUNT]))

score = analyzer.evaluate(list(zip(_test_X, test_y)))
print("Accuracy: ", score['Accuracy'])

score_1 = analyzer.evaluate(list(zip(new_data['Articles'])))
print(score_1)

I know the problem arises because I have to pass two arguments in the failing line, but I don't know how to do that.

Thanks in advance.

Documentation and examples

The line that gives you the error calls the method SentimentAnalyzer.evaluate(...). This method does the following.

Evaluate and print classifier performance on the test set.

SentimentAnalyzer.evaluate

The method has one mandatory argument: test_set.

test_set – A list of (tokens, label) tuples to use as gold set.

In the example at http://www.nltk.org/howto/sentiment.html, test_set has the following structure:

[({'contains(,)': False, 'contains(.)': True, 'contains(and)': False, 'contains(the)': True}, 'subj'), ({'contains(,)': True, 'contains(.)': True, 'contains(and)': False, 'contains(the)': True}, 'subj'), ...]

Here is a symbolic representation of that structure.

[(dictionary,label), ... , (dictionary,label)]

The error in your code

You pass

list(zip(new_data['Articles']))

to SentimentAnalyzer.evaluate. I assume you get the error because

list(zip(new_data['Articles']))

does not create a list of (tokens, label) tuples. You can check this by creating a variable that holds the list and printing it, or by inspecting the variable's value while debugging. For example:

test_set = list(zip(new_data['Articles']))
print("begin test_set")
print(test_set)
print("end test_set")

You already call evaluate correctly three lines above the line that gives the error:

score = analyzer.evaluate(list(zip(_test_X, test_y)))

I guess that you want to call SentimentAnalyzer.classify(instance) in order to predict the unlabeled data. See SentimentAnalyzer.classify.
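
A minimal sketch of how that could look, reusing the names from your question (analyzer, clean_text, new_data); the column name 'Predicted Sentiment' is only illustrative:

# Predict a label for every unlabeled article with the trained analyzer.
predictions = []
for article in new_data['Articles']:
    # Apply the same preprocessing as for the training data; classify()
    # then applies the stored unigram feature extractor internally.
    tokens = word_tokenize(unidecode(clean_text(article)))
    predictions.append(analyzer.classify(tokens))

# Attach the predictions to the unlabeled data (illustrative column name).
new_data['Predicted Sentiment'] = predictions
print(new_data[['Articles', 'Predicted Sentiment']].head())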