Iterate Naive Bayes classifier over a list of strings

Question

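This is an NLP problem that I hope someone can help me with. Specifically, I'm attempting sentiment analysis.

I have a Naive Bayes classifier that has been trained on a well-known dataset of tweets labeled as positive or negative: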

import random

from nltk import NaiveBayesClassifier

# Convert each token list into the {token: True} dict the NB classifier expects
def get_tweets_for_model(cleaned_tokens_list):
    for tweet_tokens in cleaned_tokens_list:
        yield {token: True for token in tweet_tokens}

pos_model_tokens = get_tweets_for_model(pos_clean_token)
neg_model_tokens = get_tweets_for_model(neg_clean_token)

# Prepare training data as (feature dict, label) pairs
positive_dataset = [(tweet_dict, "Positive")
                    for tweet_dict in pos_model_tokens]
negative_dataset = [(tweet_dict, "Negative")
                    for tweet_dict in neg_model_tokens]

dataset = positive_dataset + negative_dataset

# Shuffle so all positive tweets aren't first
random.shuffle(dataset)

# Set apart 7000 for training, 3000 for testing
train_data = dataset[:7000]
test_data = dataset[7000:]

# Train model
classifier = NaiveBayesClassifier.train(train_data)
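
To make the feature format concrete, here is a minimal sketch of what get_tweets_for_model yields. The sample token lists are made up for illustration; the real pos_clean_token and neg_clean_token come from the tweet preprocessing, which is not shown in this post.

```python
def get_tweets_for_model(cleaned_tokens_list):
    """Yield one {token: True} feature dict per tokenized tweet."""
    for tweet_tokens in cleaned_tokens_list:
        yield {token: True for token in tweet_tokens}

# Hypothetical sample data: each tweet is already a list of word tokens.
sample_tokens = [["great", "day"], ["so", "tired", "today"]]

# The function returns a generator, so it can only be consumed once;
# materialize it into a list if it is needed more than once.
feature_dicts = list(get_tweets_for_model(sample_tokens))
labeled = [(features, "Positive") for features in feature_dicts]

print(feature_dicts[0])
print(labeled[1])
```

Each training example is a (feature dict, label) pair whose keys are whole word tokens, so anything passed to classifier.classify later would need to be a dict keyed the same way.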

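Using this model, I want to iterate over a list of test data and increase a tally for each token, depending on whether it is classified as positive or negative. The test data is a list of strings taken from a dataset of text messages.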

print(messages[-5:])
>>>["I'm outside, waiting.", 'Have a great day :) See you soon!', "I'll be at work so I can't make it, sry!", 'Are you doing anything this weekend?', 'Thanks for dropping that stuff off :)']

I can get the classification of a single message:

print(classifier.classify(dict([message, True] for message in messages[65])))
>>>Positive

I can return a boolean for whether the classification is negative or positive:

neg = (classifier.classify(dict([message, True] for message in messages[65])) == "Negative")

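That message is positive, so neg is set to False. So I want to iterate over all the messages in the messages list and increase the positive counter if a message is positive, or the negative counter if it is negative. But my attempts either increase the positive counter by only 1, or increase only the positive counter for the whole set of tokens, even though the classifier does return "Negative" for individual tokens. Here is what I have tried: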

positive_tally = 0
negative_tally = 0

#increments positive_tally by 1
if (classifier.classify(dict([message, True] for message in messages)) == "Positive") == True:
    positive_tally += 1
else:
    negative_tally += 1

#increments positive_tally by 3749 (length of messages list)
for token in tokens:
    if (classifier.classify(dict([message, True] for message in messages)) == "Positive") == True:
        positive_tally += 1
    else:
        negative_tally += 1

Any thoughts on this? I'd really appreciate it. I can provide more information if needed.

Answer 1

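OK, I figured it out. Posting for posterity in case anyone else runs into a similar problem.

Basically, the classifier takes a string and evaluates each word in the string to make its classification. But I wanted to iterate over a list of strings. So instead of what I had been trying...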

#didn't get what I wanted
for message in messages:
    if (classifier.classify(dict([message, True] for message in messages))) == "Positive":
        positive_tally += 1
    else: negative_tally += 1

...which tried (and failed) to classify each message as a whole string, I had to make sure it was checking each word within each message:

#works and increases tally as desired!
for message in messages:
    if classifier.classify(dict([token, True] for token in message)) == "Positive":
        us_pos_tally += 1
    else:
        us_neg_tally += 1

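So you go from the list level to the string level in for message in messages, and then from the string level to the token level in the classifier call: dict([token, True] for token in message). Note that if each message is still a plain str, as in the sample output in the question, iterating over it yields single characters rather than words, so the message should first be tokenized the same way the training tweets were.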
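
The per-message loop can be sketched in a self-contained form as follows. Two things here are assumptions rather than part of the original code: StubClassifier is a hypothetical stand-in for the trained NaiveBayesClassifier so that the example runs without NLTK, and simple_tokenize is a deliberately crude lowercase whitespace split, not the cleaning pipeline that produced the tweet tokens. The tokenizing step is there because the training features are keyed by word tokens, while iterating directly over a Python str yields characters.

```python
from collections import Counter

class StubClassifier:
    """Hypothetical stand-in exposing the same classify(features) call."""
    NEGATIVE_WORDS = {"can't", "sry", "sorry"}

    def classify(self, features):
        # Label a message Negative if any negative word is among its features.
        if any(word in features for word in self.NEGATIVE_WORDS):
            return "Negative"
        return "Positive"

def simple_tokenize(message):
    """Crude tokenizer (assumption): lowercase, split on whitespace, trim punctuation."""
    words = (word.strip(".,!?") for word in message.lower().split())
    return [word for word in words if word]

def tally_labels(classifier, messages, tokenize):
    """Classify each message separately and count how often each label occurs."""
    tally = Counter()
    for message in messages:
        # Build the feature dict from this one message only, keyed by word token.
        features = {token: True for token in tokenize(message)}
        tally[classifier.classify(features)] += 1
    return tally

messages = [
    "I'm outside, waiting.",
    "Have a great day :) See you soon!",
    "I'll be at work so I can't make it, sry!",
]

# Iterating a str directly gives characters, which is why tokenizing matters.
print(list("Hi!"))

tally = tally_labels(StubClassifier(), messages, simple_tokenize)
print(tally["Positive"], tally["Negative"])
```

With the real model, the trained classifier would be passed in place of StubClassifier(), and tokenize would ideally be the same tokenizing and cleaning function used on the tweets, so that the feature keys match what the model saw during training.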

python

nltk

sentiment-analysis

naivebayes