NLTK Naive Bayes Classifier Training Issues
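I am trying to train a classifier for tweets. The problem is that it reports 100% accuracy, and the list of most informative features does not display anything. Does anyone know what I am doing wrong? I believe all my inputs to the classifier are correct, so I have no idea where it is going wrong.
This is the dataset I am using:
http://thinknook.com/wp-content/uploads/2012/09/Sentiment-Analysis-Dataset.zip
This is my code: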
import nltk
import random
file = open('Train/train.txt', 'r')
documents = []
all_words = [] #TODO remove punctuation?
INPUT_TWEETS = 3000
print("Preprocessing...")
for line in (file):
    # Tokenize Tweet content
    tweet_words = nltk.word_tokenize(line[2:])
    sentiment = ""
    if line[0] == 0:
        sentiment = "negative"
    else:
        sentiment = "positive"
    documents.append((tweet_words, sentiment))
    for word in tweet_words:
        all_words.append(word.lower())
    INPUT_TWEETS = INPUT_TWEETS - 1
    if INPUT_TWEETS == 0:
        break
random.shuffle(documents)
all_words = nltk.FreqDist(all_words)
word_features = list(all_words.keys())[:3000] #top 3000 words
def find_features(document):
    words = set(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)
    return features
#Categorize as positive or Negative
feature_set = [(find_features(all_words), sentiment) for (all_words, sentment) in documents]
training_set = feature_set[:1000]
testing_set = feature_set[1000:]
print("Training...")
classifier = nltk.NaiveBayesClassifier.train(training_set)
print("Naive Bayes Accuracy:", (nltk.classify.accuracy(classifier,testing_set))*100)
classifier.show_most_informative_features(15)
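There is a typo in your code:
feature_set = [(find_features(all_words), sentiment) for (all_words, sentment) in documents]
The loop variable is spelled sentment, so sentiment inside the comprehension always has the same value (namely the one left over from the last tweet of the preprocessing step). Every example therefore gets the same label, so training is pointless and all features are irrelevant.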
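To see the effect in isolation, here is a minimal sketch that needs no NLTK. The three sample lines and the function name label_demo are made up for illustration; they only assume the "label,text" layout that the question's code reads.

```python
def label_demo():
    # Hypothetical sample in the question's layout: label in column 0, text from column 2.
    raw_lines = ["1,good day", "0,bad day", "0,awful"]

    documents = []
    for line in raw_lines:
        sentiment = "negative" if line[0] == "0" else "positive"
        documents.append((line[2:].split(), sentiment))
    # After the loop, `sentiment` still holds the label of the last tweet.

    # Misspelled loop variable: `sentiment` refers to the leftover outer variable.
    buggy = [sentiment for (words, sentment) in documents]
    # Correct spelling: `sentiment` is rebound for every document.
    fixed = [sentiment for (words, sentiment) in documents]
    return buggy, fixed


buggy, fixed = label_demo()
print(buggy)
print(fixed)
```

With the misspelling, every entry carries the last tweet's label. With the correct spelling, each document keeps its own.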
Fix it and you will get:
('Naive Bayes Accuracy:', 66.75)
Most Informative Features
     -- = True    positi : negati = 6.9 : 1.0
  these = True    positi : negati = 5.6 : 1.0
   face = True    positi : negati = 5.6 : 1.0
    saw = True    positi : negati = 5.6 : 1.0
      ] = True    positi : negati = 4.4 : 1.0
  later = True    positi : negati = 4.4 : 1.0
   love = True    positi : negati = 4.1 : 1.0
     ta = True    positi : negati = 4.0 : 1.0
  quite = True    positi : negati = 4.0 : 1.0
 trying = True    positi : negati = 4.0 : 1.0
  small = True    positi : negati = 4.0 : 1.0
    thx = True    positi : negati = 4.0 : 1.0
  music = True    positi : negati = 4.0 : 1.0
      p = True    positi : negati = 4.0 : 1.0
husband = True    positi : negati = 4.0 : 1.0
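For reference, a corrected version of the feature-building steps might look like the sketch below. It is only a sketch, and several things in it are assumptions rather than part of the answer: the helper names (parse_line, top_words, build_feature_set, train_and_report) are hypothetical, str.split() stands in for nltk.word_tokenize, and collections.Counter stands in for nltk.FreqDist so that the helpers run without NLTK installed. It also changes two details the answer does not discuss and that may be worth checking in the question's code. First, line[0] is a one-character string, so comparing it with the integer 0 is always False and every tweet would be labelled "positive". Second, in recent NLTK versions, as far as I know, FreqDist.keys() is not sorted by frequency, so most_common(n) is the safer way to get the top n words.

```python
from collections import Counter


def parse_line(line):
    """Turn one 'label,text' line into (tokens, sentiment).

    Assumes the layout the question reads: label in column 0, text from column 2.
    """
    # Compare against the string "0"; line[0] == 0 would never be true.
    sentiment = "negative" if line[0] == "0" else "positive"
    # Assumption: whitespace splitting replaces nltk.word_tokenize in this sketch.
    tokens = [w.lower() for w in line[2:].split()]
    return tokens, sentiment


def top_words(documents, n=3000):
    """Return the n most frequent tokens across all documents."""
    counts = Counter(w for tokens, _ in documents for w in tokens)
    return [w for w, _ in counts.most_common(n)]


def find_features(tokens, word_features):
    """Map each feature word to whether it occurs in this tweet."""
    present = set(tokens)
    return {w: (w in present) for w in word_features}


def build_feature_set(documents, word_features):
    # The loop variable is spelled `sentiment`, so each tweet keeps its own label.
    return [(find_features(tokens, word_features), sentiment)
            for (tokens, sentiment) in documents]


def train_and_report(feature_set, n_train=1000):
    """Train and evaluate as in the question; requires NLTK."""
    import nltk  # imported lazily so the helpers above work without NLTK
    classifier = nltk.NaiveBayesClassifier.train(feature_set[:n_train])
    accuracy = nltk.classify.accuracy(classifier, feature_set[n_train:])
    print("Naive Bayes Accuracy:", accuracy * 100)
    classifier.show_most_informative_features(15)
    return classifier
```

Typical use would be to read up to 3000 lines from the training file, call parse_line on each, shuffle the resulting list, then pass the output of build_feature_set to train_and_report.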