Python scikit-learn prediction fails

Question

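I am new to Python and machine learning. I am trying to implement a simple machine learning script that predicts the topic of a text, e.g. a text about Barack Obama should be mapped to politicians.

I think I am taking the right approach, but I am not 100% sure, so I am asking you.

First of all, here is my small script: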

#imports
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
#dictionary for mapping the targets
categories_dict = {'0' : 'politiker','1' : 'nonprofit org'}

import glob
#get filenames from docs
filepaths = glob.glob('Data/*.txt')
print(filepaths)

docs = []

for path in filepaths:
    # read each training document and close the file afterwards
    with open(path, 'r') as doc:
        docs.append(doc.read())
#print docs


count_vect = CountVectorizer()
#train Data
X_train_count = count_vect.fit_transform(docs)
#print X_train_count.shape

#tfidf transformation (occurrences to frequencies)
tfdif_transform = TfidfTransformer()
X_train_tfidf = tfdif_transform.fit_transform(X_train_count)

#get the categories you want to predict in a list, these must be in the same order as the train docs!
categories = ['0','0','0','1','1']
clf = MultinomialNB().fit(X_train_tfidf,categories)

#try to predict
to_predict = ['Barack Obama is the President of the United States','Greenpeace']

#transform(not fit_transform) the new data you want to predict
X_pred_counts = count_vect.transform(to_predict)
X_pred_tfidf = tfdif_transform.transform(X_pred_counts)
print(X_pred_tfidf)

#predict
predicted = clf.predict(X_pred_tfidf)

for doc,category in zip(to_predict,predicted):
    print('%r => %s' %(doc,categories_dict[category]))

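I am sure about the general workflow needed here, but I am not sure how to map the categories to the docs I use to train the classifier. I know they must be in the correct order, and I think I have got that right, but the script does not output the right category.

Is that because the documents I use to train the classifier are bad, or am I making a mistake I am not aware of?

It predicts that both new texts are about target 0 (politicians).

Thanks in advance.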

Answer 1

It looks like the model hyperparameters are not tuned correctly. It is hard to draw conclusions from so little data, but if you use:

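from sklearn.linear_model import LogisticRegression
# alpha is the smoothing parameter (default 1.0); recent scikit-learn versions only accept it as a keyword
model = MultinomialNB(alpha=0.5).fit(X, y)
# or
model = LogisticRegression().fit(X, y)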

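you will get the expected results, at least for words like "Greenpeace", "Obama", and "President", which are so clearly correlated with their corresponding class. I took a quick look at the coefficients of the model and it seems to be doing the right thing.

For a more sophisticated approach to topic modeling, I recommend that you check out gensim.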
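To make this concrete, here is a minimal sketch of the same experiment. It assumes five made-up training sentences in place of the files in Data/*.txt, so the documents, the name `tfidf_transform`, and the exact numbers are illustrative only; results on the real documents may differ. It fits both suggested models and prints the per-class log probability that Naive Bayes learned for the word "greenpeace", which is one way to look at what the model picked up.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Toy stand-ins for the files in Data/*.txt (assumption: not the asker's data).
docs = [
    'Barack Obama was president of the United States',
    'Angela Merkel was chancellor of Germany',
    'The senator won the election for president',
    'Greenpeace is an environmental nonprofit organization',
    'Amnesty International is a nonprofit human rights organization',
]
categories = ['0', '0', '0', '1', '1']
to_predict = ['Barack Obama is the President of the United States', 'Greenpeace']

# Same two-step feature extraction as in the question.
count_vect = CountVectorizer()
tfidf_transform = TfidfTransformer()
X_train = tfidf_transform.fit_transform(count_vect.fit_transform(docs))
X_pred = tfidf_transform.transform(count_vect.transform(to_predict))

# Less smoothing than the default alpha=1.0, as suggested above.
nb = MultinomialNB(alpha=0.5).fit(X_train, categories)
nb_predicted = nb.predict(X_pred)

# Alternative model from the answer, with default settings.
logreg = LogisticRegression().fit(X_train, categories)
logreg_predicted = logreg.predict(X_pred)

# feature_log_prob_ has one row per class (ordered as in nb.classes_)
# and one column per vocabulary entry.
idx = count_vect.vocabulary_['greenpeace']
for label, row in zip(nb.classes_, nb.feature_log_prob_):
    print('log P(greenpeace | class %s) = %.3f' % (label, row[idx]))

print('MultinomialNB(alpha=0.5):', list(nb_predicted))
print('LogisticRegression():    ', list(logreg_predicted))
```

With TF-IDF features each document contributes only fractional "counts", so a smoothing value of 1.0 can outweigh the evidence from a single word and leave the class prior (three politician documents versus two nonprofit ones) to decide. Lowering `alpha` reduces that effect.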
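On the ordering concern raised in the question: `glob.glob` returns paths in arbitrary order, so a hand-written label list can silently drift out of step with the documents. One way to avoid that is to derive each label from its file path while loading. The sketch below assumes a hypothetical naming convention (files starting with `politician_` belong to class '0', everything else to class '1'); the helper names `load_labeled_docs` and `label_from_prefix` are made up for illustration, and the demo writes throwaway files instead of reading the real Data/ directory.

```python
import glob
import os
import tempfile

def load_labeled_docs(pattern, label_for_path):
    """Return (paths, docs, labels) with all three lists in the same order."""
    paths = sorted(glob.glob(pattern))  # glob itself gives no ordering guarantee
    docs, labels = [], []
    for path in paths:
        with open(path, 'r') as handle:
            docs.append(handle.read())
        labels.append(label_for_path(path))
    return paths, docs, labels

def label_from_prefix(path):
    # Hypothetical convention: 'politician_*.txt' is class '0', anything else '1'.
    return '0' if os.path.basename(path).startswith('politician_') else '1'

# Demo on throwaway files so the sketch runs without the real Data/ directory.
samples = {
    'politician_obama.txt': 'Barack Obama was president of the United States',
    'politician_merkel.txt': 'Angela Merkel was chancellor of Germany',
    'org_greenpeace.txt': 'Greenpeace is an environmental nonprofit organization',
}
with tempfile.TemporaryDirectory() as data_dir:
    for name, text in samples.items():
        with open(os.path.join(data_dir, name), 'w') as handle:
            handle.write(text)
    paths, docs, categories = load_labeled_docs(
        os.path.join(data_dir, '*.txt'), label_from_prefix)

# Each label is printed next to the file it was derived from.
for path, category in zip(paths, categories):
    print(os.path.basename(path), '=>', category)
```

The resulting `docs` and `categories` can be passed to `fit_transform` and `fit` exactly as in the question's script.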
