sklearn - predict top 3-4 labels in multi-label classifications from text documents
I currently have a MultinomialNB() classifier set up with a CountVectorizer to extract features from text documents. It works well, but I would like to use the same setup to predict the top 3-4 labels, not just the single most likely one.
The main reason is that there are c.90 labels and the data entry is not great, so the top-1 estimate only reaches about 35% accuracy. If I could present the user with the top 3-4 most likely labels as suggestions, I could significantly improve the effective coverage.
Any suggestions? Any pointers would be greatly appreciated!
The current code is below:
import numpy
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.cross_validation import KFold
from sklearn.metrics import confusion_matrix, accuracy_score

# Load the corpus and shuffle the rows
df = pd.read_csv("data/corpus.csv", sep=",", encoding="latin-1")
df = df.set_index('id')
df.columns = ['class', 'text']
data = df.reindex(numpy.random.permutation(df.index))

# Bag-of-words (unigrams + bigrams) feeding a multinomial Naive Bayes classifier
pipeline = Pipeline([
    ('count_vectorizer', CountVectorizer(ngram_range=(1, 2))),
    ('classifier', MultinomialNB())
])

# 6-fold cross-validation
k_fold = KFold(n=len(data), n_folds=6, shuffle=True)
for train_indices, test_indices in k_fold:
    train_text = data.iloc[train_indices]['text'].values
    train_y = data.iloc[train_indices]['class'].values.astype(str)
    test_text = data.iloc[test_indices]['text'].values
    test_y = data.iloc[test_indices]['class'].values.astype(str)

    pipeline.fit(train_text, train_y)
    predictions = pipeline.predict(test_text)

    confusion = confusion_matrix(test_y, predictions)
    accuracy = accuracy_score(test_y, predictions)
    print(accuracy)
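For reference, sklearn.cross_validation was later removed from scikit-learn; on newer versions the same cross-validation loop can be written with sklearn.model_selection. A rough sketch, keeping the rest of the code above unchanged:

from sklearn.model_selection import KFold

k_fold = KFold(n_splits=6, shuffle=True)
for train_indices, test_indices in k_fold.split(data):
    train_text = data.iloc[train_indices]['text'].values
    train_y = data.iloc[train_indices]['class'].values.astype(str)
    test_text = data.iloc[test_indices]['text'].values
    test_y = data.iloc[test_indices]['class'].values.astype(str)

    pipeline.fit(train_text, train_y)
    predictions = pipeline.predict(test_text)
    print(accuracy_score(test_y, predictions))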
Once the pipeline is fitted, you can get the probability of each label:
labels_probability = pipeline.predict_proba(test_text)
This gives you one probability per label for each document. See http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline.predict_proba
To get the top N labels, simply do:
import numpy as np
n = 3
top_n_predictions = np.argsort(labels_probability, axis=1)[:, -n:]
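The columns of labels_probability follow the fitted classifier's classes_ order, so those indices can be mapped back to label names, and you can also measure a top-n "hit rate" (how often the true label appears among the suggestions). A minimal sketch, assuming the pipeline and the test_text / test_y variables from the code above:

import numpy as np

labels_probability = pipeline.predict_proba(test_text)
classes = pipeline.named_steps['classifier'].classes_  # column order of predict_proba

n = 3
# indices of the n largest probabilities per row, most likely label first
top_n_idx = np.argsort(labels_probability, axis=1)[:, -n:][:, ::-1]
top_n_labels = classes[top_n_idx]

# fraction of documents whose true label is among the top-n suggestions
top_n_hit_rate = np.mean([test_y[i] in top_n_labels[i] for i in range(len(test_y))])
print(top_n_hit_rate)

That hit rate is the number you would be improving by showing 3-4 suggestions instead of one.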