How to identify the ID / name / title of misclassified text files with scikit-learn
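I am building my own classifier for text classification, but for now I am playing with scikit-learn to figure a few things out. I used an NB classifier to classify some of my text files. I am working with 26 text files that were manually sorted into 2 categories, and each file is numbered from 01 to 26 (i.e. "01.txt" and so on).
My code and results: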
import sklearn
from sklearn.datasets import load_files
import numpy as np

bunch = load_files('corpus')
split_pcnt = 0.75
split_size = int(len(bunch.data) * split_pcnt)
X_train = bunch.data[:split_size]
X_test = bunch.data[split_size:]
y_train = bunch.target[:split_size]
y_test = bunch.target[split_size:]

from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer, HashingVectorizer, CountVectorizer

clf_1 = Pipeline([('vect', CountVectorizer()),
                  ('clf', MultinomialNB()),
                  ])

# note: sklearn.cross_validation was removed in scikit-learn 0.20; newer
# versions provide these in sklearn.model_selection (with a different KFold signature)
from sklearn.cross_validation import cross_val_score, KFold
from scipy.stats import sem

def evaluate_cross_validation(clf, X, y, K):
    # create a k-fold cross-validation iterator with K folds
    cv = KFold(len(y), K, shuffle=True, random_state=0)
    # by default the score used is the one returned by the score method of the estimator (accuracy)
    scores = cross_val_score(clf, X, y, cv=cv)
    print(scores)
    print("Mean score: {0:.3f} (+/-{1:.3f})".format(np.mean(scores), sem(scores)))

clfs = [clf_1]
for clf in clfs:
    evaluate_cross_validation(clf, bunch.data, bunch.target, 5)
[ 0.5 0.4 0.4 0.4 0.6]
Mean score: 0.460 (+/-0.040)
from sklearn import metrics

def train_and_evaluate(clf, X_train, X_test, y_train, y_test):
    # fit on the training split, then report accuracy on both splits
    clf.fit(X_train, y_train)
    print("Accuracy on training set:")
    print(clf.score(X_train, y_train))
    print("Accuracy on testing set:")
    print(clf.score(X_test, y_test))
    y_pred = clf.predict(X_test)
    print("Classification Report:")
    print(metrics.classification_report(y_test, y_pred))
    print("Confusion Matrix:")
    print(metrics.confusion_matrix(y_test, y_pred))

train_and_evaluate(clf_1, X_train, X_test, y_train, y_test)
Accuracy on training set:
1.0
Accuracy on testing set:
0.714285714286
Classification Report:
             precision    recall  f1-score   support

          0       0.67      0.67      0.67         3
          1       0.75      0.75      0.75         4

avg / total       0.71      0.71      0.71         7
Confusion Matrix:
[[2 1]
[1 3]]
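What I can't figure out is how to identify the IDs of the misclassified files, so that I can see which files were misclassified (e.g. "05.txt" and "23.txt"). Is this possible at all with scikit-learn?
Best,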
古兹德
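Assuming load_files loads the text files in alphabetical order, all you need are the indices of the misclassified examples. You can get them by adding
misclassified = np.where(y_pred != y_test)
print(misclassified)
at the end of the train_and_evaluate function. So if this prints, say, (array([1, 3, 5]),), the documents at positions 1, 3 and 5 of X_test (counting from zero) were misclassified, and under the ordering assumption those positions can be mapped back to file names.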
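One detail of the index approach is easy to miss: the positions found by np.where are counted from the start of the test split, so they have to be shifted by split_size before they line up with the full corpus. A minimal sketch of that bookkeeping follows; the labels are made up and the sorted list of file names is a hypothetical stand-in for the question's corpus, not data taken from it.

```python
import numpy as np

# Hypothetical stand-ins for the question's variables: 26 sorted file names
# and the same 75% / 25% split that the question uses.
filenames = np.array(["%02d.txt" % i for i in range(1, 27)])
split_size = int(len(filenames) * 0.75)   # 19 training files, 7 test files

# Made-up true and predicted labels for the 7 test documents.
y_test = np.array([0, 1, 1, 0, 1, 0, 1])
y_pred = np.array([0, 0, 1, 0, 1, 1, 1])

# np.where returns a tuple of arrays; [0] takes the array of positions.
test_positions = np.where(y_pred != y_test)[0]

# Shift by split_size to get positions in the full, unsplit dataset.
corpus_positions = test_positions + split_size

print(test_positions)               # [1 5]
print(filenames[corpus_positions])  # ['21.txt' '25.txt']
```

The mapping is only right when the order of the file names matches the order of the loaded data. Since load_files shuffles the samples by default (shuffle=True), reading the names from bunch.filenames is the safer route.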
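Yes, you have to use the filenames attribute of the result of load_files.
Note, however, that your example code contains two model training and evaluation cycles: one using cross-validation and the other using a simple train-test split.
For the train-test split:
test_filenames = bunch.filenames[split_size:]
# y_pred is local to train_and_evaluate, so recompute it here (or return it from that function)
y_pred = clf_1.predict(X_test)
misclassified = (y_pred != y_test)
print(test_filenames[misclassified])
This answer does not assume that the text files are in alphabetical order or that every number is present.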
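For the cross-validation cycle, cross_val_score only returns one score per fold, so it cannot say which files were wrong. One possible approach, offered here as an assumption rather than something proposed in the answers, is cross_val_predict from sklearn.model_selection (scikit-learn 0.18 and later), which returns an out-of-fold prediction for every document. The corpus below is a throwaway set of hypothetical documents written to a temporary directory so that the sketch runs on its own; the class names, texts and fold count are all illustrative.

```python
import os
import shutil
import tempfile

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Build a tiny throwaway corpus (hypothetical content) in the layout that
# load_files expects: one sub-directory per class.
root = tempfile.mkdtemp()
docs = {
    "sports": ["the team won the match", "a late goal won the game",
               "the coach praised the team", "fans cheered the goal"],
    "tech": ["the new phone has a fast chip", "the laptop runs new software",
             "a software update fixed the bug", "the chip powers the laptop"],
}
for label, texts in docs.items():
    os.makedirs(os.path.join(root, label))
    for i, text in enumerate(texts, start=1):
        with open(os.path.join(root, label, "%02d.txt" % i), "w") as fh:
            fh.write(text)

bunch = load_files(root, encoding="utf-8")

clf = Pipeline([("vect", CountVectorizer()), ("clf", MultinomialNB())])

# Each document is predicted by a model that did not see it during training.
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
y_pred = cross_val_predict(clf, bunch.data, bunch.target, cv=cv)

# bunch.filenames is aligned with bunch.data and bunch.target, so the same
# boolean mask selects the names of the misclassified files.
misclassified = y_pred != bunch.target
print(bunch.filenames[misclassified])

# Remove the temporary corpus again.
shutil.rmtree(root)
```

Because the mask is applied to bunch.filenames directly, this works regardless of the order in which load_files returns the documents.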