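Finding correctly and incorrectly classified data
I want to find the raw data that was classified correctly and the raw data that was misclassified after applying the Multinomial Naive Bayes classification algorithm.
For example, after applying Multinomial Naive Bayes classification I get an accuracy of 88%.
I want to see the 12% of the data that was misclassified and the 88% that was classified correctly.
Thanks in advance.
My dataset: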
+----------------------+------------+
| Details              | Category   |
+----------------------+------------+
| Any raw text1        | cat1       |
+----------------------+------------+
| any raw text2        | cat1       |
+----------------------+------------+
| any raw text5        | cat2       |
+----------------------+------------+
| any raw text7        | cat1       |
+----------------------+------------+
| any raw text8        | cat2       |
+----------------------+------------+
| Any raw text4        | cat4       |
+----------------------+------------+
| any raw text5        | cat4       |
+----------------------+------------+
| any raw text6        | cat3       |
+----------------------+------------+
My code:
import pandas as pd
import numpy as np
import scipy as sp
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

data = pd.read_csv('mydat.xls', delimiter='\t',
                   usecols=['Details', 'Category'], encoding='utf-8')
target_one = data['Category']
target_list = data['Category'].unique()
x_train, x_test, y_train, y_test = train_test_split(
    data.Details, data.Category, random_state=42)
vect = CountVectorizer(ngram_range=(1, 2))
# converting training features into numeric vectors
X_train = vect.fit_transform(x_train.values.astype('U'))
# converting test features into numeric vectors (same vocabulary)
X_test = vect.transform(x_test.values.astype('U'))
# start = time.clock()
mnb = MultinomialNB(alpha=0.13)
mnb.fit(X_train, y_train)
result = mnb.predict(X_test)
# mnb.predict_proba(x_test)[0:10,1]
accuracy_score(y_test, result)
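Just iterate over your data:
# x_test holds the raw text; X_test is only its sparse vectorized form.
# zip pairs values by position, which avoids label-based lookups on the
# shuffled index that train_test_split leaves on y_test.
for text, actual, predicted in zip(x_test, y_test, result):
    if predicted == actual:
        print("CORRECT: ", text)
    else:
        print("INCORRECT: ", text)
You can add them to two different lists, print only the IDs, or do whatever else you need.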
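If you would rather keep the two groups than print them, one option is a boolean mask over a small DataFrame. This is only a sketch: `x_test`, `y_test` and `result` below are made-up stand-ins with the same shapes as the variables in the question, and `report`, `correct` and `incorrect` are names chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Stand-ins for the question's variables (made-up values for illustration).
# x_test / y_test keep the shuffled index left by train_test_split;
# result is a plain NumPy array, like the output of mnb.predict(X_test).
x_test = pd.Series(["any raw text2", "any raw text5", "any raw text6"],
                   index=[1, 2, 7], name="Details")
y_test = pd.Series(["cat1", "cat2", "cat3"], index=[1, 2, 7], name="Category")
result = np.array(["cat1", "cat4", "cat3"])

# Put raw text, true label and prediction side by side,
# keeping the original row IDs as the index.
report = pd.DataFrame(
    {
        "Details": x_test.values,
        "actual": y_test.values,
        "predicted": result,
    },
    index=x_test.index,
)

# True where the prediction matches the true label.
is_correct = report["actual"] == report["predicted"]
correct = report[is_correct]
incorrect = report[~is_correct]

print("Correctly classified:")
print(correct)
print("Misclassified:")
print(incorrect)
```

The index of `incorrect` then gives the row IDs of the misclassified samples in the original file, and `len(correct) / len(report)` should agree with the accuracy score.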