Sklearn accuracy score does not match output results for Naive Bayes classifier

Question

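I have the following scenario: I need to determine, from a list of strings (500,000 of them), which strings refer to businesses and which refer to people.

A reduced example of the problem:

Whosebug LLC -> Business
John Doe -> Person
John Doe Inc. -> Business

Luckily for me, I have all 500,000 names labeled, so this becomes a supervised problem. Yay.

The first model I ran is a simple Naive Bayes (multinomial). Here is the code: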

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df["CUST_NM_CLEAN"], 
                                                    df["LABEL"],test_size=0.20, 
                                                    random_state=1)

# Instantiate the CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
count_vector = CountVectorizer()

# Fit the training data and then return the matrix
training_data = count_vector.fit_transform(X_train)

# Transform testing data and return the matrix. 
testing_data = count_vector.transform(X_test)

#in this case we try multinomial, there are two other methods
from sklearn.naive_bayes import MultinomialNB
naive_bayes = MultinomialNB()
naive_bayes.fit(training_data,y_train)
#MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

predictions = naive_bayes.predict(testing_data)


from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
print('Accuracy score: {}'.format(accuracy_score(y_test, predictions)))
print('Precision score: {}'.format(precision_score(y_test, predictions, pos_label='Org')))
print('Recall score: {}'.format(recall_score(y_test, predictions, pos_label='Org')))
print('F1 score: {}'.format(f1_score(y_test, predictions, pos_label='Org')))

The results I get:

Accuracy score: 0.9524850665857665
Precision score: 0.9828196680932295
Recall score: 0.8890405236039549
F1 score: 0.9335809546092653

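Not bad for a first attempt. However, when I export the results to a file and compare the predictions with the labels, I get a very low accuracy, around 60%. That is far from the 95% score sklearn reports...

Any ideas?

This is how I write the output file, which may be the cause: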

mnb_results = np.array(list(zip(df["CUST_NM_CLEAN"].values.tolist(),df["LABEL"],predictions)))
mnb_results = pd.DataFrame(mnb_results, columns=['name','predicted', 'label'])
mnb_results.to_csv('mnb_vectorized.csv', index = False)

P.S. I am new to this, so apologies if there is an obvious solution here.

Answer 1

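One thing to check is the export to CSV. df["CUST_NM_CLEAN"] and df["LABEL"] hold all 500,000 rows in their original order, while predictions only covers the shuffled 20% test split, so zipping them together pairs each prediction with an unrelated row. If you use the CSV for validation, I think you need to export X_test, y_test, and predictions instead. Cross-validation can also help check whether the model performs as expected.

Old: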

mnb_results = np.array(list(zip(df["CUST_NM_CLEAN"].values.tolist(),df["LABEL"],predictions)))

Changed:

mnb_results = np.array(list(zip(X_test, y_test, predictions)))
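
To illustrate why the old line gives misleading numbers, here is a minimal standard-library sketch. The toy names and labels, the 1,000-row size, and the "perfect classifier" are all assumptions made for illustration; the shuffled split is only mimicked with random.sample rather than produced by train_test_split.

```python
import random

# Hypothetical toy data: 1,000 names with alternating labels.
names = ["name_{}".format(i) for i in range(1000)]
labels = ["Org" if i % 2 == 0 else "Person" for i in range(1000)]

# Mimic a shuffled 80/20 split: 200 random row positions form the test set.
rng = random.Random(1)
test_idx = rng.sample(range(len(names)), 200)
X_test = [names[i] for i in test_idx]
y_test = [labels[i] for i in test_idx]

# Pretend the classifier is perfect on the test rows.
predictions = list(y_test)


def agreement(rows):
    """Fraction of (name, label, prediction) rows where label == prediction."""
    rows = list(rows)
    return sum(label == pred for _, label, pred in rows) / len(rows)


# Old export: zip stops at the shortest input, so the first 200 rows of the
# full data (original order) are paired with predictions for shuffled rows.
misaligned = agreement(zip(names, labels, predictions))

# Changed export: every prediction is paired with the row it was made for.
aligned = agreement(zip(X_test, y_test, predictions))

print("misaligned agreement:", misaligned)
print("aligned agreement:", aligned)
```

Even with a classifier that is never wrong on the test rows, the misaligned export only agrees with the labels about as often as chance would allow, which is the kind of gap described in the question.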

More details:

# Compute the accuracy score with NumPy (other metrics can be computed similarly):
import numpy as np
true = np.asarray([1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0])
predictions = np.asarray([1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0])
print("Accuracy:{}".format(np.mean(true==predictions)))
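
The cross-validation check mentioned earlier in this answer could look roughly like the sketch below. The toy names and labels are hypothetical stand-ins for df["CUST_NM_CLEAN"] and df["LABEL"], and cv=3 is chosen only because the toy set is tiny. Wrapping the vectorizer and classifier in a pipeline means the vocabulary is refit on each training fold.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for df["CUST_NM_CLEAN"] and df["LABEL"].
names = [
    "Acme LLC", "Globex Inc", "Initech Corp", "Umbrella Inc",
    "Hooli LLC", "Stark Corp",
    "John Doe", "Jane Roe", "Mary Smith", "John Smith",
    "Jane Doe", "Mary Roe",
]
labels = ["Org"] * 6 + ["Person"] * 6

# Vectorizer and classifier in one estimator, so each fold fits its own vocabulary.
model = make_pipeline(CountVectorizer(), MultinomialNB())

# One accuracy value per fold; a large spread between folds would be a warning sign.
scores = cross_val_score(model, names, labels, cv=3, scoring="accuracy")
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```

If the fold scores on the real data sit near the 95% reported by accuracy_score, that would suggest the model is fine and the 60% figure comes from the export step.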

python

scikit-learn

naivebayes