Applying cross-validation in Naive Bayes

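My dataset consists of spam and ham (non-spam) Filipino messages.

I split my dataset into 60% training, 20% test, and 20% validation.

Splitting the data into test, train, and validation sets: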

from sklearn.model_selection import train_test_split


data['label'] = (data['label'].replace({'ham'  : 0,
                                         'spam' : 1}))
X_train, X_test, y_train, y_test = train_test_split(data['message'], 
                                                        data['label'], test_size=0.2, random_state=1)
    
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=1) # 0.25 x 0.8 = 0.2 
print('Total: {} rows'.format(data.shape[0]))
print('Train: {} rows'.format(X_train.shape[0]))
print(' Test: {} rows'.format(X_test.shape[0]))
print(' Validation: {} rows'.format(X_val.shape[0]))

Training a MultinomialNB from sklearn:

from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
import numpy as np
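# train_data and test_data are not defined in the post; they are presumably
# the vectorized (numeric) versions of X_train and X_test.
naive_bayes = MultinomialNB().fit(train_data,
                                  y_train)
predictions = naive_bayes.predict(test_data)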

Evaluating the model:

from sklearn.metrics import (accuracy_score, 
                             precision_score,
                             recall_score, 
                             f1_score)
# Store the results under new names so the imported metric functions
# are not overwritten and can be called again.
accuracy = accuracy_score(y_test,
                          predictions)
precision = precision_score(y_test,
                            predictions)
recall = recall_score(y_test,
                      predictions)
f1 = f1_score(y_test,
              predictions)

My problem is in the validation step. The error says:

warnings.warn("Estimator fit failed. The score on this train-test"

This is how I wrote my validation code; I'm not sure whether I'm doing it correctly:

from sklearn.model_selection import cross_val_score

mnb = MultinomialNB()
scores = cross_val_score(mnb, X_val, y_val, cv=10, scoring='accuracy')

print('Cross-validation scores:{}'.format(scores))

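First, it's worth noting that just because it is called cross-validation, that doesn't mean you have to run it on a validation set, as you did in your code. There are several reasons to perform cross-validation, including:

  • Ensuring that all of your data is used both for training and for evaluating the model's performance.
  • Performing hyperparameter tuning.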

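Your case leans toward the first use case, so you don't need to split into train, val, and test first. Instead, you can perform 10-fold cross-validation on your entire dataset.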
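As a rough sketch of the suggestion above, the snippet below cross-validates on the full dataset instead of on a validation split. The messages and labels are made-up stand-ins for data['message'] and data['label'], and 3 folds are used only because the toy set is tiny (on the real data, cv=10 would apply). Wrapping CountVectorizer and MultinomialNB in a pipeline is an assumption about how the text is vectorized, not something the question shows. It lets raw text be passed to cross_val_score, which is probably what triggers the "Estimator fit failed" warning in the question, since MultinomialNB cannot be fit on raw strings.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for data['message'] and data['label'].
messages = [
    "win a free prize now", "free cash click here", "claim your free reward",
    "cheap loans apply now", "you won a free phone", "urgent offer free money",
    "are we meeting today", "see you at lunch", "please send the report",
    "call me when you arrive", "the meeting is at noon", "thanks for your help",
]
labels = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0])  # 1 = spam, 0 = ham

# The vectorizer is refit inside every fold, so the vocabulary is learned
# from that fold's training part only.
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())

# Stratified folds keep the spam/ham ratio similar in each fold.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)
scores = cross_val_score(pipeline, messages, labels, cv=cv, scoring="accuracy")

print("Cross-validation scores: {}".format(scores))
print("Mean accuracy: {:.3f}".format(scores.mean()))
```

Keeping the vectorizer inside the pipeline also avoids fitting the vocabulary on rows that later end up in a test fold.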

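If you are doing hyperparameter tuning, you can set aside a hold-out set, say 30%, and use the remaining 70% for cross-validation. Once you have determined the best parameters, you can evaluate the model with those parameters on the hold-out set.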
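The hold-out workflow described above could look roughly like this. It is only a sketch: the toy messages are invented, the choice of GridSearchCV as the tuning tool and of alpha (the smoothing parameter of MultinomialNB) as the parameter to tune are assumptions, and the candidate values are arbitrary.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data; replace with the real messages and labels.
messages = [
    "win a free prize now", "free cash click here", "claim your free reward",
    "cheap loans apply now", "you won a free phone", "urgent offer free money",
    "free entry to win cash", "click now for a free gift",
    "exclusive deal claim now", "win money free today",
    "are we meeting today", "see you at lunch", "please send the report",
    "call me when you arrive", "the meeting is at noon", "thanks for your help",
    "can you review my draft", "dinner at home tonight",
    "the report is attached", "let us talk tomorrow",
]
labels = np.array([1] * 10 + [0] * 10)  # 1 = spam, 0 = ham

# Set aside 30% as a hold-out set that the search never sees.
X_cv, X_holdout, y_cv, y_holdout = train_test_split(
    messages, labels, test_size=0.3, stratify=labels, random_state=1)

pipeline = Pipeline([("vect", CountVectorizer()), ("nb", MultinomialNB())])

# Cross-validated search over the smoothing parameter on the remaining 70%.
param_grid = {"nb__alpha": [0.1, 0.5, 1.0]}
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(X_cv, y_cv)

# GridSearchCV refits the best pipeline on all of X_cv by default,
# so it can be scored directly on the hold-out set.
holdout_accuracy = accuracy_score(y_holdout, search.predict(X_holdout))
print("Best parameters: {}".format(search.best_params_))
print("Hold-out accuracy: {:.3f}".format(holdout_accuracy))
```

Because the hold-out rows play no part in choosing alpha, the hold-out accuracy is a less optimistic estimate than the best cross-validation score from the search.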

Some references:

https://towardsdatascience.com/5-reasons-why-you-should-use-cross-validation-in-your-data-science-project-8163311a1e79

https://www.analyticsvidhya.com/blog/2021/11/top-7-cross-validation-techniques-with-python-code/

https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6

I don't get any errors or warnings with the code below. Maybe this will work for you.

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import (accuracy_score, 
                             precision_score,
                             recall_score, 
                             f1_score)
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import MultinomialNB

df = pd.read_csv("https://raw.githubusercontent.com/jeffprosise/Machine-Learning/master/Data/ham-spam.csv")

vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words='english')
x = vectorizer.fit_transform(df['Text'])
y = df['IsSpam']

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)    
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=1) # 0.25 x 0.8 = 0.2 

print('Total: {} rows'.format(df.shape[0]))  # the DataFrame here is named df, not data
print('Train: {} rows'.format(X_train.shape[0]))
print(' Test: {} rows'.format(X_test.shape[0]))
print(' Validation: {} rows'.format(X_val.shape[0]))

naive_bayes = MultinomialNB().fit(X_train, y_train)
predictions = naive_bayes.predict(X_test)

# Use new names for the results so the metric functions are not overwritten.
accuracy = accuracy_score(y_test, predictions)
precision = precision_score(y_test, predictions)
recall = recall_score(y_test, predictions)
f1 = f1_score(y_test, predictions)

mnb = MultinomialNB()
scores = cross_val_score(mnb,X_val,y_val, cv = 10, scoring='accuracy')
print('Cross-validation scores:{}'.format(scores))

Result:

Total: 1000 rows
Train: 600 rows
 Test: 200 rows
 Validation: 200 rows
Cross-validation scores:[1.   0.95 0.85 1.   1.   0.9  0.9  0.8  0.9  0.9 ]
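
To read the ten fold scores above as a single figure, one common approach is to report their mean and standard deviation. The short sketch below does that with the printed values copied in by hand. Note that with 200 validation rows and cv = 10, each fold scores only 20 rows, so every score moves in steps of 0.05.

```python
import numpy as np

# Fold scores copied from the output above.
scores = np.array([1.0, 0.95, 0.85, 1.0, 1.0, 0.9, 0.9, 0.8, 0.9, 0.9])

mean_score = scores.mean()
std_score = scores.std()  # population standard deviation (ddof=0)

print("Mean accuracy: {:.2f} (+/- {:.2f})".format(mean_score, std_score))
# prints: Mean accuracy: 0.92 (+/- 0.06)
```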