Imblearn Pipeline resulting in poor metrics
I am working with an imbalanced dataset created with the following code:
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0, n_clusters_per_class=1,
                           weights=[0.99], flip_y=0, random_state=1)
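I tried to remove the imbalance by oversampling with SMOTE and then fitting an ML model. I did this first with the normal method and then by creating a pipeline.
Normal method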
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate
# oversampling the whole dataset before cross-validation
oversampled_data = SMOTE(sampling_strategy=0.5)
X_over, y_over = oversampled_data.fit_resample(X, y)
logistic = LogisticRegression(solver='liblinear')
scoring = ['accuracy', 'precision', 'recall', 'f1']
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluating the model
scores = cross_validate(logistic, X_over, y_over, scoring=scoring, cv=cv, n_jobs=-1, return_train_score=True)
print('Accuracy: {:.2f}, Precision: {:.2f}, Recall: {:.2f}, F1: {:.2f}'.format(np.mean(scores['test_accuracy']), np.mean(scores['test_precision']), np.mean(scores['test_recall']), np.mean(scores['test_f1'])))
Output - Accuracy: 0.93, Precision: 0.92, Recall: 0.86, F1: 0.89
Pipeline
from imblearn.pipeline import make_pipeline, Pipeline
from sklearn.model_selection import cross_validate
# SMOTE is now a pipeline step, so it is fitted inside each cross-validation split
oversampled_data = SMOTE(sampling_strategy=0.5)
pipeline = Pipeline([('smote', oversampled_data), ('model', LogisticRegression())])
# pipeline = make_pipeline(oversampled_data, logistic)
scoring = ['accuracy', 'precision', 'recall', 'f1']
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluating the model
scores = cross_validate(pipeline, X, y, scoring=scoring, cv=cv, n_jobs=-1, return_train_score=True)
print('Accuracy: {:.2f}, Precision: {:.2f}, Recall: {:.2f}, F1: {:.2f}'.format(np.mean(scores['test_accuracy']), np.mean(scores['test_precision']), np.mean(scores['test_recall']), np.mean(scores['test_f1'])))
Output - Accuracy: 0.96, Precision: 0.19, Recall: 0.84, F1: 0.31
What am I doing wrong when using the Pipeline, and why are the precision and F1 scores so poor with it?
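In the first approach, you create the synthetic examples before splitting into training and test sets, whereas in the second approach you create them after the split.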
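The former approach adds synthetic data points to the test set, whereas the latter does not. In addition, the former produces inflated scores because of data leakage: it adds synthetic test samples that are (partially) based on some of the data points in the training set.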
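Leakage aside, much of the gap in precision follows from class prevalence alone, which a back-of-the-envelope calculation can show. The sketch below is an illustration, not a measurement: the helper `precision_at_prevalence` is hypothetical, the false-positive rate is back-calculated from the pipeline run reported in the question (recall 0.84, precision 0.19, 1% positives), and it is assumed that the classifier has about the same false-positive rate in both setups.

```python
def precision_at_prevalence(recall, false_positive_rate, n_pos, n_neg):
    """Precision implied by a given recall and false-positive rate for given class counts."""
    true_pos = recall * n_pos
    false_pos = false_positive_rate * n_neg
    return true_pos / (true_pos + false_pos)


# Back-calculate the false-positive rate from the reported pipeline metrics.
# Per 1000 untouched test samples there are 10 positives and 990 negatives.
tp = 0.84 * 10                    # true positives implied by recall 0.84
fp = tp * (1 - 0.19) / 0.19       # false positives implied by precision 0.19
fpr = fp / 990

# Original class counts (100 positives, 9900 negatives).
original = precision_at_prevalence(0.84, fpr, n_pos=100, n_neg=9900)

# Counts after SMOTE(sampling_strategy=0.5) on the whole dataset: 4950 vs 9900,
# using the recall of 0.86 reported for the first approach.
oversampled = precision_at_prevalence(0.86, fpr, n_pos=4950, n_neg=9900)

print(f"false-positive rate: {fpr:.3f}")                   # 0.036
print(f"precision at 1% positives: {original:.2f}")        # 0.19
print(f"precision at a 1:2 ratio:  {oversampled:.2f}")     # 0.92
```

Under these assumptions a false-positive rate of under 4% is enough to push precision down to 0.19 when only 1% of the test samples are positive, while the same rate gives about 0.92 on the oversampled data, close to the figure reported for the first approach. The pipeline score is therefore the more realistic estimate of performance on data with the original class balance.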