How to properly use SMOTE in classification models

Question

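I am using SMOTE to balance the target (y) for model training only, but I want to test the model on the original data, because it makes no sense to evaluate a model on samples that SMOTE itself generated. Please ask for clarification if I have not explained this well. This is my first question on Stack Overflow.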

from imblearn.over_sampling import SMOTE
oversample = SMOTE()
X_sm, y_sm = oversample.fit_resample(X, y)

# Splitting Dataset into Train and Test (Smote)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_sm, y_sm,test_size=0.2,random_state=42)

Here I applied a Random Forest classifier to my data:

import math
from sklearn.metrics import accuracy_score, confusion_matrix
import seaborn as sn
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

# RF = RandomForestClassifier(n_estimators=100)
# RF.fit(X_train, y_train.values.ravel())
# y_pred = RF.predict(X)
# print(metrics.classification_report(y,y_pred))

RF = RandomForestClassifier(n_estimators=10)
RF.fit(X_train, y_train.values.ravel())

If I predict like this, X also contains the data that was used for training. How can I exclude the rows that were already used for training?

y_pred = RF.predict(X)
print(metrics.classification_report(y,y_pred))

Answer 1

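I have used SMOTE in the past, and it is suboptimal. Recently, researchers have shown that the distribution generated by the Synthetic Minority Over-sampling Technique (SMOTE) has some flaws. I know that sometimes we have no choice about imbalanced classes, but you can use sklearn.ensemble.RandomForestClassifier, where you can set an appropriate class_weight to handle the class imbalance.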

See the scikit-learn documentation:

Scikit-documentation
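
To illustrate the class_weight suggestion, here is a minimal sketch. It assumes a synthetic imbalanced dataset from make_classification as a stand-in for the asker's X and y, and it uses class_weight="balanced" as one reasonable setting; the names rf_weighted and y_pred_weighted are illustrative only and do not come from the question.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (roughly 90% / 10%), a stand-in for the real X and y.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=42
)

# stratify=y keeps the class ratio the same in the train and test splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# class_weight="balanced" weights each class inversely to its frequency,
# so no synthetic samples are created.
rf_weighted = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=42
)
rf_weighted.fit(X_train, y_train)

# Evaluate only on rows the model has never seen.
y_pred_weighted = rf_weighted.predict(X_test)
print(classification_report(y_test, y_pred_weighted))
```

Because no resampling takes place, the held-out test set keeps the original class distribution without any extra steps.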

Answer 2

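I agree with razimbres about using class_weight. Your other option is to split the dataset into train and test first, then set the test set aside. From that point on, resample only the training set: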

X_sm, y_sm = oversample.fit_resample(X_train, y_train)
.
.
.

python

machine-learning

scikit-learn

data-science

jupyter-notebook