Resampling dataset for spam classification

Question

I have a class imbalance problem with the following dataset:

Text                             is_it_capital?     is_it_upper?      contains_num?   Label
an example of text                      0                  0               0            0
ANOTHER example of text                 1                  1               0            1
What's happening?Let's talk at 5        1                  0               1            1

and so on. I have 5000 rows/texts (4500 in class 0 and 500 in class 1).

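I need to resample my classes, but I don't know where in my workflow to include this step. I would appreciate it if you could take a look and tell me whether I am missing any steps, or whether you spot any inconsistencies in the approach.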

For the train/test split I am using the following:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=40)

where X and y are

X=df[['Text','is_it_capital?', 'is_it_upper?', 'contains_num?']]
y=df['Label']

import pandas as pd
from sklearn.utils import resample

df_train = pd.concat([X_train, y_train], axis=1)
df_test = pd.concat([X_test, y_test], axis=1)

# Separate the classes in the training set
spam = df_train[df_train.Label == 1]
not_spam = df_train[df_train.Label == 0]

# Oversample the minority class (spam) to match the majority class size
oversampl = resample(spam, replace=True, n_samples=len(not_spam), random_state=42)

oversampled = pd.concat([not_spam, oversampl])
df_train = oversampled.copy()

Output (wrong?):

              precision    recall  f1-score   support

         0.0       0.94      0.98      0.96      3600
         1.0       0.76      0.52      0.62       400

    accuracy                           0.93      4000
   macro avg       0.86      0.77      0.80      4000
weighted avg       0.92      0.93      0.93      4000

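Do you think there is something wrong with my steps for oversampling the dataset, given that the confusion matrix gives me a support of 400 rather than something higher?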

Sorry for the long post, but I think it is worth reporting all the steps so that the approach I have taken is easier to understand.

Answer 1

There is nothing wrong with your approach; it is normal for the evaluation report to show imbalanced data. This is because:

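Resampling is (correctly) done only on the training set, to force the model to give more weight to the minority class.
Evaluation is (correctly) done on the test set, which follows the original imbalanced distribution. Resampling the test set as well would be a mistake, because evaluation must be done on the true distribution of the data.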
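
To make the two points above concrete, here is a minimal sketch of the same workflow on synthetic data. The feature columns `feat_a` and `feat_b`, the random data, and the choice of `LogisticRegression` are all assumptions made for illustration; the question does not say which model or vectorizer is used. It oversamples only the training set and then checks that the `support` column of the report simply counts the true labels in the untouched test set.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic stand-in for the question's data (assumption):
# 4500 rows of class 0 and 500 rows of class 1, two made-up numeric features.
rng = np.random.default_rng(0)
n0, n1 = 4500, 500
df = pd.DataFrame({
    "feat_a": np.concatenate([rng.normal(0.0, 1.0, n0), rng.normal(1.5, 1.0, n1)]),
    "feat_b": np.concatenate([rng.normal(0.0, 1.0, n0), rng.normal(1.0, 1.0, n1)]),
    "Label": np.concatenate([np.zeros(n0, dtype=int), np.ones(n1, dtype=int)]),
})
X = df[["feat_a", "feat_b"]]
y = df["Label"]

# 1. Split first, so the test set never contains resampled rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=40
)

# 2. Oversample the minority class in the training set only.
df_train = pd.concat([X_train, y_train], axis=1)
spam = df_train[df_train["Label"] == 1]
not_spam = df_train[df_train["Label"] == 0]
spam_up = resample(spam, replace=True, n_samples=len(not_spam), random_state=42)
df_train_bal = pd.concat([not_spam, spam_up])

# 3. Fit on the balanced training data (model choice is an assumption).
clf = LogisticRegression()
clf.fit(df_train_bal[["feat_a", "feat_b"]], df_train_bal["Label"])

# 4. Evaluate on the untouched, still imbalanced test set.
y_pred = clf.predict(X_test)
report = classification_report(y_test, y_pred, output_dict=True)

train_counts = df_train_bal["Label"].value_counts().to_dict()
test_counts = y_test.value_counts().to_dict()
print("balanced training counts:", train_counts)
print("test counts:", test_counts)
print("support in report:", {k: report[k]["support"] for k in ("0", "1")})
```

The training counts come out equal for both classes, while the support values match the class counts in `y_test`. Support therefore stays imbalanced no matter how the training set was resampled.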


Tags: python, classification, resampling, scikit-learn, text-classification