Imbalanced multiclass classification dataset: undersample or oversample?

The dataset has about 150K records with four labels, ['A', 'B', 'C', 'D'], distributed as follows:
A: 60000
B: 50000
C: 36000
D: 4000

When I used a classification report to get precision, recall, and f1-score, the f1-score triggered an UndefinedMetricWarning because class D was never predicted, owing to its small number of records.
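To make the warning concrete, here is a minimal pure-Python sketch of per-class precision, recall, and F1. The function name `per_class_scores` and the toy labels are made up for illustration. It shows that when a class is never predicted, its precision is 0/0, so the F1 built from it is undefined as well.

```python
from collections import Counter

def per_class_scores(y_true, y_pred, labels):
    """Return {label: (precision, recall, f1)}; a value is None when it is 0/0."""
    tp = Counter()
    pred_count = Counter(y_pred)   # how often each label was predicted
    true_count = Counter(y_true)   # how often each label actually occurs
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
    scores = {}
    for label in labels:
        precision = tp[label] / pred_count[label] if pred_count[label] else None
        recall = tp[label] / true_count[label] if true_count[label] else None
        if precision is None or recall is None:
            f1 = None              # cannot take a harmonic mean of an undefined value
        elif precision + recall == 0:
            f1 = 0.0               # no true positives at all
        else:
            f1 = 2 * precision * recall / (precision + recall)
        scores[label] = (precision, recall, f1)
    return scores

# Toy data (hypothetical): the model never predicts "D".
y_true = ["A", "A", "B", "C", "D", "D"]
y_pred = ["A", "A", "B", "C", "A", "C"]
scores = per_class_scores(y_true, y_pred, ["A", "B", "C", "D"])
print(scores["D"])
```

Assuming the report comes from scikit-learn, `classification_report` also takes a `zero_division` argument that sets the value reported in this case, but that only silences the warning; the model still never predicts D.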

I know I need to oversample or undersample to fix the imbalanced data.

Question: Is it a good idea to fix the imbalance by randomly sampling 4000 records from each class so that the classes are balanced?
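For reference, the random undersampling described in the question could look roughly like the sketch below. It uses only the standard library, and the helper name `undersample` and the fixed seed are assumptions. With the counts from the question, it keeps 16,000 rows and throws away the other 134,000.

```python
import random
from collections import Counter

def undersample(labels, per_class, seed=0):
    """Return sorted row indices, keeping at most `per_class` random rows per label."""
    rng = random.Random(seed)
    by_label = {}
    for i, label in enumerate(labels):
        by_label.setdefault(label, []).append(i)
    kept = []
    for idxs in by_label.values():
        # Sample without replacement; small classes are kept whole.
        kept.extend(rng.sample(idxs, min(per_class, len(idxs))))
    return sorted(kept)

# Label column rebuilt from the class counts given in the question.
counts = {"A": 60000, "B": 50000, "C": 36000, "D": 4000}
labels = [label for label, n in counts.items() for _ in range(n)]

kept = undersample(labels, per_class=4000)
balanced = Counter(labels[i] for i in kept)
discarded = len(labels) - len(kept)
print(balanced, discarded)
```

Any resampling like this would normally be applied to the training split only, so that the evaluation data keeps the original class distribution.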

I think you want to oversample your class D. The technique is called Synthetic Minority Oversampling Technique, or SMOTE.

One way to solve this problem is to oversample the examples in the minority class. This can be achieved by simply duplicating examples from the minority class in the training dataset prior to fitting a model. This can balance the class distribution but does not provide any additional information to the model.
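The duplication approach from the quote can be sketched in a few lines of standard-library Python. The function name `oversample_by_duplication` and the toy rows are hypothetical. Every class is topped up to the size of the largest one by drawing existing rows with replacement.

```python
import random
from collections import Counter

def oversample_by_duplication(rows, labels, seed=0):
    """Duplicate randomly chosen rows until every class matches the largest class."""
    rng = random.Random(seed)
    by_label = {}
    for i, label in enumerate(labels):
        by_label.setdefault(label, []).append(i)
    target = max(len(idxs) for idxs in by_label.values())
    out_rows, out_labels = list(rows), list(labels)
    for label, idxs in by_label.items():
        # Draw with replacement; k is 0 for the largest class, so nothing is added.
        extra = rng.choices(idxs, k=target - len(idxs))
        out_rows.extend(rows[i] for i in extra)
        out_labels.extend([label] * len(extra))
    return out_rows, out_labels

# Toy training data (hypothetical): four "A" rows and a single "D" row.
rows = [[0.0], [1.0], [2.0], [3.0], [10.0]]
labels = ["A", "A", "A", "A", "D"]
new_rows, new_labels = oversample_by_duplication(rows, labels)
print(Counter(new_labels))
```

Every added D row is an exact copy of an existing one, which is why this balances the counts without giving the model any new information.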

An improvement on duplicating examples from the minority class is to synthesize new examples from the minority class. This is a type of data augmentation for tabular data and can be very effective.
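The core idea of synthesizing examples can be shown with a simplified sketch: pick a minority point, pick one of its k nearest minority neighbours, and create a new point somewhere on the line segment between them. This is not a reference implementation of SMOTE. The name `smote_sketch`, the brute-force neighbour search, and the toy points are assumptions made for illustration, and it only handles numeric features.

```python
import math
import random

def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority points to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Brute-force k nearest minority neighbours of `base` (Euclidean distance).
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Toy minority class (hypothetical): the four corners of the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_sketch(minority, n_new=10, k=2)
print(new_points[:3])
```

In practice a library implementation is the usual choice, for example `SMOTE` from the imbalanced-learn package, fitted on the training data only.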

Source: https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/