Undersampling for Imbalanced Class in Python

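I currently have an imbalanced dataset of over 800,000 data points. The imbalance is severe: one of the two classes has only 3719 data points. After undersampling the data with the NearMiss algorithm in Python and applying a random forest classifier, I get the following results:

Accuracy: 81.4%
Precision: 82.6%
Recall: 79.4%
Specificity: 83.4%

However, when I retest the same model on the full dataset, the confusion matrix shows a strong bias towards the minority class, with a large number of false positives. Is this the right way to test a model after undersampling?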
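The jump in false positives is roughly what the class ratio alone would predict. Here is a back-of-the-envelope sketch in plain Python. It rests on two assumptions that are not in the post: that the dataset has exactly 800,000 rows (the post only says "over 800,000"), and that the reported recall and specificity carry over unchanged to the full data. `expected_counts` is a hypothetical helper written for this illustration.

```python
# Figures reported in the question.
recall = 0.794
specificity = 0.834
n_minority = 3_719

# Assumption: the post says "over 800,000", so this is a lower bound.
n_total = 800_000
n_majority = n_total - n_minority


def expected_counts(n_pos, n_neg):
    """Expected true positives, false positives and precision
    for a classifier with the recall and specificity above."""
    tp = recall * n_pos
    fp = (1 - specificity) * n_neg
    return tp, fp, tp / (tp + fp)


# Balanced sample, as after undersampling: equal class sizes.
tp_bal, fp_bal, precision_bal = expected_counts(n_minority, n_minority)

# Full dataset: same classifier, original class ratio.
tp_full, fp_full, precision_full = expected_counts(n_minority, n_majority)

print(f"balanced sample: precision ~ {precision_bal:.3f}")
print(f"full dataset: ~{fp_full:,.0f} false positives, "
      f"precision ~ {precision_full:.3f}")
```

Under these assumptions the balanced-sample precision comes out close to the reported 82.6%, while on the full data the same error rates give well over 100,000 false positives and a precision of only a few percent. The classifier has not changed; the much larger majority class simply turns a 16.6% false positive rate into a very large absolute count.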

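First of all, undersampling from 800k records to 4k is probably a big loss of your domain knowledge. Most of the time you oversample first and then undersample. There is a dedicated package for this: imblearn. As for validation: you don't want to score resampled records, as it will distort the results. Look closer at the scoring params in sklearn, namely micro, macro, and weighted; they are covered in the sklearn docs. There are also some metrics specific to this kind of problem.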
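To make the validation advice concrete, here is a minimal sketch that splits first, resamples only the training split, and scores on untouched records. Several things in it are assumptions for illustration: the data is synthetic (`make_classification`), the minority share of about 1% is arbitrary, and plain random undersampling with NumPy stands in for NearMiss to keep the example dependency-light. An imblearn sampler could be applied to the training split at the same step through its `fit_resample` method.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption): ~1% minority class.
X, y = make_classification(
    n_samples=20_000, n_features=10, weights=[0.99, 0.01], random_state=0
)

# 1. Split first, so the test set keeps the original class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# 2. Randomly undersample the majority class in the training split only.
rng = np.random.default_rng(0)
minority_idx = np.flatnonzero(y_train == 1)
majority_idx = np.flatnonzero(y_train == 0)
kept_majority = rng.choice(majority_idx, size=minority_idx.size, replace=False)
train_idx = np.concatenate([minority_idx, kept_majority])
X_bal, y_bal = X_train[train_idx], y_train[train_idx]

# 3. Fit on the balanced training data.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_bal, y_bal)

# 4. Score on the untouched test split, never on resampled records.
y_pred = clf.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
scores = {
    avg: f1_score(y_test, y_pred, average=avg)
    for avg in ("micro", "macro", "weighted")
}

print(cm)
print(scores)
```

Comparing the three averages is informative on data like this: micro and weighted F1 are dominated by the majority class, while macro F1 gives both classes equal weight and so reacts more strongly to poor minority-class performance.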

python

machine-learning

downsampling