How to use a combination of over- and undersampling with imbalanced-learn?

Question

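I want to resample a large dataset (class sizes: 8 million vs. 2,700). I would like to end up with 50,000 samples by oversampling class 2 and undersampling class 1. imblearn seems to offer a combination of over- and undersampling, but I don't understand how it works.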

from collections import Counter
from imblearn.over_sampling import SMOTENC
from imblearn.under_sampling import TomekLinks
from imblearn.combine import SMOTETomek

smt = SMOTETomek(random_state=1)
X_resamp, y_resamp = smt.fit_resample(data_all[29000:30000], labels_all[29000:30000])

Before resampling, the data looks like this:

>>Counter(labels_all[29000:30000])
>>Counter({0: 968, 9: 32})

And afterwards:

>>Counter(y_resamp)
>>Counter({0: 968, 9: 968})

What I expected, or hoped for, was:

>>Counter(y_resamp)
>>Counter({0: 100, 9: 100})

Answer 1

You seem to have only 32 records of class 9, so it oversampled that class until its record count matched that of class 0, hence 9: 968.

As for reducing the dataset to 100 records: you can either randomly draw 100 records per class from X and y (using the same 100 indices for both), or simply take the first 100, as in y_resamp[:100].
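The random draw suggested above could be sketched as follows. This is an illustration under assumptions: `X_resamp` and `y_resamp` are replaced by synthetic arrays with 968 rows per class, and `n_per_class` and `keep` are names introduced only for this example.

```python
from collections import Counter

import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for the resampled output in the question (968 rows per class).
X_resamp = rng.normal(size=(1936, 4))
y_resamp = np.array([0] * 968 + [9] * 968)

n_per_class = 100

# For every class, pick n_per_class row indices without replacement.
keep = np.concatenate([
    rng.choice(np.flatnonzero(y_resamp == label), size=n_per_class, replace=False)
    for label in np.unique(y_resamp)
])

# Index X and y with the same indices so features and labels stay paired.
X_small, y_small = X_resamp[keep], y_resamp[keep]

print(Counter(y_small.tolist()))
```

A plain slice such as y_resamp[:100] does not guarantee a balanced result, because it depends on the row order of the resampled arrays; drawing per class avoids that.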

python

machine-learning

oversampling

imblearn

imbalanced-data