How to implement ratio-based SMOTE oversampling while CV-ing dataset

Question

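I am working with a very imbalanced dataset (~5%) on a binary classification problem. I am pipelining SMOTE and a random forest classifier so that the oversampling happens inside the GridSearch CV loop (as suggested here). You can see my implementation below: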

from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from imblearn.pipeline import Pipeline
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

sm = SMOTE()
rf = RandomForestClassifier()

pipeline = Pipeline([('sm', sm), ('rf', rf)])

kf = StratifiedKFold(n_splits = 5)

params = {'rf__max_depth' : list(range(2,5)),
    'rf__max_features' : ['auto','sqrt'],
    'rf__bootstrap' : [True, False]
}

grid = RandomizedSearchCV(pipeline, param_distributions = params, scoring = 'f1', cv = kf)

grid.fit(X, y)

However, this paper (see Table 4, page 7) suggests testing different resampling ratios to find out which one gives better performance. Right now, with my sm = SMOTE(), I'm generating a 50-50% dataset, but I would like to loop over a list of potential ratios (e.g. 5-95, 10-90, etc.). The ratio parameter in SMOTE does not accept a desired percentage, though. It takes a specific integer number of samples, and I don't think I can provide that because of my k-fold CV (each fold may have a slightly different sample size). How can I implement this?

Answer 1

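Although the documentation does not mention it, I think you can pass a float as ratio. Be aware, though, that this is deprecated and will be removed in a future version (I think because it only works for the binary case and not for multiclass).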

params = {'sm__ratio' : [0.05, 0.10, 0.15],
          'rf__max_depth' : list(range(2,5)),
          'rf__max_features' : ['auto','sqrt'],
          'rf__bootstrap' : [True, False]
         }

grid = RandomizedSearchCV(pipeline, param_distributions = params, scoring = 'f1', cv = kf)

Also note that the ratio you specify here is the ratio between the classes after the minority class has been upsampled.

Suppose you have the following original classes:

  1:  75
  0:  25

and you specify the ratio as 0.5. The majority class is left untouched, but 12 synthetic samples of class 0 are generated, so the final counts are:

  1:  75
  0:  37  (25 + 12)

The final ratio is 37 / 75 ≈ 0.49, i.e. roughly 0.5 (as you specified).
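The arithmetic in this example can be checked with a few lines of plain Python. This is only a sketch: the helper synthetic_sample_count is hypothetical (it is not part of imbalanced-learn), and it assumes the number of new samples is ratio × n_majority − n_minority truncated to an integer, which matches the counts shown above.

```python
def synthetic_sample_count(n_majority: int, n_minority: int, ratio: float) -> int:
    """Hypothetical helper: synthetic minority samples needed to reach `ratio`.

    Assumes the count is ratio * n_majority - n_minority, truncated
    to an integer (an assumption consistent with the example above).
    """
    n_new = int(ratio * n_majority - n_minority)
    if n_new <= 0:
        # The requested ratio is not above the ratio already in the data,
        # so oversampling alone cannot reach it.
        raise ValueError("ratio must exceed the current minority/majority ratio")
    return n_new


n_new = synthetic_sample_count(n_majority=75, n_minority=25, ratio=0.5)
final_minority = 25 + n_new
final_ratio = round(final_minority / 75, 3)

print(n_new, final_minority, final_ratio)  # prints: 12 37 0.493
```

The truncation is why the final ratio lands slightly below the requested 0.5 rather than exactly on it.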

python

scikit-learn

cross-validation