Combine resampling and specific algorithms for class imbalance

Question

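I am working on a multi-label text classification problem (90 target labels in total). The data distribution has a long tail, and there are about 1900k records. Currently, I am working on a small sample of around 100k records with a similar target distribution.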

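Some algorithms provide functionality to handle class imbalance, e.g. PAC and LinearSVC. Currently, I am also using SMOTE to generate samples for all classes except the majority one, and RandomUnderSampler to suppress the imbalance coming from the majority class.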

Is it right to use both the algorithm parameter & imblearn pipelines at the same time to handle class imbalance?

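from sklearn.pipeline import FeatureUnion
from sklearn.linear_model import PassiveAggressiveClassifier, LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import StackingClassifier
from sklearn.multiclass import OneVsRestClassifier
from imblearn.pipeline import Pipeline as imblearnPipeline
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# text_pipeline is defined elsewhere (not shown in the question)
feat_pipeline = FeatureUnion([('text', text_pipeline)])

estimators_list = [
    ('PAC', PassiveAggressiveClassifier(max_iter=5000, random_state=0, class_weight='balanced')),
    ('linearSVC', LinearSVC(class_weight='balanced')),
]
estimators_ensemble = StackingClassifier(
    estimators=estimators_list,
    final_estimator=LogisticRegression(solver='lbfgs', max_iter=5000),
)
ovr_ensemble = OneVsRestClassifier(estimators_ensemble)

classifier_pipeline = imblearnPipeline([
    ('features', feat_pipeline),
    ('over_sampling', SMOTE(sampling_strategy='auto')),  # resample all classes but the majority class
    ('under_sampling', RandomUnderSampler(sampling_strategy='auto')),  # resample all classes but the minority class
    ('ovr_ensemble', ovr_ensemble),
])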

Answer 1

Is it right to use both the algorithm parameter & imblearn pipelines at the same time to handle class imbalance?

Let us take a moment to think about what this could mean, and whether it actually makes sense.

Specific algorithms (or algorithm settings) for handling class imbalance naturally expect some actual imbalance in the data.

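Now, if you have already artificially balanced your data (with SMOTE, majority class undersampling etc.), what your algorithms will face at the end of the day is a balanced dataset, not an imbalanced one. Needless to say, these algorithms have no way of "knowing" that the balance in the final data they see is artificial; so, from their point of view, there is no imbalance, hence no need for any special recipe to kick in.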
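To make the point about artificial balancing concrete, here is a minimal sketch of random undersampling in plain Python. The helper `random_undersample` and the toy single-label data are hypothetical, introduced only for illustration; this is not imblearn's `RandomUnderSampler`, just the basic idea behind it.

```python
import random
from collections import Counter


def random_undersample(labels, seed=0):
    """Return sorted indices that keep an equal number of samples per class.

    Every class is randomly cut down to the size of the rarest class
    (hypothetical helper, illustration only).
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    target = min(len(idxs) for idxs in by_class.values())
    kept = []
    for idxs in by_class.values():
        kept.extend(rng.sample(idxs, target))
    return sorted(kept)


# Toy data: 90 samples of class "a", 10 samples of class "b".
labels = ["a"] * 90 + ["b"] * 10
kept = random_undersample(labels)
resampled = [labels[i] for i in kept]

print(Counter(labels))     # class counts before resampling
print(Counter(resampled))  # class counts after resampling: equal per class
```

Whatever estimator is fitted on `resampled` only ever sees equal class counts; nothing in the data tells it that the original distribution was 90/10.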

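So, it is not wrong to do so, but in such a case these specific algorithms/settings will not actually be useful, in the sense that they will not offer anything extra regarding the handling of class imbalance.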
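A small sketch of why the setting becomes a no-op: the scikit-learn documentation describes `class_weight='balanced'` as weighting each class by `n_samples / (n_classes * count)`. The function `balanced_class_weights` below is a hypothetical plain-Python re-implementation of that formula (not a scikit-learn call), applied to toy single-label data before and after balancing.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Per-class weight: n_samples / (n_classes * class_count).

    Hypothetical helper mirroring the documented 'balanced' heuristic.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


imbalanced = [0] * 90 + [1] * 10   # what the raw data looks like
rebalanced = [0] * 50 + [1] * 50   # what the estimator sees after resampling

w_imb = balanced_class_weights(imbalanced)
w_bal = balanced_class_weights(rebalanced)

print(w_imb)  # the minority class gets a much larger weight
print(w_bal)  # every class gets weight 1.0, i.e. no reweighting at all
```

On the resampled data every weight is 1.0, so the fit is the same as it would be without `class_weight='balanced'`. In practice this suggests picking one mechanism, resampling or class weights, and comparing the two by validation score rather than stacking them.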

Quoting from elsewhere (a completely different issue, but the general idea carries over):

The field of deep neural nets is still (very) young, and it is true that it has yet to establish its "best practice" guidelines; add the fact that, thanks to an amazing community, there are all sort of tools available in open source implementations, and you can easily find yourself into the (admittedly tempting) position of mixing things up just because they happen to be available. I am not necessarily saying that this is what you are attempting to do here - I am just urging for more caution when combining ideas that may have not been designed to work along together...

machine-learning

scikit-learn

multilabel-classification

imblearn

imbalanced-data