
High AUC but bad predictions with imbalanced data

I am trying to build a classifier with LightGBM on a very imbalanced dataset. The imbalance ratio is 97:3, i.e.:

Class

0    0.970691
1    0.029309

The parameters and the training code I used are as follows:

import lightgbm as lgb
import numpy as np

lgb_params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'auc',
    'learning_rate': 0.1,
    'is_unbalance': 'true',  # because training data is unbalanced (replaced with scale_pos_weight)
    'num_leaves': 31,  # should be smaller than 2^(max_depth)
    'max_depth': 6,  # -1 means no limit
    'subsample': 0.78
}

# Cross-validate
cv_results = lgb.cv(lgb_params, dtrain, num_boost_round=1500, nfold=10, 
                    verbose_eval=10, early_stopping_rounds=40)

# index of the best mean AUC (+1 because the list is 0-indexed)
nround = cv_results['auc-mean'].index(np.max(cv_results['auc-mean'])) + 1
print(nround)

model = lgb.train(lgb_params, dtrain, num_boost_round=nround)


preds = model.predict(test_feats)

preds = [1 if x >= 0.5 else 0 for x in preds]

I run CV to get the best model and the best round. I got 0.994 AUC on CV and a similar score on the validation set.

But when I predict on the test set I get really bad results. I am sure that the train set is sampled perfectly.

Which parameters need to be tuned? What is the reason for the problem? Should I resample the dataset to reduce the majority class?

The problem is that, despite the extreme class imbalance in your dataset, you are still using the "default" threshold of 0.5 when deciding the final hard classification in

preds = [1 if x >= 0.5 else 0 for x in preds]

which should not be the case here.
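
Just to illustrate the effect (the actual cut-off must be chosen with a proper procedure, as discussed below), here is a sketch of what a non-default threshold looks like; the value 0.03 (roughly the positive-class prior) is only an example, not a recommendation:

preds_proba = model.predict(test_feats)  # these are probabilities, not hard labels

# example only: a cut-off near the positive-class prior (~0.03 here) will
# flag far more samples as positive than the default 0.5 does
preds_low = [1 if x >= 0.03 else 0 for x in preds_proba]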

This is a rather big topic, and I strongly suggest you do your own research (try googling for threshold or cut-off or probabilities with imbalanced data), but here are some pointers to get you started...

From a relevant answer at Cross Validated (emphasis added):

Don't forget that you should be thresholding intelligently to make predictions. It is not always best to predict 1 when the model probability is greater than 0.5. Another threshold may be better. To this end you should look into the Receiver Operating Characteristic (ROC) curves of your classifier, not just its predictive success with a default probability threshold.
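
As a minimal sketch of what "looking into the ROC curve" can mean in practice (y_val and proba_val are hypothetical stand-ins for your held-out labels and predicted probabilities):

import numpy as np
from sklearn.metrics import roc_curve

# fpr/tpr/thresholds trace the whole ROC curve of the classifier
fpr, tpr, thresholds = roc_curve(y_val, proba_val)

# one common heuristic: take the threshold maximizing Youden's J = TPR - FPR
best_threshold = thresholds[np.argmax(tpr - fpr)]
print(best_threshold)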

From a relevant academic paper, Finding the Best Classification Threshold in Imbalanced Classification:

2.2. How to set the classification threshold for the testing set

Prediction results are ultimately determined according to prediction probabilities. The threshold is typically set to 0.5. If the prediction probability exceeds 0.5, the sample is predicted to be positive; otherwise, negative. However, 0.5 is not ideal for some cases, particularly for imbalanced datasets.
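
One common recipe in the spirit of the paper's point: choose the threshold on a held-out validation set (here by maximizing F1, just one possible criterion) and then apply it unchanged to the test set; y_val, proba_val, and proba_test are hypothetical names, and the probability arrays are assumed to be NumPy arrays:

import numpy as np
from sklearn.metrics import f1_score

# sweep candidate thresholds on the validation set
candidates = np.linspace(0.01, 0.99, 99)
f1_scores = [f1_score(y_val, proba_val >= t) for t in candidates]
best_t = candidates[int(np.argmax(f1_scores))]

# apply the chosen threshold, unchanged, to the test predictions
test_labels = (proba_test >= best_t).astype(int)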

The post Optimizing Probability Thresholds for Class Imbalances from the (highly recommended) Applied Predictive Modeling blog is also relevant.

Take-home lesson from all the above: AUC is seldom enough, but the ROC curve itself is often your best friend...


On a more general level regarding the role of the threshold itself in the classification process (which, according to my experience at least, many practitioners get wrong), check also the Classification probability threshold thread at Cross Validated (and the provided links); key point:

the statistical component of your exercise ends when you output a probability for each class of your new sample. Choosing a threshold beyond which you classify a new observation as 1 vs. 0 is not part of the statistics any more. It is part of the decision component.