How to reduce false positives in xgboost?

Question

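My dataset is evenly split between classes 0 and 1: 100,000 data points in total, of which 50,000 are labeled 0 and the other 50,000 are labeled 1. I did an 80/20 train/test split and got an accuracy score of 98%. However, when looking at the confusion matrix, I found a lot of false positives. I am new to xgboost and decision trees in general. Which settings can I change in XGBClassifier to reduce the number of false positives, or is that even possible? Thank you.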

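import xgboost as xgb
from sklearn.metrics import plot_confusion_matrix  # removed in scikit-learn 1.2; newer versions use ConfusionMatrixDisplay.from_estimator
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=0, stratify=y) # 80% training and 20% test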

model = xgb.XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.1, max_delta_step=0, max_depth=9,
              min_child_weight=1, missing=None, monotone_constraints='()',
              n_estimators=180, n_jobs=4, num_parallel_tree=1, random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', use_label_encoder=False,
              validate_parameters=1, verbosity=None)

model.fit(X_train,
           y_train,
           verbose = True, 
           early_stopping_rounds=10,
           eval_metric = "aucpr",
           eval_set = [(X_test, y_test)])

plot_confusion_matrix(model,
                      X_test,
                      y_test,
                      values_format='d',
                      display_labels=['Old Forests', 'Not Old Forests'])

Answer 1

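Yes. If you are looking for a simple fix, you can lower the value of scale_pos_weight. This will reduce the false positive rate even though your dataset is balanced, because a value below 1 down-weights the positive class (label 1) during training, so the model predicts 1 less readily. The usual cost is more false negatives.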

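For a more robust fix, you will need to run a hyperparameter tuning search. In particular, try different values of scale_pos_weight, alpha, lambda, gamma, and min_child_weight (alpha and lambda are exposed as reg_alpha and reg_lambda in XGBClassifier), since these have the biggest effect on how conservative the model is.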

python-3.x

scikit-learn

xgboost