How to deal with unbalanced xgboost multiclass classification within a scikit-learn pipeline?

Question

I am using XGBClassifier to model an imbalanced multiclass target. I have a few questions:

First, I would like to know where I should pass the weight parameter: when instantiating the classifier, or in the fit step of the pipeline?

Second, how do I calculate the weights? I assume that the sum of the array should be 1.

Third, is there any order in the weight array that maps to the different label classes?

Thanks in advance, everyone.

Answer 1

First question:

where should I use the parameter weight

Use sample_weight in XGBClassifier.fit():

import xgboost as xgb

xgb_clf = xgb.XGBClassifier()
# sample_weight holds one weight per row of X
xgb_clf.fit(X, y, sample_weight=sample_weight)

When using a pipeline:

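import xgboost as xgb
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ('my_xgb_clf', xgb.XGBClassifier()),
])
# Prefix the fit kwarg with the step name and a double underscore
pipe.fit(X, y, my_xgb_clf__sample_weight=sample_weight)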
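
The step-name prefix is what routes the keyword to the right step. Here is a minimal sketch of that routing, assuming a toy dataset from make_classification and using scikit-learn's GradientBoostingClassifier as a stand-in so that it runs without xgboost installed. The step names 'scale' and 'clf' are made up for illustration; swapping in xgb.XGBClassifier() should work the same way.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils.class_weight import compute_sample_weight

# Toy imbalanced 3-class data (assumption: stands in for the real X, y).
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3,
    weights=[0.7, 0.2, 0.1], random_state=0,
)

# One weight per row, larger for rows of the rarer classes.
sample_weight = compute_sample_weight("balanced", y)

pipe = Pipeline([
    ("scale", StandardScaler()),
    # Stand-in for xgb.XGBClassifier(); both accept sample_weight in fit().
    ("clf", GradientBoostingClassifier(n_estimators=20, random_state=0)),
])

# "<step name>__<fit kwarg>" is forwarded only to that step's fit().
pipe.fit(X, y, clf__sample_weight=sample_weight)
pred = pipe.predict(X)
```

Only the 'clf' step receives the weights here; the scaler is fitted without them.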

By the way, some sklearn APIs do not accept a sample_weight kwarg, for example learning_curve.

So I just do this:

import functools
xgb_clf.fit = functools.partial(xgb_clf.fit, sample_weight=sample_weight)

Note: you need to patch fit() again after a grid search, because GridSearchCV.best_estimator_ is not the original estimator object (it is a refitted clone).
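
The reason behind the note can be seen with sklearn.base.clone, which rebuilds an estimator from its constructor parameters only. The sketch below is illustrative: it uses DecisionTreeClassifier as a stand-in for the XGBoost classifier and a four-row toy dataset, both of which are assumptions rather than part of the original answer.

```python
import functools

import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier

# Tiny toy data (assumption, for illustration only).
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
sample_weight = np.array([1.0, 1.0, 2.0, 2.0])

est = DecisionTreeClassifier(random_state=0)  # stand-in estimator

# Patch the instance so every fit(X, y) call also passes sample_weight.
est.fit = functools.partial(est.fit, sample_weight=sample_weight)
est.fit(X, y)

# clone() copies constructor parameters only, so the patch is lost.
cloned = clone(est)
patched_original = "fit" in vars(est)
patched_clone = "fit" in vars(cloned)

# Re-apply the patch on the new object before using it further.
cloned.fit = functools.partial(cloned.fit, sample_weight=sample_weight)
cloned.fit(X, y)
```

The same applies to any utility that clones the estimator internally: the copy it fits will not carry an instance-level patch.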

Second question:

how I calculate a weights. I assume that the sum of the array should be 1.

from sklearn.utils import compute_sample_weight
sample_weight = compute_sample_weight('balanced', y_train)

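This mimics class_weight='balanced' in sklearn.

Notes:

The sum of the array is not 1 (with 'balanced' the weights sum to the number of samples). You can normalize it, but I think the scores would come out different.
This is not equivalent to class_weight='balanced_subsample'; I could not find a way to mimic that.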
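
To make the 'balanced' heuristic concrete, here is a small sketch that recomputes the weights by hand as n_samples / (n_classes * class_count) and compares them with compute_sample_weight. The ten-label y_train array is a made-up example.

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

# Made-up labels: 6 samples of class 0, 3 of class 1, 1 of class 2.
y_train = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 2])

# Per-class weight: n_samples / (n_classes * count of that class).
classes, counts = np.unique(y_train, return_counts=True)
class_weight = len(y_train) / (len(classes) * counts)

# Expand to one weight per sample by looking up each label's class.
manual = class_weight[np.searchsorted(classes, y_train)]

auto = compute_sample_weight("balanced", y_train)

# Optional: rescale so the weights sum to 1.
normalized = auto / auto.sum()
```

Here manual and auto agree, and auto.sum() equals the number of samples (10) rather than 1.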

Third question:

Is there any order...

Sorry, I don't understand what you mean...

Maybe you want the order in xgb_clf.classes_? You can access it after calling xgb_clf.fit. Or just use np.unique(y_train).
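
One way to read the ordering question: sample_weight has one entry per row, aligned with y, so no class order is involved at all. If you start from per-class weights, you can expand them yourself. The sketch below uses made-up labels and a hypothetical class_weight dict.

```python
import numpy as np

# Made-up integer labels, one per training row.
y_train = np.array([2, 0, 2, 1, 2, 0])

# Hypothetical per-class weights, keyed by label so order does not matter.
class_weight = {0: 1.5, 1: 3.0, 2: 1.0}

# One weight per row, in the same order as y_train.
sample_weight = np.array([class_weight[label] for label in y_train])

# np.unique returns the distinct labels in sorted order.
classes = np.unique(y_train)
```

Because the lookup is keyed by label, the result does not depend on how the classes are ordered.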

python

scikit-learn

xgboost