Why calling fit resets custom objective function in XGBClassifier?

I have tried to set up the XGBoost sklearn API XGBClassifier to use a custom objective function (brier), following the documentation:

    .. note::  Custom objective function

        A custom objective function can be provided for the ``objective``
        parameter. In this case, it should have the signature
        ``objective(y_true, y_pred) -> grad, hess``:

        y_true: array_like of shape [n_samples]
            The target values
        y_pred: array_like of shape [n_samples]
            The predicted values

        grad: array_like of shape [n_samples]
            The value of the gradient for each sample point.
        hess: array_like of shape [n_samples]
            The value of the second derivative for each sample point

Here is my attempt:

import numpy as np
from xgboost import XGBClassifier
from sklearn.datasets import load_svmlight_file
train_data = load_svmlight_file('~/agaricus.txt.train')
X = train_data[0].toarray()
y = train_data[1]

def brier(y_true, y_pred):
    # Brier score on the probability scale: loss = (sigmoid(margin) - y_true)^2
    y_pred = 1.0 / (1.0 + np.exp(-y_pred))  # raw margin -> probability
    grad = 2 * y_pred * (y_true - y_pred) * (y_pred - 1)  # d loss / d margin
    # second derivative; note the leading factor is y_pred * (1 - y_pred), not y_pred ** (1 - y_pred)
    hess = 2 * y_pred * (1 - y_pred) * (2 * y_pred * (y_true + 1) - y_true - 3 * y_pred ** 2)
    return grad, hess

m = XGBClassifier(objective=brier, seed=42)

It seems to produce the right object:

XGBClassifier(base_score=None, booster=None, colsample_bylevel=None,
              colsample_bynode=None, colsample_bytree=None, gamma=None,
              gpu_id=None, importance_type='gain', interaction_constraints=None,
              learning_rate=None, max_delta_step=None, max_depth=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              n_estimators=100, n_jobs=None, num_parallel_tree=None,
              objective=<function brier at 0x7fe7ac418290>, random_state=None,
              reg_alpha=None, reg_lambda=None, scale_pos_weight=None, seed=42,
              subsample=None, tree_method=None, validate_parameters=False,
              verbosity=None)

However, calling the .fit method seems to reset the m object to its default setup:

m.fit(X, y)
m
XGBClassifier(base_score=0.5, booster=None, colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints=None,
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints=None,
              n_estimators=100, n_jobs=0, num_parallel_tree=1,
              objective='binary:logistic', random_state=42, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=1, seed=42, subsample=1,
              tree_method=None, validate_parameters=False, verbosity=None)

Note objective='binary:logistic'. I noticed this while investigating why optimizing directly for brier gives me worse scores than the default binary:logistic does, as described here.

So, how do I set up XGBClassifier correctly so that it uses my function brier as the custom objective?

I believe you are mistaking objective for the objective function (obj as a parameter); the xgboost documentation is sometimes confusing.

In short, for your problem, you just need to change this:

m = XGBClassifier(obj=brier, seed=42)

Digging a little deeper: objective controls how xgboost optimizes given an objective function. Usually xgboost infers which objective to optimize from the number of classes in your y vector.
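
To see that inference in action, here is a quick illustration of mine (not from the original answer): fit on a three-class target and the stored objective flips to multi:softprob.

import numpy as np
from xgboost import XGBClassifier

X3 = np.random.rand(30, 4)
y3 = np.array([0, 1, 2] * 10)  # three classes in y

m3 = XGBClassifier(seed=42)
m3.fit(X3, y3)
print(m3.objective)  # -> 'multi:softprob', because the fitted y has more than two classes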

Here is a snippet I took from the source code. As you can see, whenever you have only two classes, objective is set to binary:logistic:

class XGBClassifier(XGBModel, XGBClassifierBase):
    def __init__(self, objective="binary:logistic", **kwargs):
        super().__init__(objective=objective, **kwargs)

    def fit(self, X, y, sample_weight=None, base_margin=None,
            eval_set=None, eval_metric=None,
            early_stopping_rounds=None, verbose=True, xgb_model=None,
            sample_weight_eval_set=None, callbacks=None):

        evals_result = {}
        self.classes_ = np.unique(y)
        self.n_classes_ = len(self.classes_)

        xgb_options = self.get_xgb_params() # <-- obj function is set here

        if callable(self.objective):
            obj = _objective_decorator(self.objective) # <----- here is the mismatch of the names: if you pass your brier func as objective, it will become "binary:logistic"
            xgb_options["objective"] = "binary:logistic"
        else:
            obj = None

        if self.n_classes_ > 2:
            xgb_options['objective'] = 'multi:softprob' # <----- objective is being set here if n_classes_ > 2
            xgb_options['num_class'] = self.n_classes_

+-- 35 lines: feval = eval_metric if callable(eval_metric) else None-----------------------------------------------------------------------------------------------------------------------------------------------------

        self._Booster = train(xgb_options, train_dmatrix, # <----- objective is being passed in xgb_options dictionary
                              self.get_num_boosting_rounds(),
                              evals=evals,
                              early_stopping_rounds=early_stopping_rounds,
                              evals_result=evals_result, obj=obj, feval=feval, # <----- obj function is being passed to lower level api here
                              verbose_eval=verbose, xgb_model=xgb_model,
                              callbacks=callbacks)

+-- 12 lines: self.objective = xgb_options["objective"]------------------------------------------------------------------------------------------------------------------------------------------------------------------

        return self
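
Since the snippet shows the decorated function being handed to the lower-level api, you can also sidestep the name clash entirely by calling train yourself. A minimal sketch of mine (reusing X, y and np from the question; note that at this level the custom objective receives (preds, DMatrix), not (y_true, y_pred)):

import numpy as np
import xgboost as xgb

dtrain = xgb.DMatrix(X, label=y)

def brier_lowlevel(preds, dtrain):
    # lower-level signature: raw margins plus the DMatrix holding the labels
    labels = dtrain.get_label()
    preds = 1.0 / (1.0 + np.exp(-preds))  # raw margin -> probability
    grad = 2 * preds * (1 - preds) * (preds - labels)
    hess = 2 * preds * (1 - preds) * (2 * preds * (labels + 1) - labels - 3 * preds ** 2)
    return grad, hess

booster = xgb.train({'seed': 42}, dtrain, num_boost_round=100, obj=brier_lowlevel)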

There is a fixed list of objectives that the objective parameter can be set to:

objective [default=reg:squarederror]

reg:squarederror: regression with squared loss.

reg:squaredlogerror: regression with squared log loss 1/2 * [log(pred + 1) - log(label + 1)]^2. All input labels are required to be greater than -1. Also, see metric rmsle for possible issue with this objective.

reg:logistic: logistic regression

binary:logistic: logistic regression for binary classification, output probability

binary:logitraw: logistic regression for binary classification, output score before logistic transformation

binary:hinge: hinge loss for binary classification. This makes predictions of 0 or 1, rather than producing probabilities.

count:poisson: poisson regression for count data, output mean of poisson distribution

max_delta_step is set to 0.7 by default in poisson regression (used to safeguard optimization)

survival:cox: Cox regression for right censored survival time data (negative values are considered right censored). Note that predictions are returned on the hazard ratio scale (i.e., as HR = exp(marginal_prediction) in the proportional hazard function h(t) = h0(t) * HR).

multi:softmax: set XGBoost to do multiclass classification using the softmax objective, you also need to set num_class(number of classes)

multi:softprob: same as softmax, but output a vector of ndata * nclass, which can be further reshaped to ndata * nclass matrix. The result contains predicted probability of each data point belonging to each class.

rank:pairwise: Use LambdaMART to perform pairwise ranking where the pairwise loss is minimized

rank:ndcg: Use LambdaMART to perform list-wise ranking where Normalized Discounted Cumulative Gain (NDCG) is maximized

rank:map: Use LambdaMART to perform list-wise ranking where Mean Average Precision (MAP) is maximized

reg:gamma: gamma regression with log-link. Output is a mean of gamma distribution. It might be useful, e.g., for modeling insurance claims severity, or for any outcome that might be gamma-distributed.

reg:tweedie: Tweedie regression with log-link. It might be useful, e.g., for modeling total loss in insurance, or for any outcome that might be Tweedie-distributed.
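
For comparison, any string from this list passes through fit untouched; a trivial check of mine, reusing X and y from the question:

from xgboost import XGBClassifier

m = XGBClassifier(objective='binary:logitraw', seed=42)
m.fit(X, y)
print(m.objective)  # still 'binary:logitraw' after fit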

Just to confirm that objective cannot be your brier function: manually setting objective to the brier function in the source code, right before the lower-level api is called,

class XGBClassifier(XGBModel, XGBClassifierBase):
    def __init__(self, objective="binary:logistic", **kwargs):
        super().__init__(objective=objective, **kwargs)

    def fit(self, X, y, sample_weight=None, base_margin=None,
            eval_set=None, eval_metric=None,
            early_stopping_rounds=None, verbose=True, xgb_model=None,
            sample_weight_eval_set=None, callbacks=None):

+-- 54 lines: evals_result = {}--------------------------------------------------------------------
        xgb_options["objective"] = xgb_options["obj"]
        self._Booster = train(xgb_options, train_dmatrix,
                              self.get_num_boosting_rounds(),
                              evals=evals,
                              early_stopping_rounds=early_stopping_rounds,
                              evals_result=evals_result, obj=obj, feval=feval,
                              verbose_eval=verbose, xgb_model=xgb_model,
                              callbacks=callbacks)

+-- 14 lines: self.objective = xgb_options["objective"]--------------------------------------------

raises this error:

    raise XGBoostError(py_str(_LIB.XGBGetLastError()))
xgboost.core.XGBoostError: [10:09:53] /private/var/folders/z5/mchb9bz51cx3h97nkw9v0wkr0000gn/T/pip-install-kh801rm0/xgboost/xgboost/src/objective/objective.cc:26: Unknown objective function: `<function brier at 0x10b630d08>`
Objective candidate: binary:hinge
Objective candidate: multi:softmax
Objective candidate: multi:softprob
Objective candidate: rank:pairwise
Objective candidate: rank:ndcg
Objective candidate: rank:map
Objective candidate: reg:squarederror
Objective candidate: reg:squaredlogerror
Objective candidate: reg:logistic
Objective candidate: binary:logistic
Objective candidate: binary:logitraw
Objective candidate: reg:linear
Objective candidate: count:poisson
Objective candidate: survival:cox
Objective candidate: reg:gamma
Objective candidate: reg:tweedie