How does the predict_proba() function in LightGBM work internally?

This is in reference to understanding, internally, how the probabilities for a class are predicted using LightGBM.

Other packages, such as sklearn, provide thorough detail for their classifiers. For example:

Probability estimates.

The returned estimates for all classes are ordered by the label of classes.

For a multi_class problem, if multi_class is set to be "multinomial" the softmax function is used to find the predicted probability of each class. Else use a one-vs-rest approach, i.e. calculate the probability of each class assuming it to be positive using the logistic function, and normalize these values across all the classes.

Predict class probabilities for X.

The predicted class probabilities of an input sample are computed as the mean predicted class probabilities of the trees in the forest. The class probability of a single tree is the fraction of samples of the same class in a leaf.

There are also other Stack Overflow questions that provide more details, such as:

I am trying to uncover those same details for LightGBM's predict_proba function. The documentation does not list the details of how the probabilities are calculated.

The documentation simply states:

Return the predicted probability for each class for each sample.

The source code is shown below:

def predict_proba(self, X, raw_score=False, start_iteration=0, num_iteration=None,
                  pred_leaf=False, pred_contrib=False, **kwargs):
    """Return the predicted probability for each class for each sample.

    Parameters
    ----------
    X : array-like or sparse matrix of shape = [n_samples, n_features]
        Input features matrix.
    raw_score : bool, optional (default=False)
        Whether to predict raw scores.
    start_iteration : int, optional (default=0)
        Start index of the iteration to predict.
        If <= 0, starts from the first iteration.
    num_iteration : int or None, optional (default=None)
        Total number of iterations used in the prediction.
        If None, if the best iteration exists and start_iteration <= 0, the best iteration is used;
        otherwise, all iterations from ``start_iteration`` are used (no limits).
        If <= 0, all iterations from ``start_iteration`` are used (no limits).
    pred_leaf : bool, optional (default=False)
        Whether to predict leaf index.
    pred_contrib : bool, optional (default=False)
        Whether to predict feature contributions.

        .. note::

            If you want to get more explanations for your model's predictions using SHAP values,
            like SHAP interaction values,
            you can install the shap package (https://github.com/slundberg/shap).
            Note that unlike the shap package, with ``pred_contrib`` we return a matrix with an extra
            column, where the last column is the expected value.

    **kwargs
        Other parameters for the prediction.

    Returns
    -------
    predicted_probability : array-like of shape = [n_samples, n_classes]
        The predicted probability for each class for each sample.
    X_leaves : array-like of shape = [n_samples, n_trees * n_classes]
        If ``pred_leaf=True``, the predicted leaf of every tree for each sample.
    X_SHAP_values : array-like of shape = [n_samples, (n_features + 1) * n_classes] or list with n_classes length of such objects
        If ``pred_contrib=True``, the feature contributions for each sample.
    """
    result = super(LGBMClassifier, self).predict(X, raw_score, start_iteration, num_iteration,
                                                 pred_leaf, pred_contrib, **kwargs)
    if callable(self._objective) and not (raw_score or pred_leaf or pred_contrib):
        warnings.warn("Cannot compute class probabilities or labels "
                      "due to the usage of customized objective function.\n"
                      "Returning raw scores instead.")
        return result
    elif self._n_classes > 2 or raw_score or pred_leaf or pred_contrib:
        return result
    else:
        return np.vstack((1. - result, result)).transpose()

How can I understand exactly how LightGBM's predict_proba function works internally?

Short explanation

Below we can see an illustration of what each method calls under the hood. First, the predict_proba() method of the class LGBMClassifier calls the predict() method from LGBMModel (it inherits from it).

LGBMClassifier.predict_proba() (inherits from LGBMModel)
  |---->LGBMModel().predict() (calls LightGBM Booster)
          |---->Booster.predict()

Then, it calls the predict() method from the LightGBM Booster (the Booster class). In order to call this method, the Booster should be trained first.

Basically, the Booster is the one that generates the predicted value for each sample by calling its predict() method. See below for a detailed explanation of how this Booster works.
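
As a concrete illustration, here is a minimal sketch (not LightGBM's own code; the dataset and parameters are assumptions for demonstration) showing that, for a binary problem, the trained Booster already returns the positive-class probability, and predict_proba() merely stacks it into the two-column format seen in the source above:

import numpy as np
import lightgbm as lgb
from sklearn.datasets import make_classification

# Illustrative toy data (an assumption, not part of the original post)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

clf = lgb.LGBMClassifier(n_estimators=10)
clf.fit(X, y)

# The underlying Booster returns P(y = 1) directly for binary objectives...
booster_pred = clf.booster_.predict(X)   # shape (n_samples,)

# ...and predict_proba() stacks it into [P(y = 0), P(y = 1)] columns.
proba = clf.predict_proba(X)             # shape (n_samples, 2)

assert np.allclose(proba[:, 1], booster_pred)
assert np.allclose(proba.sum(axis=1), 1.0)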

Detailed explanation, or how does the LightGBM Booster work?

Here we try to answer the question of how the LightGBM Booster works. By going through the Python code, we can get a general idea of how it is trained and updated. However, there are some further references to LightGBM's C++ libraries that I cannot explain. Nonetheless, a general overview of LightGBM's Booster workflow is given below.

A. Initializing and training the Booster

The _Booster of LGBMModel is initialized by calling the train() function; on line 595 of sklearn.py we see the following code:

self._Booster = train(params, train_set,
                      self.n_estimators, valid_sets=valid_sets, valid_names=eval_names,
                      early_stopping_rounds=early_stopping_rounds,
                      evals_result=evals_result, fobj=self._fobj, feval=feval,
                      verbose_eval=verbose, feature_name=feature_name,
                      callbacks=callbacks, init_model=init_model)

Note. train() comes from engine.py.
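
In other words, LGBMModel.fit() delegates the actual training to the module-level train() function. A hedged sketch of calling it directly (the data and parameters below are illustrative assumptions):

import lightgbm as lgb
from sklearn.datasets import make_classification

# Illustrative toy data (an assumption for demonstration purposes)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
train_set = lgb.Dataset(X, label=y)

# train() builds and repeatedly updates a Booster, then returns it
booster = lgb.train({"objective": "binary"}, train_set, num_boost_round=10)
print(type(booster))  # <class 'lightgbm.basic.Booster'>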

Inside train() we see that the Booster is initialized (line 231):

# construct booster
try:
    booster = Booster(params=params, train_set=train_set)
...

and updated at each training iteration (line 242):

for i in range_(init_iteration, init_iteration + num_boost_round):
     ...
     ... 
     booster.update(fobj=fobj)
     ...
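
Here is a minimal sketch of what this loop amounts to, constructing the Booster and calling update() by hand instead of through train() (the data and parameters are illustrative assumptions):

import lightgbm as lgb
from sklearn.datasets import make_classification

# Illustrative toy data (an assumption for demonstration purposes)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
train_set = lgb.Dataset(X, label=y)

booster = lgb.Booster(params={"objective": "binary"}, train_set=train_set)
for i in range(10):
    # One boosting iteration; with no fobj this ends up in the C++ call
    # LGBM_BoosterUpdateOneIter (see section B below).
    is_finished = booster.update()
    if is_finished:
        break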

B. How does booster.update() work?

To understand how the update() method works, we should go to line 2315 of basic.py. Here, we see that this function updates the Booster for one iteration.

There are two alternatives to update the Booster, depending on whether or not you provide an objective function.

  • Objective function is None

On line 2367 we get the following code:

if fobj is None:
    ...
    ...
    _safe_call(_LIB.LGBM_BoosterUpdateOneIter(
               self.handle,
               ctypes.byref(is_finished)))
    self.__is_predicted_cur_iter = [False for _ in range_(self.__num_dataset)]
    return is_finished.value == 1

Notice that as the objective function (fobj) is not provided, it updates the Booster by calling LGBM_BoosterUpdateOneIter from _LIB. In short, _LIB is the loaded C++ LightGBM library.

What is _LIB?

_LIB is a variable that stores the loaded LightGBM library, obtained by calling _load_lib() (line 29 of basic.py).

Then _load_lib() loads the LightGBM library by finding on your system the path to lib_lightgbm.dll (Windows) or lib_lightgbm.so (Linux).
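
A simplified, illustrative sketch of that loading step (the hard-coded path is an assumption; the real _load_lib() searches several candidate locations for the compiled library):

import ctypes

# Hypothetical path to the compiled library; adjust for your system.
lib_path = "/usr/local/lib/lib_lightgbm.so"
_LIB = ctypes.cdll.LoadLibrary(lib_path)

# The C entry points referenced above then become attributes of _LIB, e.g.
# _LIB.LGBM_BoosterUpdateOneIter and _LIB.LGBM_BoosterGetPredict.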

  • Objective function is provided

When a custom objective function is encountered, we get the following case:

else:
    ...
    ...
    grad, hess = fobj(self.__inner_predict(0), self.train_set)

where __inner_predict() is a method of LightGBM's Booster (see line 1930 of basic.py for more details of the Booster class), which predicts for the training and validation data. Inside __inner_predict() (line 3142 of basic.py) we see that it calls LGBM_BoosterGetPredict from _LIB to get the predictions, that is,

_safe_call(_LIB.LGBM_BoosterGetPredict(
                self.handle,
                ctypes.c_int(data_idx),
                ctypes.byref(tmp_out_len),
                data_ptr))
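
For reference, here is a sketch of what such an fobj can look like: it receives the raw predictions from __inner_predict(0) together with the training Dataset, and must return the gradient and hessian of the loss. Binary log-loss is used here purely as an illustrative assumption:

import numpy as np

def binary_logloss_fobj(preds, train_set):
    """Custom objective in the (preds, Dataset) -> (grad, hess) form."""
    labels = train_set.get_label()
    probs = 1.0 / (1.0 + np.exp(-preds))  # sigmoid of the raw scores
    grad = probs - labels                 # first derivative of log-loss
    hess = probs * (1.0 - probs)          # second derivative of log-loss
    return grad, hess

# Used as: booster.update(fobj=binary_logloss_fobj)

Note that, as the predict_proba() source at the top shows, a model fitted with a custom objective makes predict_proba() warn and return the raw scores instead of probabilities.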

Finally, after updating the Booster range_(init_iteration, init_iteration + num_boost_round) times, it will have been trained. Thus, Booster.predict() can be called by LGBMClassifier.predict_proba().

Note. The Booster is trained as part of the model fitting step, specifically by LGBMModel.fit(); see line 595 of sklearn.py for code details.

LightGBM, like all gradient boosting methods for classification, essentially combines decision trees and logistic regression. We start with the same logistic function representing a probability (a.k.a. softmax):

P(y = 1 | X) = 1/(1 + exp(-Xw))

Interestingly, the feature matrix X is composed of the terminal nodes of the decision tree ensemble. These are then all weighted by w, a parameter that must be learned. The mechanism used to learn the weights depends on the precise learning algorithm used. Similarly, the construction of X also depends on the algorithm. LightGBM, for example, introduced two novel features which won it the performance improvements over XGBoost: "Gradient-based One-Side Sampling" and "Exclusive Feature Bundling". Generally though, each row collects the terminal leaves for each sample, and the columns represent the terminal leaves.
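
For the binary case, this is easy to verify empirically: the positive-class probability returned by predict_proba() is the logistic function applied to the raw ensemble score. A minimal sketch (data and parameters are illustrative assumptions):

import numpy as np
import lightgbm as lgb
from sklearn.datasets import make_classification

# Illustrative toy data (an assumption for demonstration purposes)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
clf = lgb.LGBMClassifier(n_estimators=10).fit(X, y)

raw = clf.booster_.predict(X, raw_score=True)  # summed weighted leaf values
proba = clf.predict_proba(X)[:, 1]

assert np.allclose(proba, 1.0 / (1.0 + np.exp(-raw)))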

So here is what the documentation could say...

Probability estimates.

The predicted class probabilities of an input sample are computed as the softmax of the weighted terminal leaves from the decision tree ensemble corresponding to the provided sample.

For more details, you would have to dive into the specifics of boosting, XGBoost, and finally the LightGBM paper, but that seems a bit heavy-handed given the other documentation examples you gave.