How to use the whole training set to estimate class probabilities in sklearn RandomForest

I want to use scikit-learn's RandomForestClassifier to estimate the probabilities that a given example belongs to each of a set of classes, after prior training of course.

I know I can get the class probabilities using the predict_proba method, which computes them as

[...] the mean predicted class probabilities of the trees in the forest.

As mentioned in this question:

The probabilities returned by a single tree are the normalized class histograms of the leaf a sample lands in.
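The two quotes above can be checked with a minimal sketch. The synthetic dataset and the variable names here are illustrative assumptions and not part of the original question. It averages the per-tree predict_proba outputs by hand and compares the result with the forest's own predict_proba.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# Each tree returns the normalized class histogram of the leaf a sample lands in.
per_tree = np.stack([tree.predict_proba(X) for tree in forest.estimators_])

# The forest's probability is the mean of those per-tree probabilities.
manual_proba = per_tree.mean(axis=0)
forest_proba = forest.predict_proba(X)

matches = np.allclose(manual_proba, forest_proba)
print(matches)
```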

Now, I have been reading some papers on probability estimation and realized that there is no simple solution. According to Estimating Class Probabilities in Random Forests (Boström):

using the same examples to both grow the trees and estimate the probabilities, [...] by necessity will lead to pure (and therefore small) estimation sets

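This is bad. The solution seems to be to use all examples in the training set, rather than only the examples in the bootstrap sample that were used to grow the tree.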

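Scikit-learn does use only the bootstrap sample for each tree to calculate the probability estimate of each class, right? Does anyone have any pointers on how to proceed so that the class probabilities from a RandomForest come from the whole training set?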

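I assume this would require some special Tree subclassing that does not assign class probabilities to the leaves of the tree, plus some procedure that then computes them from the RandomForest classifier using the whole training set.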
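One possible way to do this without subclassing is sketched below. It is an assumption-laden illustration, not scikit-learn's built-in behaviour and not necessarily the exact procedure from the paper. The helper full_train_proba is a hypothetical name. It uses forest.apply to find the leaf each row lands in, rebuilds each leaf's class histogram from all training rows, and averages the normalized histograms over the trees.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def full_train_proba(forest, X_train, y_train, X_new):
    """Hypothetical helper: leaf class frequencies re-estimated from ALL training rows."""
    classes = forest.classes_
    y_idx = np.searchsorted(classes, y_train)   # map labels to 0..n_classes-1
    train_leaves = forest.apply(X_train)        # shape (n_train, n_trees)
    new_leaves = forest.apply(X_new)            # shape (n_new, n_trees)

    proba = np.zeros((X_new.shape[0], len(classes)))
    for t, tree in enumerate(forest.estimators_):
        # Class histogram per node, counted over the whole training set.
        counts = np.zeros((tree.tree_.node_count, len(classes)))
        np.add.at(counts, (train_leaves[:, t], y_idx), 1.0)

        leaf_counts = counts[new_leaves[:, t]]
        totals = leaf_counts.sum(axis=1, keepdims=True)
        # Guard against division by zero, although every leaf should hold
        # at least one training row (the bootstrap rows that created it).
        proba += leaf_counts / np.maximum(totals, 1.0)
    return proba / len(forest.estimators_)


# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

p_full = full_train_proba(forest, X, y, X[:5])
p_default = forest.predict_proba(X[:5])
print(p_full)
print(p_default)
```

Because the trees are grown to purity on their bootstrap samples by default, the rows that a tree did not see in its bootstrap sample are what can make the re-estimated leaf histograms less extreme than the default ones.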

Scikit-learn does use only the bootstrap sample for each tree to calculate the probability estimate of each class, right?

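No, it uses only the in-sample part, so it will not give very well-calibrated probability outputs (I guess that is what the paper is suggesting).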

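You can get better probability estimates by using out-of-sample estimates, which can even be done easily with the current codebase. It may be better to use calibration methods as post-processing (using the out-of-bag samples).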
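A minimal sketch of both suggestions follows, assuming a binary problem and synthetic data. The choice of IsotonicRegression as the calibrator is an assumption made for illustration; the answer does not name a specific calibration method. With oob_score=True the fitted forest exposes oob_decision_function_, which holds, for each training row, the probabilities averaged over only the trees that did not see that row.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.isotonic import IsotonicRegression

# Synthetic binary data, used only for illustration.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

forest = RandomForestClassifier(
    n_estimators=200, bootstrap=True, oob_score=True, random_state=0
).fit(X, y)

# Out-of-bag probability estimates, shape (n_samples, n_classes).
oob_proba = forest.oob_decision_function_

# Post-process: fit a calibrator on the out-of-bag scores of the positive class.
calibrator = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
calibrator.fit(oob_proba[:, 1], y)

# Calibrated positive-class probabilities for some rows (here the first five).
raw_pos = forest.predict_proba(X[:5])[:, 1]
calibrated_pos = calibrator.predict(raw_pos)
print(calibrated_pos)
```

With few trees some rows may never be out-of-bag, in which case their entries in oob_decision_function_ are not meaningful, so a reasonably large n_estimators is used here.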

In short, what you want to implement is already the default.

machine-learning

random-forest

scikit-learn