How is the out-of-bag error calculated, exactly, and what are its implications?

I found several explanations of the out-of-bag error, including one on Stack Overflow: What is out of bag error in random forests

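But I could not find any formula for how exactly it is calculated. Take the MATLAB help file as an example: err = oobError(B) computes the misclassification probability [...]. B is the model of trees generated with the class TreeBagger.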

What is the misclassification probability? Is it simply the accuracy of the out-of-bag data?

Accuracy = (TP + TN) / (P + N)

So simply the ratio of all correctly classified instances to all instances in the set?

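If this is correct, I on the one hand see the benefit of calculating it, as it is quite simple if you have some dataset to test on anyway, as the out-of-bag dataset is.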

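But on the other hand, accuracy is known to be not a good metric when it comes to imbalanced datasets. So my second question then is: Can the out-of-bag error cope with imbalanced datasets, and if not, is it even a valid point to specify it in such cases?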

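The out-of-bag error is simply the error computed on samples that were not seen during training. It plays an important role in bagging methods because bootstrapping the training set (building a new set by random drawing with replacement) leaves a considerable share of the training data unused by each model: in the limit about 1/e, roughly 37%, often quoted as about a third. If you have many such models (as in a random forest, where you have many trees, each trained on its own bootstrap sample), you can average these errors and obtain an estimate of the generalization error.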
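To make the averaging concrete, here is a minimal sketch in plain Python. It is not MATLAB's or scikit-learn's implementation: the base learner is a hypothetical one-dimensional threshold stump (`fit_stump`), the toy data are two overlapping Gaussians, and the names `oob_error`, `n_models` and `seed` are made up for illustration. Each model is scored only on the samples missing from its own bootstrap sample, and the per-model errors are then averaged as described above. Library implementations may aggregate differently, for example by voting per sample over the trees that did not see it.

```python
import random


def fit_stump(xs, ys):
    """Fit a 1-D threshold classifier: predict `hi` if x > t, else `lo`."""
    best = None
    for t in sorted(set(xs)):
        for lo, hi in ((0, 1), (1, 0)):
            errors = sum((hi if x > t else lo) != y for x, y in zip(xs, ys))
            if best is None or errors < best[0]:
                best = (errors, t, lo, hi)
    _, t, lo, hi = best
    return lambda x: hi if x > t else lo


def oob_error(xs, ys, n_models=30, seed=0):
    """Return (mean per-model OOB error, mean fraction of OOB samples)."""
    rng = random.Random(seed)
    n = len(xs)
    errors, fractions = [], []
    for _ in range(n_models):
        # Bootstrap sample: n indices drawn with replacement.
        in_bag = [rng.randrange(n) for _ in range(n)]
        seen = set(in_bag)
        # Out-of-bag samples: everything this model never saw.
        oob = [i for i in range(n) if i not in seen]
        if not oob:
            continue  # practically impossible for large n, but stay safe
        model = fit_stump([xs[i] for i in in_bag], [ys[i] for i in in_bag])
        wrong = sum(model(xs[i]) != ys[i] for i in oob)
        errors.append(wrong / len(oob))
        fractions.append(len(oob) / n)
    return sum(errors) / len(errors), sum(fractions) / len(fractions)


# Toy data (an assumption for illustration): two overlapping classes.
data_rng = random.Random(42)
xs = [data_rng.gauss(0.0, 1.0) for _ in range(100)]
xs += [data_rng.gauss(2.0, 1.0) for _ in range(100)]
ys = [0] * 100 + [1] * 100

err, frac = oob_error(xs, ys)
print(f"mean OOB error: {err:.3f}, mean OOB fraction: {frac:.3f}")
```

The printed out-of-bag fraction should land near 1/e, which is where the "about a third of the data is unused" rule of thumb comes from.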

What is the misclassification probability? Is it simply the accuracy of the out-of-bag data?

The misclassification probability is 1 - Accuracy.
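For the binary case this can be spelled out in confusion-matrix terms. The following is only a restatement under the assumption that accuracy is the chosen metric, with TP, TN, FP and FN counted on the out-of-bag samples and P + N the total number of those samples:

```latex
\text{error}_{\text{OOB}}
  = 1 - \text{Accuracy}
  = 1 - \frac{TP + TN}{P + N}
  = \frac{FP + FN}{P + N}
```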

If this is correct, I on the one hand see the benefit of calculating it, as it is quite simple if you have some dataset to test on anyway, as the out-of-bag dataset is.

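Using a single test set only gives an approximation of the quality of your current model (whatever it is), whereas out-of-bag is an estimate for a single element of the ensemble (a tree, in the case of a random forest) over all possible choices of the training set. This is a different probabilistic measure; see, for example, Chapter 7 of The Elements of Statistical Learning (Hastie, Tibshirani and Friedman). Furthermore, it has the advantage that you do not waste any points. Holding out a separate test set requires a lot of points, so that you can still get a reasonable estimate (model) from the remaining data. The out-of-bag estimate lets you say how well the model performs while at the same time using all the available data.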

But on the other hand, accuracy is known to be not a good metric when it comes to imbalanced datasets. So my second question then is: Can the out-of-bag error cope with imbalanced datasets, and if not, is it even a valid point to specify it in such cases?

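The out-of-bag error has nothing to do with accuracy as such. It is implemented in scikit-learn to work with accuracy, but it is defined for any loss function (classification metric). There is an exact analogue with MCC, F1, or anything else you want.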
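As a rough illustration of swapping the metric, the sketch below scores one set of out-of-bag predictions with accuracy, F1 and MCC. Everything here is an assumption for the example: the labels and predictions are made up (95 negatives, 5 positives, and a degenerate predictor that always outputs the majority class), the helper names are hypothetical, and returning 0.0 when a denominator is zero is just one common convention.

```python
import math


def confusion(y_true, y_pred, positive=1):
    """Count TP, TN, FP, FN for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)


def f1(tp, tn, fp, fn):
    denom = 2 * tp + fp + fn
    # Convention (assumption): F1 is 0.0 when there are no positives at all.
    return 2 * tp / denom if denom else 0.0


def mcc(tp, tn, fp, fn):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention (assumption): MCC is 0.0 when the denominator vanishes.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Made-up out-of-bag labels and predictions on an imbalanced problem.
y_true = [0] * 95 + [1] * 5
y_oob_pred = [0] * 100  # always predicts the majority class

counts = confusion(y_true, y_oob_pred)
oob_scores = {
    "accuracy": accuracy(*counts),
    "f1": f1(*counts),
    "mcc": mcc(*counts),
}
print(oob_scores)
```

On data like this the accuracy-based score looks excellent while F1 and MCC expose that no positive instance was ever found, which is why the choice of metric matters more than whether it is evaluated out-of-bag or on a held-out set.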

matlab

classification

machine-learning

random-forest