Size of sample in Random Forest Regression

Question

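If I understand correctly, bootstrapping is usually applied when computing Random Forest estimators, which means that tree (i) is built using only the data from sample (i), drawn with replacement. I would like to know what sample size sklearn's RandomForestRegressor uses.

The only thing I see that comes close is: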

bootstrap : boolean, optional (default=True)
    Whether bootstrap samples are used when building trees.

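But there is no way to specify the size or proportion of the sample, and the documentation does not state the default sample size either.

I feel there should at least be a way to find out the default sample size. What am I missing?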

Answer 1

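The bootstrap sample size is always equal to the number of samples in the input.

You are not missing anything; the same question was asked on the mailing list about RandomForestClassifier: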

The bootstrap sample size is always the same as the input sample size. If you feel up to it, a pull request updating the documentation would probably be quite welcome.
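
To make the quoted statement concrete, here is a minimal sketch of what a bootstrap sample of the same size as the input looks like. It uses only the standard library and does not reproduce scikit-learn's internal code; `bootstrap_indices` is a hypothetical helper written for this illustration.

```python
import random

def bootstrap_indices(n_samples, seed=None):
    """Draw n_samples row indices with replacement (hypothetical helper)."""
    rng = random.Random(seed)
    # k equals n_samples: the bootstrap sample is as large as the input.
    return rng.choices(range(n_samples), k=n_samples)

n = 1000
idx = bootstrap_indices(n, seed=0)

# Because sampling is with replacement, some rows repeat and others are
# left out, so the fraction of distinct rows is below 1.
unique_fraction = len(set(idx)) / n

print(len(idx))          # same length as the input
print(unique_fraction)   # expected to be near 1 - 1/e (about 0.63) for large n
```

So each tree sees as many rows as the training set contains, but only roughly 63% of the distinct rows on average; the rest are duplicates.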

Answer 2

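Hmm, I agree with you that it is quite strange that we cannot specify the subsample/bootstrap size in RandomForestRegressor. A potential workaround is to use BaggingRegressor instead: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingRegressor.html#sklearn.ensemble.BaggingRegressor

RandomForestRegressor is essentially a special case of BaggingRegressor (bootstrapping is used to reduce the variance of a set of low-bias, high-variance estimators). In RandomForestRegressor the base estimator is forced to be a DecisionTree, whereas in BaggingRegressor you are free to choose the base_estimator. More importantly, you can set a custom subsample size; for example, max_samples=0.5 draws random subsamples whose size equals half of the entire training set. You can also use only a subset of the features by setting max_features and bootstrap_features.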
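
A minimal sketch of this workaround follows. The synthetic data, the number of estimators, and the 0.75 feature fraction are assumptions chosen for illustration. The tree is passed positionally because the keyword for the base estimator differs between scikit-learn versions (base_estimator in older releases, estimator in newer ones). Note also that BaggingRegressor picks the feature subset once per estimator, whereas a random forest picks candidate features at every split, so the two are not exactly equivalent.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (assumption: 200 rows, 4 features).
rng = np.random.RandomState(0)
X = rng.uniform(size=(200, 4))
y = X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

model = BaggingRegressor(
    DecisionTreeRegressor(),   # base estimator, passed positionally
    n_estimators=25,
    max_samples=0.5,           # each tree is fit on 50% of the rows
    max_features=0.75,         # each tree is fit on 75% of the columns
    bootstrap=True,            # rows are drawn with replacement
    bootstrap_features=False,  # columns are drawn without replacement
    random_state=0,
)
model.fit(X, y)

# Inspect how many rows and columns each fitted tree actually received.
sample_sizes = {len(idx) for idx in model.estimators_samples_}
feature_counts = {len(f) for f in model.estimators_features_}
print(sample_sizes, feature_counts)
```

With 200 rows and 4 columns, every tree here is trained on 100 drawn rows and 3 columns.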

Answer 3

In version 0.22 of scikit-learn, the max_samples option was added, which does what you asked; see the documentation of the class.
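
As a rough sketch, assuming scikit-learn 0.22 or later, usage could look like the following. The data and hyperparameter values are made up for illustration. max_samples only takes effect when bootstrap=True; leaving it at its default of None keeps the old behaviour of drawing as many rows as the training set contains.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (assumption: 200 rows, 4 features).
rng = np.random.RandomState(0)
X = rng.uniform(size=(200, 4))
y = X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

forest = RandomForestRegressor(
    n_estimators=50,
    bootstrap=True,     # required for max_samples to be used
    max_samples=0.5,    # a float is read as a fraction of the rows
    random_state=0,
)
forest.fit(X, y)

preds = forest.predict(X[:5])
print(preds.shape)
```

An integer can be passed instead of a float to request an absolute number of rows per tree.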
