How can the SciKit-Learn Random Forest sub-sample size be equal to the original training data size?

Question

The documentation of the SciKit-Learn Random Forest classifier states:

The sub-sample size is always the same as the original input sample size but the samples are drawn with replacement if bootstrap=True (default).

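What I don't understand is this: if the sub-sample size is always the same as the input sample size, how can we talk about a random selection? There seems to be no selection at all, because every tree is trained on all (and therefore the same) samples.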

Am I missing something?

Answer 1

I believe this part of the docs answers your question:

In random forests (see RandomForestClassifier and RandomForestRegressor classes), each tree in the ensemble is built from a sample drawn with replacement (i.e., a bootstrap sample) from the training set. In addition, when splitting a node during the construction of the tree, the split that is chosen is no longer the best split among all features. Instead, the split that is picked is the best split among a random subset of the features. As a result of this randomness, the bias of the forest usually slightly increases (with respect to the bias of a single non-random tree) but, due to averaging, its variance also decreases, usually more than compensating for the increase in bias, hence yielding an overall better model.

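The key to understanding this is "sample drawn with replacement". It means that each instance can be drawn more than once. That in turn means that some instances of the training set appear several times in a tree's sample while others do not appear at all (these are the out-of-bag instances). Which instances are repeated and which are left out differs from tree to tree.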
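The effect can be sketched with the standard library alone. This is an illustration of sampling with replacement, not scikit-learn's internal code; the helper bootstrap_indices, the seed, and the 10-row training set are assumptions made for the example.

```python
import random

def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices uniformly at random, with replacement."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]

rng = random.Random(0)  # fixed seed so the run is repeatable
n_samples = 10          # hypothetical tiny training set, rows 0..9

bags = []
for tree in range(3):
    in_bag = bootstrap_indices(n_samples, rng)
    # Rows that were never drawn for this tree are its out-of-bag rows.
    out_of_bag = sorted(set(range(n_samples)) - set(in_bag))
    bags.append((in_bag, out_of_bag))
    print(f"tree {tree}: in-bag={sorted(in_bag)} out-of-bag={out_of_bag}")
```

Every in-bag list has exactly 10 entries, yet each tree typically sees a different mix of repeated and missing rows.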

Answer 2

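Certainly not all samples are selected for each tree. By default, each sample has a 1-((N-1)/N)^N ~ 0.63 chance of being drawn at least once for one particular tree, where N is the size of the training set. The number of times a given sample is drawn follows a Binomial(N, 1/N) distribution, which for large N is close to a Poisson distribution with mean 1: about 0.37 for not being drawn at all, 0.37 for being drawn exactly once, 0.18 for exactly twice, 0.06 for exactly 3 times, and so on.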
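These probabilities can be checked numerically. The sketch below assumes each of the N draws picks a row uniformly at random, so the number of times one given row is drawn is Binomial(N, 1/N); the function names and N = 1000 are chosen for illustration only.

```python
import random
from collections import Counter
from math import comb

def p_drawn_at_least_once(n):
    """Probability that one given row appears in a bootstrap sample of size n."""
    return 1 - ((n - 1) / n) ** n

def p_drawn_exactly(k, n):
    """Binomial(n, 1/n) probability that one given row is drawn exactly k times."""
    return comb(n, k) * (1 / n) ** k * (1 - 1 / n) ** (n - k)

for n in (10, 100, 1000):
    print(n, round(p_drawn_at_least_once(n), 4))

n = 1000
exact = [p_drawn_exactly(k, n) for k in range(4)]
print([round(p, 3) for p in exact])

# Empirical check: draw one bootstrap sample and tally how often each row appears.
rng = random.Random(0)
draws = Counter(rng.randrange(n) for _ in range(n))
times_drawn = Counter(draws.get(i, 0) for i in range(n))
print({k: times_drawn[k] / n for k in range(4)})
```

The empirical shares from a single bootstrap sample fluctuate around the exact values, and the chance of appearing at least once approaches 1 - 1/e as N grows.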

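Each bootstrap sample is on average sufficiently different from the other bootstrap samples that the decision trees are adequately different, so the average prediction of the trees is robust to the variance of each individual tree model. If the sample size were increased to 5 times the training set size, every observation would probably be present 3-7 times in each tree, and the overall ensemble prediction performance would suffer.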
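The 5x scenario can be simulated in a few lines. This is only a sketch of the sampling step, with a hypothetical training-set size of 1000 and a fixed seed; it does not train any trees and says nothing by itself about prediction quality.

```python
import random
from collections import Counter

rng = random.Random(0)
n = 1000        # hypothetical training-set size
factor = 5      # draw 5 * n rows with replacement instead of n

draws = Counter(rng.randrange(n) for _ in range(factor * n))
counts = [draws.get(i, 0) for i in range(n)]

never_drawn = sum(c == 0 for c in counts) / n
between_3_and_7 = sum(3 <= c <= 7 for c in counts) / n
print(f"share never drawn: {never_drawn:.3f}")
print(f"share drawn 3 to 7 times: {between_3_and_7:.3f}")
```

With the oversized sample almost no row is left out of any tree, so the trees see nearly the same data and lose the diversity that the averaging relies on.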

Answer 3

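The answer by @communitywiki misses this part of the question: "What I don't understand is: if the sample size is always the same as the input sample size, how can we talk about a random selection?" The answer has to do with the nature of bootstrapping itself. Bootstrapping repeats some values a varying number of times while keeping the same sample size as the original data. An example (courtesy of the Approach section of the Wikipedia page on Bootstrapping):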

Original sample: [1,2,3,4,5]
Bootstrap 1: [1,2,4,4,1]
Bootstrap 2: [1,1,3,3,5]

and so on.

This is how random selection can occur while the sample size still remains the same.

Answer 4

Although I am pretty new to Python, I had a similar problem.

I tried to fit a RandomForestClassifier to my data. I split the data into train and test sets:

from sklearn.model_selection import train_test_split
train_x, test_x, train_y, test_y = train_test_split(X, Y, test_size=0.2, random_state=0)

The DataFrames had the same length, but after I predicted with the model:

rfc_pred = rf_mod.predict(test_x)

the results had a different length.

To solve this, I set the bootstrap option to False:

param_grid = {
    'bootstrap': [False],
    'max_depth': [110, 150, 200],
    'max_features': [3, 5],
    'min_samples_leaf': [1, 3],
    'min_samples_split': [6, 8],
    'n_estimators': [100, 200]
}

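Then I ran the whole process again from the start. It worked fine and I could calculate my confusion matrix. But I would like to understand how to use bootstrap and still generate predictions of the same length.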
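For what it is worth, bootstrapping only changes which training rows each tree sees during fit; predict returns one label per row of whatever is passed to it. The sketch below suggests this on synthetic data from make_classification, which is a stand-in because the original data is not shown; the small n_estimators value is likewise an assumption to keep the run short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the poster's data, which is not shown.
X, Y = make_classification(n_samples=200, n_features=8, random_state=0)
train_x, test_x, train_y, test_y = train_test_split(
    X, Y, test_size=0.2, random_state=0
)

lengths = {}
for bootstrap in (True, False):
    rf_mod = RandomForestClassifier(
        n_estimators=10, bootstrap=bootstrap, random_state=0
    )
    rf_mod.fit(train_x, train_y)
    rfc_pred = rf_mod.predict(test_x)
    # One prediction per test row, whichever bootstrap setting was used.
    lengths[bootstrap] = len(rfc_pred)
    print(bootstrap, len(rfc_pred), len(test_y))
```

If the lengths differ in a real script, the cause is probably elsewhere, for example predictions being compared against a different array than the one passed to predict.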
