Understanding the max_features parameter in RandomForestClassifier

Question

I am analyzing RandomForestClassifier and need some help.

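The max_features parameter gives the maximum number of features considered for a split in a random forest, and it is usually set to sqrt(n_features). If m is the sqrt of n, then the number of feature combinations available to a decision tree is nCm. What happens if nCm is smaller than n_estimators (the number of decision trees in the forest)?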

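Example: for n = 7, max_features is 3, so nCm is 35, which means the decision trees have 35 unique feature combinations. With n_estimators = 100, will the remaining 65 trees reuse feature combinations? If so, will the trees be correlated and therefore introduce bias into the answer?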

Answer 1

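The max_features parameter sets the maximum number of features considered at each split, not once per tree. So, if a tree has p nodes, the feature subset is redrawn at each of them.
max_samples controls how many data points are sampled from X for each tree. By default, the sample has the same size as X.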
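
To make the per-split draw concrete, here is a minimal simulation in plain Python. It is only a sketch of the idea, not scikit-learn's internal code: the function name draw_split_features and the fixed node count are assumptions made for illustration.

```python
import random

def draw_split_features(n_features, max_features, n_nodes, seed=None):
    """Hypothetical helper: draw one random feature subset per node."""
    rng = random.Random(seed)
    return [
        tuple(sorted(rng.sample(range(n_features), max_features)))
        for _ in range(n_nodes)
    ]

# Two "trees" with 7 features, 3 candidate features per split, 10 nodes each.
tree_a = draw_split_features(7, 3, 10, seed=0)
tree_b = draw_split_features(7, 3, 10, seed=1)

for node, (a, b) in enumerate(zip(tree_a, tree_b)):
    print(f"node {node}: tree A candidates {a}, tree B candidates {b}")
```

Because the subset is redrawn at every node, two trees would have to match on the whole sequence of draws, not on a single 3-feature combination, before their candidate features coincide.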

From the documentation:

max_samples int or float, default=None

If bootstrap is True, the number of samples to draw from X to train each base estimator.

If None (default), then draw X.shape[0] samples.
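
The bootstrap draw described in that excerpt can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the library's implementation: bootstrap_indices is a hypothetical helper, and it only handles an integer max_samples (or None), not the float form.

```python
import random

def bootstrap_indices(n_samples, max_samples=None, seed=None):
    """Hypothetical helper: draw row indices with replacement."""
    rng = random.Random(seed)
    # None mirrors the documented default: draw as many rows as X has.
    size = n_samples if max_samples is None else max_samples
    return rng.choices(range(n_samples), k=size)

n_rows = 1000
idx = bootstrap_indices(n_rows, seed=0)
unique_fraction = len(set(idx)) / n_rows

print(len(idx))          # same length as X
print(unique_fraction)   # typically around 0.63 for large n_rows
```

Since rows are drawn with replacement, each tree sees a different multiset of rows, which is a second source of variation on top of the per-split feature draw.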

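So a rough count of the unique trees that can be formed is p! * nCm * (2s-1)! / (s!(s-1)!), where s is the number of samples in X. The last factor counts the distinct bootstrap samples of size s that can be drawn with replacement from s data points.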

For your example, let's assume each tree has 10 nodes and X contains 10 samples.

10! * 7C3 * (19! / (10! * 9!))
= 3628800 * 35 * 92378
= 11732745024000 (about 1.17 * 10^13)
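
The arithmetic above can be checked with Python's standard library. The function name and parameter names below are assumptions introduced for this check, and the formula is the rough count from this answer rather than an exact enumeration of trees.

```python
import math

def count_tree_configurations(n_features, max_features, n_nodes, n_samples):
    """Hypothetical helper implementing the rough count used in this answer."""
    # Ways to choose max_features out of n_features (nCm).
    feature_subsets = math.comb(n_features, max_features)
    # Distinct bootstrap samples: multisets of size n_samples from n_samples rows.
    bootstrap_samples = math.comb(2 * n_samples - 1, n_samples)
    return math.factorial(n_nodes) * feature_subsets * bootstrap_samples

total = count_tree_configurations(
    n_features=7, max_features=3, n_nodes=10, n_samples=10
)
print(total)  # 11732745024000
```

Using integer arithmetic avoids the float rounding that a factorial-based division would introduce for larger inputs.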

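So, for any reasonably sized dataset, the number of possible trees is far larger than n_estimators, and repeated trees will not introduce any bias.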

decision-tree

random-forest

scikit-learn

ensembles