Why isn't the sum of "value" equal to the number of "samples" in scikit-learn RandomForestClassifier?

Question

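I built a random forest with RandomForestClassifier and plotted its decision trees. What does the parameter "value" (pointed to by the red arrows) mean? And why doesn't the sum of the two numbers in [] equal the number of "samples"? I have seen other examples where the two numbers in [] do add up to the number of "samples". Why is that not the case here?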

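import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("Dataset.csv")
df.drop(['Flow ID', 'Inbound'], axis=1, inplace=True)
df.replace([np.inf, -np.inf], np.nan, inplace=True)
df.dropna(inplace=True)
# Encode the labels with .loc to avoid chained assignment
df.loc[df.Label == 'BENIGN', 'Label'] = 0
df.loc[df.Label == 'DrDoS_LDAP', 'Label'] = 1
Y = df["Label"].values
Y = Y.astype('int')
X = df.drop(labels=["Label"], axis=1)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.5)
model = RandomForestClassifier(n_estimators=20)
model.fit(X_train, Y_train)
Accuracy = model.score(X_test, Y_test)

# Plot every tree of the forest and save each figure
for i in range(len(model.estimators_)):
    fig = plt.figure(figsize=(15, 15))
    # Use X.columns: df.columns would also include the "Label" column
    tree.plot_tree(model.estimators_[i], feature_names=list(X.columns), class_names=['Benign', 'DDoS'])
    plt.savefig('./TheForest/T' + str(i))
    plt.close(fig)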

Answer 1

Nice catch.

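Although it is undocumented, this is due to the bootstrap sampling that takes place by default in a Random Forest model (see my other answer for more on the details of the RF algorithm and how it differs from a mere "bunch" of decision trees).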

Let's see an example with the iris data:

from sklearn.datasets import load_iris
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()

rf = RandomForestClassifier(max_depth = 3)
rf.fit(iris.data, iris.target)

tree.plot_tree(rf.estimators_[0]) # take the first tree

The result here is similar to what you report: for every node except the lower-right one, sum(value) does not equal samples, even though the two should normally be equal.

A careful observer will notice something else that looks odd: the iris dataset has 150 samples:

print(iris.DESCR)

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class

and the root node of the tree should include all of them, yet samples for the first node is only 89.

Why is that, and what exactly is going on here? To find out, let's fit a second RF model, this time without bootstrap sampling (i.e. with bootstrap=False):

rf2 = RandomForestClassifier(max_depth = 3, bootstrap=False) # no bootstrap sampling
rf2.fit(iris.data, iris.target)

tree.plot_tree(rf2.estimators_[0]) # take again the first tree

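Well, now that we have disabled bootstrap sampling, everything looks "nice": the sum of value in every node equals samples, and the root node indeed contains the whole dataset (150 samples).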
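The same comparison can be made without reading the plots. The sketch below inspects the root node of the first tree in each forest through the fitted `tree_` attribute. It assumes that `n_node_samples` is the number shown as samples, and that `weighted_n_node_samples` counts each row as many times as it was drawn; the helper `root_counts` and the `random_state=0` seed are illustrative choices, not part of the original example.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()

def root_counts(forest):
    """Return (samples, weighted samples) at the root of the first tree."""
    t = forest.estimators_[0].tree_
    return int(t.n_node_samples[0]), float(t.weighted_n_node_samples[0])

# Default forest: each tree is fit on a bootstrap sample
rf_boot = RandomForestClassifier(max_depth=3, random_state=0)
rf_boot.fit(iris.data, iris.target)

# Same forest with bootstrap sampling disabled
rf_full = RandomForestClassifier(max_depth=3, bootstrap=False, random_state=0)
rf_full.fit(iris.data, iris.target)

boot_samples, boot_weighted = root_counts(rf_boot)
full_samples, full_weighted = root_counts(rf_full)

print("bootstrap=True :", boot_samples, boot_weighted)
print("bootstrap=False:", full_samples, full_weighted)
```

With bootstrap sampling the two root numbers should differ, and without it both should equal the size of the dataset.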

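So, the behavior you describe does indeed seem to be due to bootstrap sampling. Because it draws samples with replacement, each individual decision tree of the ensemble ends up with duplicate samples. These duplicates are not reflected in the samples figure of the tree nodes, which shows the number of unique samples; they are, however, reflected in the node value.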
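To get a feel for the numbers, here is a minimal NumPy sketch of a single bootstrap draw, independent of scikit-learn. It assumes the bootstrap sample has the same size as the dataset (150 rows, as for iris); the seed and variable names are arbitrary illustrative choices.

```python
import numpy as np

n = 150  # size of the iris dataset
rng = np.random.default_rng(0)  # seed chosen arbitrarily

# One bootstrap sample: n row indices drawn with replacement
idx = rng.integers(0, n, size=n)

# How many times each row was drawn (0 means the row was left out)
draw_counts = np.bincount(idx, minlength=n)

n_unique = int(np.count_nonzero(draw_counts))  # distinct rows in the sample
n_weighted = int(draw_counts.sum())            # rows counted with duplicates

# Expected number of distinct rows when drawing n times from n rows
expected_unique = n * (1 - (1 - 1 / n) ** n)

print(n_unique, n_weighted, round(expected_unique, 1))
```

Counting duplicates always gives back 150, while the number of distinct rows is expected to be about 63% of the dataset, roughly 95 rows here, which is in the same range as the 89 reported for the root node above.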

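The situation is completely analogous to that of an RF regression model, as well as that of a Bagging classifier; for the regression case, see: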

sklearn RandomForestRegressor discrepancy in the displayed tree values

