Weka decision tree node count too high

Question

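I am trying to interpret the string representation of a Weka RandomTree. The training set has 1000 records (instances). Looking at the string, the instance counts in the leaves seem to add up to 1030. How is that possible? Am I misreading the string somehow?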

See the full run description below.

Note the following:

Total Number of Instances 1000

Meanwhile, collecting all the counts from the leaves: (10/0),(1/0),(354/0),(18/1),(37/0),(11/0),(9/4),(5/0),(7/3),(5/0),(20/0),(1/0),(2/0),(168/0),(1/0),(145/0),(61/3),(3/1),(5/0),(44/13),(8/0),(10/2),(63/0),(8/3),(4/0)

That adds up to 1030.

Thanks!

Here is the full run description:

=== Run information ===

Scheme:       weka.classifiers.trees.RandomTree -K 0 -M 1.0 -V 0.001 -S 1 -depth 5
Relation:     test-data
Instances:    1000
Attributes:   5
              feature1
              feature2
              feature3
              feature4
              class
Test mode:    evaluate on training data

=== Classifier model (full training set) ===


RandomTree
==========

feature2 < -0.27
|   feature2 < -0.61
|   |   feature3 < 1.09
|   |   |   feature2 < -2.41
|   |   |   |   feature2 < -2.45 : 0 (10/0)
|   |   |   |   feature2 >= -2.45 : 1 (1/0)
|   |   |   feature2 >= -2.41
|   |   |   |   feature2 < -0.7 : 0 (354/0)
|   |   |   |   feature2 >= -0.7 : 0 (18/1)
|   |   feature3 >= 1.09
|   |   |   feature2 < -0.94 : 0 (37/0)
|   |   |   feature2 >= -0.94
|   |   |   |   feature1 < -0.02 : 0 (11/0)
|   |   |   |   feature1 >= -0.02 : 0 (9/4)
|   feature2 >= -0.61
|   |   feature3 < -0.34
|   |   |   feature1 < 1.19 : 1 (5/0)
|   |   |   feature1 >= 1.19
|   |   |   |   feature2 < -0.39 : 0 (7/3)
|   |   |   |   feature2 >= -0.39 : 0 (5/0)
|   |   feature3 >= -0.34
|   |   |   feature2 < -0.32 : 0 (20/0)
|   |   |   feature2 >= -0.32
|   |   |   |   feature2 < -0.3 : 1 (1/0)
|   |   |   |   feature2 >= -0.3 : 0 (2/0)
feature2 >= -0.27
|   feature1 < 1.19
|   |   feature3 < -0.11 : 1 (168/0)
|   |   feature3 >= -0.11
|   |   |   feature3 < -0.1 : 0 (1/0)
|   |   |   feature3 >= -0.1
|   |   |   |   feature4 < 0.59 : 1 (145/0)
|   |   |   |   feature4 >= 0.59 : 1 (61/3)
|   feature1 >= 1.19
|   |   feature2 < 0.82
|   |   |   feature2 < -0.18
|   |   |   |   feature2 < -0.21 : 0 (3/1)
|   |   |   |   feature2 >= -0.21 : 0 (5/0)
|   |   |   feature2 >= -0.18
|   |   |   |   feature1 < 2.28 : 1 (44/13)
|   |   |   |   feature1 >= 2.28 : 0 (8/0)
|   |   feature2 >= 0.82
|   |   |   feature1 < 2.67
|   |   |   |   feature1 < 1.33 : 1 (10/2)
|   |   |   |   feature1 >= 1.33 : 1 (63/0)
|   |   |   feature1 >= 2.67
|   |   |   |   feature1 < 2.97 : 0 (8/3)
|   |   |   |   feature1 >= 2.97 : 1 (4/0)

Size of the tree : 49
Max depth of tree: 5

Time taken to build model: 0.05 seconds

=== Evaluation on training set ===

Time taken to test model on training data: 0.03 seconds

=== Summary ===

Correctly Classified Instances         970               97      %
Incorrectly Classified Instances        30                3      %
Kappa statistic                          0.94  
Mean absolute error                      0.0421
Root mean squared error                  0.145 
Relative absolute error                  8.4142 %
Root relative squared error             29.0073 %
Total Number of Instances             1000     

=== Detailed Accuracy By Class ===

                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                 0.964    0.024    0.976      0.964    0.970      0.940    0.997     0.996     0
                 0.976    0.036    0.964      0.976    0.970      0.940    0.997     0.995     1
Weighted Avg.    0.970    0.030    0.970      0.970    0.970      0.940    0.997     0.996     

=== Confusion Matrix ===

   a   b   <-- classified as
 486  18 |   a = 0
  12 484 |   b = 1

Answer 1

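You are misreading the numbers in parentheses. I think you are interpreting them as (correct instances / incorrect instances) at each node, but they actually mean (total instances / incorrect instances).

Every leaf node has a pair of numbers in parentheses. For example, the seventh leaf says: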

feature1 >= -0.02 : 0 (9/4)

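This means that 9 of the original instances reached this leaf. The 4 means that, of those 9 instances, 4 were misclassified. If you add up all the first numbers in the parentheses, they sum to 1000. The second numbers sum to 30. This matches the number of errors reported later in the output: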

Correctly Classified Instances         970               97      %
Incorrectly Classified Instances        30                3      %
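To double-check the arithmetic, here is a minimal Python sketch that sums both numbers across the leaves. The `LEAVES` string is copied from the question; the function name `sum_leaf_counts` is made up for illustration, and the pattern assumes integer counts as in this printout.

```python
import re

# Leaf annotations copied from the tree printout in the question.
LEAVES = (
    "(10/0),(1/0),(354/0),(18/1),(37/0),(11/0),(9/4),(5/0),(7/3),(5/0),"
    "(20/0),(1/0),(2/0),(168/0),(1/0),(145/0),(61/3),(3/1),(5/0),(44/13),"
    "(8/0),(10/2),(63/0),(8/3),(4/0)"
)


def sum_leaf_counts(text):
    """Return (instances reaching leaves, misclassified instances).

    Hypothetical helper: it reads every '(n/m)' pair as
    (total at leaf / misclassified at leaf) and assumes integer counts.
    """
    pairs = re.findall(r"\((\d+)/(\d+)\)", text)
    total = sum(int(n) for n, _ in pairs)
    errors = sum(int(m) for _, m in pairs)
    return total, errors


total, errors = sum_leaf_counts(LEAVES)
print(total, errors)  # prints: 1000 30
```

Adding the two columns together gives 1030, which is where the figure in the question comes from.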

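Note that the error counts agree only because you evaluated on the training data, as you did:

=== Evaluation on training set ===

Under cross-validation the numbers will differ, because the counts in the tree describe the training instances while the summary describes the held-out folds.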
