ID3 Implementation Clarification

Question

I am trying to implement the ID3 algorithm and am looking at the pseudocode:

(Source)

I am confused by one thing it says:

If examples_vi is empty, create a leaf node with label = most common value of TargetAttribute in Examples.

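Unless I am missing something, shouldn't this be the most common class?

That is, if we cannot split the data on an attribute value because no sample takes that value for the particular attribute, then we take the most common class among all samples and use that?

Also, isn't this just as good as picking a random class?

The training set tells us nothing about the relation between the attribute value and the class labels...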

Answer 1

1) Unless I am missing out on something, shouldn't this be the most common class?

You are right, and that is what the text says too. See the function description at the top:

Target_Attribute is the attribute whose value is to be predicted by the tree

So the value of Target_Attribute is the class/label.

2) That is, if we cannot split the data on an attribute value because no sample takes that value for the particular attribute, then we take the most common class among all samples and use that?

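Yes, but not among all samples in the whole dataset. It is among all samples in the current Examples. (The ID3 function is recursive, so the current Examples is actually the caller's Examples_vi.)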
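To make the recursion point concrete, here is a minimal sketch of ID3 in Python. It is not the pseudocode from the question: the names `id3`, `domains`, `most_common_class`, and the toy data are all hypothetical, and it assumes examples are dicts and that the full set of possible values per attribute is passed in (otherwise an empty branch could never occur).

```python
from collections import Counter
from math import log2


def most_common_class(examples, target):
    """Majority class among the examples that reached this node."""
    return Counter(ex[target] for ex in examples).most_common(1)[0][0]


def entropy(examples, target):
    counts = Counter(ex[target] for ex in examples)
    total = len(examples)
    return -sum((c / total) * log2(c / total) for c in counts.values())


def information_gain(examples, attr, target):
    total = len(examples)
    remainder = 0.0
    for value in {ex[attr] for ex in examples}:
        subset = [ex for ex in examples if ex[attr] == value]
        remainder += len(subset) / total * entropy(subset, target)
    return entropy(examples, target) - remainder


def id3(examples, target, attributes, domains):
    classes = {ex[target] for ex in examples}
    if len(classes) == 1:          # all examples agree: pure leaf
        return classes.pop()
    if not attributes:             # nothing left to split on
        return most_common_class(examples, target)
    best = max(attributes, key=lambda a: information_gain(examples, a, target))
    remaining = [a for a in attributes if a != best]
    tree = {best: {}}
    for value in domains[best]:    # iterate over ALL possible values
        examples_vi = [ex for ex in examples if ex[best] == value]
        if not examples_vi:
            # Empty branch: majority class of the CURRENT examples,
            # not of the whole training set.
            tree[best][value] = most_common_class(examples, target)
        else:
            tree[best][value] = id3(examples_vi, target, remaining, domains)
    return tree


# Hypothetical toy data: globally "pos" dominates, but under a == "y"
# the majority is "neg", and no example there has b == "r".
data = [
    {"a": "x", "b": "p", "cls": "pos"},
    {"a": "x", "b": "q", "cls": "pos"},
    {"a": "x", "b": "r", "cls": "pos"},
    {"a": "x", "b": "p", "cls": "pos"},
    {"a": "y", "b": "p", "cls": "neg"},
    {"a": "y", "b": "p", "cls": "neg"},
    {"a": "y", "b": "q", "cls": "pos"},
]
domains = {"a": ["x", "y"], "b": ["p", "q", "r"]}
tree = id3(data, "cls", ["a", "b"], domains)
print(tree)
```

In this toy run the empty branch `b == "r"` under `a == "y"` gets the label `neg`, the local majority, even though `pos` is the majority over the whole dataset.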

3) Also, isn't this just as good as picking a random class? The training set tells us nothing about the relation between the attribute value and the class labels...

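No, picking a random class (with an equal chance for each class) is not the same. Inputs usually have an unbalanced class distribution (many texts call this the prior distribution), so you might have 99% positive examples and only 1% negative. So whenever you really have no information to decide the outcome for some input, it makes sense to predict the most probable class, because that gives you the highest probability of being right. Note that this maximizes the classifier's accuracy on unseen data only under the assumption that the class distribution in the training data is the same as in the unseen data.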
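The arithmetic behind this can be checked with a small sketch. The function names and the 99%/1% prior are illustrative assumptions, and the calculation assumes the unseen data follows the same class distribution as the training data.

```python
from fractions import Fraction


def majority_accuracy(priors):
    """Expected accuracy when always predicting the most probable class."""
    return max(priors.values())


def uniform_random_accuracy(priors):
    """Expected accuracy when guessing each of the k classes with chance 1/k.

    P(correct) = sum over classes c of P(guess = c) * P(true = c)
               = (1/k) * sum of P(c) = 1/k
    """
    k = len(priors)
    return sum(p / k for p in priors.values())


# Hypothetical prior: 99% positive, 1% negative.
priors = {"positive": Fraction(99, 100), "negative": Fraction(1, 100)}

print(majority_accuracy(priors))        # 99/100
print(uniform_random_accuracy(priors))  # 1/2
```

With no other information, the majority guess is right 99% of the time here, while a uniform random guess is right only half the time.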

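The same reasoning explains the base case when Attributes is empty (see line 4 of the pseudocode text): whenever we have no information, we just report the most common class in the data at hand.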

Answer 2

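If you have not implemented the ID3 code yourself but still want to understand the processing details, I suggest reading this article: Building Decision Trees in Python. Here is the source code for the article: decision tree source code. The article comes with an example, or you can use the example from your book (replace the "data" file with one in the same format). You can debug it in Eclipse (using a few breakpoints) to inspect the attribute values while the algorithm runs. Go through it and you will understand ID3 better.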
