Why is ctree only returning a single terminal node in this case?

Question

Introduction

I'm learning the basics of AI. I created a .csv file with random data to test decision trees. I'm currently using R in a Jupyter Notebook.

Problem

Temperature, humidity and wind are the variables that determine whether you are allowed to fly.

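When I run ctree(vuelo~., data=vuelo.csv), the output is just a single node, whereas I was expecting a full tree over the variables (Temperatura, Humedad, Viento), like the one I worked out on paper.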

Snippet of the result

The data used is the following table:

   temperatura humedad viento vuelo
1          Hot    High   Weak    No
2          Hot    High Strong    No
3          Hot    High   Weak   Yes
4         Mild    High   Weak   Yes
5         Cool  Normal   Weak   Yes
6         Cool  Normal Strong    No
7         Cool  Normal Strong   Yes
8         Mild    High   Weak    No
9         Cool  Normal   Weak   Yes
10        Mild  Normal   Weak   Yes
11        Mild  Normal Strong   Yes
12        Mild    High Strong   Yes
13         Hot  Normal   Weak   Yes
14        Mild    High Strong    No

I'm not sure whether I'm missing something when importing the data, but this is what I did:

test <- read.csv("vuelo.csv")
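One thing worth checking at the import step is whether the columns arrive as factors, since tree functions generally expect categorical variables in that form and, from R 4.0.0 on, read.csv defaults to stringsAsFactors = FALSE. The following is a minimal sketch under stated assumptions: it reads a few rows from an inline string (csv_text is a made-up stand-in for vuelo.csv) so that it runs without the file. The dput() output further down already shows factors, so the import does not look like the culprit here.

```r
# Hypothetical stand-in for the first rows of vuelo.csv, so the sketch runs without the file.
csv_text <- "temperatura,humedad,viento,vuelo
Hot,High,Weak,No
Hot,High,Strong,No
Hot,High,Weak,Yes
Mild,High,Weak,Yes"

# Explicitly request factors for the text columns.
as_factor <- read.csv(text = csv_text, stringsAsFactors = TRUE)

# For contrast: the same import with plain character columns.
as_char <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Inspect the column classes of both imports.
print(sapply(as_factor, class))
print(sapply(as_char, class))
```

Running str(test) on the real data frame gives the same kind of check in one call.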

Notes

I'm using the "party" library in R (it includes examples from which I took some ideas).

Edit:

Here is the result of dput(), as requested:

structure(list(temperatura = structure(c(2L, 2L, 2L, 3L, 1L, 
1L, 1L, 3L, 1L, 3L, 3L, 3L, 2L, 3L), .Label = c("Cool", "Hot", 
"Mild"), class = "factor"), humedad = structure(c(1L, 1L, 1L, 
1L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 1L), .Label = c("High", 
"Normal"), class = "factor"), viento = structure(c(2L, 1L, 2L, 
2L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 1L), .Label = c("Strong", 
"Weak"), class = "factor"), vuelo = structure(c(1L, 1L, 2L, 2L, 
2L, 1L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 1L), .Label = c("No", "Yes"
), class = "factor")), class = "data.frame", row.names = c(NA, 
-14L))

Answer 1

Answer

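ctree only creates a split if it reaches statistical significance (see ?ctree for the underlying tests). In your case, none of the candidate splits does, so no splits are made.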
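To get a rough feel for why nothing reaches significance, one can test each predictor against vuelo on its own. The sketch below uses Fisher's exact test from base R as a stand-in; this is an assumption for illustration, since ctree itself uses permutation-based conditional inference tests, so the p-values will not match ctree's exactly. The data frame name vuelo_df is made up here, and the rows are copied from the table in the question.

```r
# Rebuild the 14-row data set from the question (vuelo_df is an illustrative name).
vuelo_df <- data.frame(
  temperatura = c("Hot", "Hot", "Hot", "Mild", "Cool", "Cool", "Cool",
                  "Mild", "Cool", "Mild", "Mild", "Mild", "Hot", "Mild"),
  humedad     = c("High", "High", "High", "High", "Normal", "Normal", "Normal",
                  "High", "Normal", "Normal", "Normal", "High", "Normal", "High"),
  viento      = c("Weak", "Strong", "Weak", "Weak", "Weak", "Strong", "Strong",
                  "Weak", "Weak", "Weak", "Strong", "Strong", "Weak", "Strong"),
  vuelo       = c("No", "No", "Yes", "Yes", "Yes", "No", "Yes",
                  "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"),
  stringsAsFactors = TRUE
)

# Fisher's exact test of each predictor against the response.
# Note: a stand-in for illustration, not the test ctree runs internally.
predictors <- setdiff(names(vuelo_df), "vuelo")
p_values <- sapply(predictors, function(v) {
  fisher.test(table(vuelo_df[[v]], vuelo_df$vuelo))$p.value
})

print(round(p_values, 3))
```

With only 14 rows, none of the three p-values falls below 0.05, which is consistent with ctree declining to split.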

You can force a full tree by messing with the controls (see ?ctree and ?ctree_control), for example like this:

ctree(vuelo~., data = vuelo.csv, 
      control = ctree_control(minbucket = 0, 
                               minsplit = 0,
                               testtype = "Teststatistic",
                               mincriterion = 0))

However, this makes no sense from a statistical point of view, and I strongly advise against it.

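A more appropriate solution is to include more observations in your dataset. Assuming there is an underlying association between temperature, humidity, wind and whether flying is allowed, you would find it with more observations.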
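The effect of sample size can be illustrated with a small sketch. It assumes, purely for illustration, that a larger sample would show the same proportions as the 14 rows in the question, and mimics that by repeating the humedad and vuelo columns ten times. Duplicating rows is not a substitute for collecting real data, and Fisher's exact test is again only a stand-in for the tests ctree uses.

```r
# humedad and vuelo columns copied from the question's table.
humedad <- c("High", "High", "High", "High", "Normal", "Normal", "Normal",
             "High", "Normal", "Normal", "Normal", "High", "Normal", "High")
vuelo   <- c("No", "No", "Yes", "Yes", "Yes", "No", "Yes",
             "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No")

# Association test on the original 14 observations.
p_original <- fisher.test(table(humedad, vuelo))$p.value

# Same proportions with 140 observations (artificial: rows repeated 10 times).
p_replicated <- fisher.test(table(rep(humedad, 10), rep(vuelo, 10)))$p.value

print(c(original = p_original, replicated = p_replicated))
```

The same proportions that are not significant with 14 rows become clearly significant with 140, which is the kind of evidence ctree needs before it makes a split.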

For completeness, if we call plot on the output, we get the tree with all the (not statistically significant) branches:

r

decision-tree