Differences between Gini, information gain, and sum of squared errors in rpart (R)

Question

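I wrote a short piece of R code to check how the splitting criteria work. The result was unexpected: all three criteria chose the same value to split on. Can someone explain this? Here is the code: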

set.seed(1)
y <- sample(c(1, 0), 10000, replace = TRUE)
x <- seq(1, 10000)
data <- data.frame(x, y)

library(rpart)
rpart(y~x,data = data,parms=list(split="gini"),method = "class",control = list(maxdepth = 1,cp=0.0001,minsplit=1))
rpart(y~x,data = data,parms=list(split="information"),method = "class",control = list(maxdepth = 1,cp=0.0001,minsplit=1))
rpart(y~x,data = data,method = "anova",control = list(maxdepth = 1,cp=0.0001,minsplit=1))

Answer 1

In my case, only the last rpart command produced a split:

> set.seed(1)
> y <- sample(c(1, 0), 1000, replace = T)
> x <- seq(1, 1000)
> data <- data.frame(x, y)
> library(rpart)

No split with split="gini":

> rpart(y~x,data = data,parms=list(split="gini"),method = "class",control = list(maxdepth = 1,cp=0.0001,minsplit=1))
n= 1000 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 1000 480 1 (0.4800000 0.5200000) *

No split with split="information":

> rpart(y~x,data = data,parms=list(split="information"),method = "class",control = list(maxdepth = 1,cp=0.0001,minsplit=1))
n= 1000 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 1000 480 1 (0.4800000 0.5200000) *

One split with method="anova":

> rpart(y~x,data = data,method = "anova",control = list(maxdepth = 1,cp=0.0001,minsplit=1))
n= 1000 

node), split, n, deviance, yval
      * denotes terminal node

1) root 1000 249.6000 0.5200000  
  2) x< 841.5 841 210.1831 0.5089180 *
  3) x>=841.5 159  38.7673 0.5786164 *
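
The numbers in the anova output can be reproduced by hand, which also hints at why the two "class" fits stayed at the root. The sketch below is base R only. The counts of 1s (520, 428, 92) are inferred from n times yval in the printed tree, and the helper names `node_deviance` and `misclass` are made up for illustration. Whether rpart compares cp against misclassification loss for method = "class" is an assumption here, not something the output above confirms.

```r
# Node deviance for a 0/1 response under method = "anova":
# sum((y - mean(y))^2) equals n * p * (1 - p), where p = mean(y).
node_deviance <- function(n, ones) {
  p <- ones / n
  n * p * (1 - p)
}

# Counts of 1s implied by the printed tree (n * yval):
# root 1000 * 0.52 = 520, left 841 * 0.5089180 = 428, right 159 * 0.5786164 = 92.
root  <- node_deviance(1000, 520)
left  <- node_deviance(841, 428)
right <- node_deviance(159, 92)

# Relative improvement of the anova split, to compare against cp = 0.0001.
anova_gain <- (root - left - right) / root

# Misclassification loss of a node = size of its minority class.
misclass <- function(n, ones) min(ones, n - ones)

# Relative improvement in misclassification loss for the same split.
class_gain <- (misclass(1000, 520) - misclass(841, 428) - misclass(159, 92)) /
  misclass(1000, 520)

print(c(root = root, left = left, right = right))
print(c(anova_gain = anova_gain, class_gain = class_gain))
```

The deviances match the printed tree (249.6, 210.1831, 38.7673), and the anova improvement is well above cp. For the same split, both children still predict class 1, so the misclassification loss stays at 480 and the improvement is zero. If cp is applied to that loss, no single split on this data would be kept by the "class" fits.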

As for why the split points can end up in the same place, here are a few points from the rpart documentation:

Gini vs. information impurity (page 6): "For the two class problem the measures differ only slightly, and will nearly always choose the same split point."
Gini measure vs. [ANalysis Of] VAriance (page 41): "...for the two class case the Gini splitting rule reduces to 2p(1 − p), which is the variance of a node."
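
The relationship between these measures can be written out explicitly. This is a short derivation under the assumption that p denotes the proportion of class 1 in a node and that y is coded 0/1. The symbols G, H, and Var are introduced here for illustration and are not the documentation's notation.

```latex
\begin{aligned}
G(p) &= 1 - p^2 - (1-p)^2 = 2p(1-p) \\
\operatorname{Var}(y) &= \frac{1}{n}\sum_{i=1}^{n}(y_i - p)^2
  = p(1-p)^2 + (1-p)\,p^2 = p(1-p) \\
G(p) &= 2\operatorname{Var}(y) \\
H(p) &= -p\log p - (1-p)\log(1-p)
\end{aligned}
```

Strictly, 2p(1 − p) is twice the variance of a 0/1 response, but a constant factor does not change which split minimizes the weighted child impurity. The entropy H has a different shape, although it is also symmetric with its maximum at p = 1/2.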

So in the two-class case, the different measures can produce similar split points.
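
To see this in practice without relying on rpart, the split search can be sketched in base R. This is a simplified exhaustive search over a single numeric predictor, not rpart's actual implementation: it ignores cp, minbucket, and priors, and the function names (`gini`, `entropy`, `variance`, `best_split`) are hypothetical.

```r
# Two-class impurity functions of p = proportion of 1s in a node.
gini <- function(p) 2 * p * (1 - p)
entropy <- function(p) {
  # Treat 0 * log(0) as 0 so that pure nodes have zero entropy.
  q <- c(p, 1 - p)
  q <- q[q > 0]
  -sum(q * log(q))
}
variance <- function(p) p * (1 - p)

# For y already sorted by x, return the index i that minimizes the weighted
# child impurity when the left child holds the first i observations.
best_split <- function(y, impurity) {
  n <- length(y)
  left_n <- seq_len(n - 1)
  left_ones <- cumsum(y)[-n]        # number of 1s in the left child
  right_n <- n - left_n
  right_ones <- sum(y) - left_ones  # number of 1s in the right child
  score <- vapply(left_n, function(i) {
    left_n[i] * impurity(left_ones[i] / left_n[i]) +
      right_n[i] * impurity(right_ones[i] / right_n[i])
  }, numeric(1))
  which.min(score)
}

set.seed(1)
# x = 1..1000 as in the answer, so y is already in x order.
y <- sample(c(1, 0), 1000, replace = TRUE)

splits <- c(
  gini     = best_split(y, gini),
  entropy  = best_split(y, entropy),
  variance = best_split(y, variance)
)
print(splits)
```

Because the Gini score is exactly twice the variance score at every candidate position, those two always agree in this sketch. The entropy criterion may or may not land on the same index for a given data set. The indices printed depend on the R version's random number generator, so they need not match the 841.5 cut shown in the answer.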
