Difference between Gini, information gain and sum of squared errors in rpart (R)

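I wrote a short piece of R code to check how the splitting criteria work. The result was unexpected: all three chose the same value to split on. Can someone explain this? Here is the code: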

set.seed(1)
y <- sample(c(1, 0), 10000, replace = TRUE)  # random 0/1 response, unrelated to x
x <- seq(1, 10000)
data <- data.frame(x, y)

library(rpart)
# Fit the same depth-1 tree under each splitting criterion
rpart(y ~ x, data = data, parms = list(split = "gini"), method = "class", control = list(maxdepth = 1, cp = 0.0001, minsplit = 1))
rpart(y ~ x, data = data, parms = list(split = "information"), method = "class", control = list(maxdepth = 1, cp = 0.0001, minsplit = 1))
rpart(y ~ x, data = data, method = "anova", control = list(maxdepth = 1, cp = 0.0001, minsplit = 1))

In my case, only the last rpart command produced a split:

> set.seed(1)
> y <- sample(c(1, 0), 1000, replace = T)
> x <- seq(1, 1000)
> data <- data.frame(x, y)
> library(rpart)

No split with split="gini":

> rpart(y~x,data = data,parms=list(split="gini"),method = "class",control = list(maxdepth = 1,cp=0.0001,minsplit=1))
n= 1000 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 1000 480 1 (0.4800000 0.5200000) *

No split with split="information":

> rpart(y~x,data = data,parms=list(split="information"),method = "class",control = list(maxdepth = 1,cp=0.0001,minsplit=1))
n= 1000 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 1000 480 1 (0.4800000 0.5200000) *

One split with method="anova":

> rpart(y~x,data = data,method = "anova",control = list(maxdepth = 1,cp=0.0001,minsplit=1))
n= 1000 

node), split, n, deviance, yval
      * denotes terminal node

1) root 1000 249.6000 0.5200000  
  2) x< 841.5 841 210.1831 0.5089180 *
  3) x>=841.5 159  38.7673 0.5786164 *

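As for why the split point can end up in the same place, here are two points from the rpart documentation:

  • Gini versus information impurity (page 6): "For the two class problem the measures differ only slightly, and will nearly always choose the same split point."
  • Gini measure versus [ANalysis Of] VAriance (page 41): "...for the two class case the Gini splitting rule reduces to 2p(1 − p), which is the variance of a node."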
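
The second point can be checked by hand. As a short derivation (the symbols n, p and y_i are introduced here for illustration and are not taken from the rpart documentation): consider a node with n observations, a 0/1 response y_i, and a fraction p of ones.

```latex
\begin{aligned}
\text{Gini} &= 1 - p^2 - (1-p)^2 = 2p(1-p), \\
\text{SSE}  &= \sum_{i=1}^{n} (y_i - p)^2
             = np(1-p)^2 + n(1-p)p^2
             = np(1-p), \\
n \cdot \text{Gini} &= 2 \cdot \text{SSE}.
\end{aligned}
```

The size-weighted Gini impurity of a node is therefore exactly twice its sum of squared errors, so for any candidate split the two quantities rank the children in the same way.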

So for a two-class problem, the different measures can be expected to produce the same or very similar split points.
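
To see this without going through rpart, here is a minimal sketch that scores every possible cut point by hand under all three criteria. It is not rpart's internal algorithm: it ignores cp, minbucket and the other controls, and the helper `entropy` and all variable names are made up for illustration. It assumes x is already sorted, as it is in the example above.

```r
set.seed(1)
y <- sample(c(1, 0), 1000, replace = TRUE)
n <- length(y)

# A cut after position k sends the first k rows to the left child.
k <- seq_len(n - 1)
left_ones <- cumsum(y)[k]
p_left    <- left_ones / k
p_right   <- (sum(y) - left_ones) / (n - k)

# Hypothetical helper: binary entropy, with 0 * log(0) taken to be 0.
entropy <- function(p) {
  h <- -(p * log(p) + (1 - p) * log(1 - p))
  h[p == 0 | p == 1] <- 0
  h
}

# Size-weighted impurity of the two children for each candidate cut.
gini_after <- k * 2 * p_left * (1 - p_left) + (n - k) * 2 * p_right * (1 - p_right)
info_after <- k * entropy(p_left) + (n - k) * entropy(p_right)
# For a 0/1 response, the within-child sum of squared errors is n_child * p * (1 - p).
sse_after  <- k * p_left * (1 - p_left) + (n - k) * p_right * (1 - p_right)

best_gini <- which.min(gini_after)
best_info <- which.min(info_after)
best_sse  <- which.min(sse_after)

# Position of the best cut under each criterion
print(c(gini = best_gini, information = best_info, anova = best_sse))
```

Because `gini_after` is exactly `2 * sse_after`, the Gini and anova cut points always coincide in this sketch. The entropy-based cut is not guaranteed to match, which is consistent with the "nearly always" wording in the documentation.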