Differences between Gini, information gain and sum of squared errors in rpart (R)
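I wrote a short piece of R code to check how the splitting criteria work. I got an unexpected result: all of them chose the same value to split on. Can someone explain this? Here is the code: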
set.seed(1)
y <- sample(c(1, 0), 10000, replace = TRUE)
x <- seq(1, 10000)
data <- data.frame(x, y)
library(rpart)
rpart(y~x,data = data,parms=list(split="gini"),method = "class",control = list(maxdepth = 1,cp=0.0001,minsplit=1))
rpart(y~x,data = data,parms=list(split="information"),method = "class",control = list(maxdepth = 1,cp=0.0001,minsplit=1))
rpart(y~x,data = data,method = "anova",control = list(maxdepth = 1,cp=0.0001,minsplit=1))
In my case (using 1000 observations), only the last rpart command actually split anything:
> set.seed(1)
> y <- sample(c(1, 0), 1000, replace = T)
> x <- seq(1, 1000)
> data <- data.frame(x, y)
> library(rpart)
No split with split="gini":
> rpart(y~x,data = data,parms=list(split="gini"),method = "class",control = list(maxdepth = 1,cp=0.0001,minsplit=1))
n= 1000
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 1000 480 1 (0.4800000 0.5200000) *
No split with split="information":
> rpart(y~x,data = data,parms=list(split="information"),method = "class",control = list(maxdepth = 1,cp=0.0001,minsplit=1))
n= 1000
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 1000 480 1 (0.4800000 0.5200000) *
One split with method="anova":
> rpart(y~x,data = data,method = "anova",control = list(maxdepth = 1,cp=0.0001,minsplit=1))
n= 1000
node), split, n, deviance, yval
* denotes terminal node
1) root 1000 249.6000 0.5200000
2) x< 841.5 841 210.1831 0.5089180 *
3) x>=841.5 159 38.7673 0.5786164 *
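As for why the split points can end up in the same place, here are a few points from the rpart documentation:
- Gini vs. information impurity (page 6): "For the two class problem the measures differ only slightly, and will nearly always choose the same split point."
- Gini measure vs. [ANalysis Of] VAriance (page 41): "...for the two class case the Gini splitting rule reduces to 2p(1 − p), which is the variance of a node."
So for a two-class problem, the different measures can be expected to produce similar, and often identical, split points.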
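To make the relationship between the three criteria concrete, here is a minimal base-R sketch that scores every possible cut on a small 0/1 response. It is hand-written for illustration and is not rpart's internal implementation (rpart additionally handles priors, losses and its cp/minbucket stopping rules); the helper names gini, entropy, sse_per_obs and split_scores are made up for this example.

```r
# Node impurity for a 0/1 response y (illustrative helpers, not rpart internals).
gini <- function(y) { p <- mean(y); 2 * p * (1 - p) }
entropy <- function(y) {
  p <- mean(y)
  if (p == 0 || p == 1) return(0)
  -(p * log(p) + (1 - p) * log(1 - p))
}
# Sum of squared errors divided by node size; equals p * (1 - p) for 0/1 y.
sse_per_obs <- function(y) mean((y - mean(y))^2)

# Size-weighted impurity of the two child nodes for every cut
# between consecutive values of x (assumes x has no ties).
split_scores <- function(x, y, impurity) {
  y <- y[order(x)]
  n <- length(y)
  sapply(seq_len(n - 1), function(i) {
    (i * impurity(y[1:i]) + (n - i) * impurity(y[(i + 1):n])) / n
  })
}

set.seed(1)
y <- sample(c(1, 0), 200, replace = TRUE)
x <- seq_along(y)

g <- split_scores(x, y, gini)
e <- split_scores(x, y, entropy)
v <- split_scores(x, y, sse_per_obs)

# For a 0/1 response the Gini score is twice the variance score at every cut,
# so both criteria rank the candidate cuts in the same order.
print(all.equal(g, 2 * v))

# Position of the best cut under each criterion; the entropy-based choice
# usually, but not necessarily, coincides with the other two.
print(c(gini = which.min(g), information = which.min(e), anova = which.min(v)))
```

Because the Gini and variance scores differ only by a constant factor here, they favour the same cut; whether rpart then actually keeps that split is a separate question governed by its control settings, which this sketch does not model.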