R: rpart tree grows using two explanatory variables, but not after removing less important variable
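Data: I'm using the "attrition" dataset from the rsample package.
Question: Using the attrition dataset and the rpart library, I can grow a tree with the formula "Attrition ~ OverTime + JobRole", where OverTime is selected as the first split. But when I try to grow the tree without the JobRole variable (i.e. "Attrition ~ OverTime"), the tree does not split and returns only the root node. This happens with both the rpart function and caret's train function with method = "rpart".
I'm confused by this, because I thought the CART algorithm implemented in rpart selects the best variable to split on in an iterative, greedy fashion, without "looking ahead" to see how the presence of other variables affects its choice of the best split. If the algorithm selects OverTime as a worthwhile first split when there are two explanatory variables, why doesn't it select OverTime as a worthwhile first split after the JobRole variable is removed?
I'm using R version 3.4.2 and RStudio version 1.1.442 on Windows 7.
Research: I found similar Stack Overflow questions here and here, but neither has a complete answer.
As far as I can tell, the rpart docs seem to say on page 5 that the rpart algorithm does not use "look-ahead" rules: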
One way around both of these problems is to use look-ahead rules; but these are computationally
very expensive. Instead rpart uses one of several measures of impurity, or
diversity, of a node.
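As a worked illustration of the impurity criterion the quote refers to, the sketch below computes the Gini impurity before and after the OverTime split. It assumes the Gini index (the default splitting rule for method = "class") and uses the node counts printed in the tree output further down; the helper `gini` and the variable names are made up for this example.

```r
# Gini impurity of a two-class node: 2 * p * (1 - p), where p is the share of "Yes".
gini <- function(n_yes, n) {
  p <- n_yes / n
  2 * p * (1 - p)
}

# Node sizes and Attrition = Yes counts, copied from the printed trees.
root   <- c(n = 1470, yes = 237)
ot_no  <- c(n = 1054, yes = 110)  # OverTime = No
ot_yes <- c(n = 416,  yes = 127)  # OverTime = Yes

gini_root <- gini(root[["yes"]], root[["n"]])

# Impurity after the split: child impurities weighted by child size.
gini_children <- (ot_no[["n"]]  * gini(ot_no[["yes"]],  ot_no[["n"]]) +
                  ot_yes[["n"]] * gini(ot_yes[["yes"]], ot_yes[["n"]])) / root[["n"]]

gini_decrease <- gini_root - gini_children
print(c(root = gini_root, children = gini_children, decrease = gini_decrease))
```

The decrease is positive, so by the impurity measure alone OverTime is a valid candidate split at the root, with or without JobRole in the formula.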
CODE: Here is a reprex. Any insights would be great - thanks!
suppressPackageStartupMessages(library(rsample))
#> Warning: package 'rsample' was built under R version 3.4.4
suppressPackageStartupMessages(library(rpart))
suppressPackageStartupMessages(library(caret))
suppressPackageStartupMessages(library(dplyr))
#> Warning: package 'dplyr' was built under R version 3.4.3
suppressPackageStartupMessages(library(purrr))
#################################################
# look at data
data(attrition)
attrition_subset <- attrition %>% select(Attrition, OverTime, JobRole)
attrition_subset %>% glimpse()
#> Observations: 1,470
#> Variables: 3
#> $ Attrition <fctr> Yes, No, Yes, No, No, No, No, No, No, No, No, No, N...
#> $ OverTime <fctr> Yes, No, Yes, Yes, No, No, Yes, No, No, No, No, Yes...
#> $ JobRole <fctr> Sales_Executive, Research_Scientist, Laboratory_Tec...
map_dfr(.x = attrition_subset, .f = ~ sum(is.na(.x)))
#> # A tibble: 1 x 3
#> Attrition OverTime JobRole
#> <int> <int> <int>
#> 1 0 0 0
#################################################
# with rpart
attrition_rpart_w_JobRole <- rpart(Attrition ~ OverTime + JobRole, data = attrition_subset, method = "class", cp = .01)
attrition_rpart_w_JobRole
#> n= 1470
#>
#> node), split, n, loss, yval, (yprob)
#> * denotes terminal node
#>
#> 1) root 1470 237 No (0.83877551 0.16122449)
#> 2) OverTime=No 1054 110 No (0.89563567 0.10436433) *
#> 3) OverTime=Yes 416 127 No (0.69471154 0.30528846)
#> 6) JobRole=Healthcare_Representative,Manager,Manufacturing_Director,Research_Director 126 11 No (0.91269841 0.08730159) *
#> 7) JobRole=Human_Resources,Laboratory_Technician,Research_Scientist,Sales_Executive,Sales_Representative 290 116 No (0.60000000 0.40000000)
#> 14) JobRole=Human_Resources,Research_Scientist,Sales_Executive 204 69 No (0.66176471 0.33823529) *
#> 15) JobRole=Laboratory_Technician,Sales_Representative 86 39 Yes (0.45348837 0.54651163) *
attrition_rpart_wo_JobRole <- rpart(Attrition ~ OverTime, data = attrition_subset, method = "class", cp = .01)
attrition_rpart_wo_JobRole
#> n= 1470
#>
#> node), split, n, loss, yval, (yprob)
#> * denotes terminal node
#>
#> 1) root 1470 237 No (0.8387755 0.1612245) *
#################################################
# with caret
attrition_caret_w_JobRole_non_dummies <- train(x = attrition_subset[ , -1], y = attrition_subset[ , 1], method = "rpart", tuneGrid = expand.grid(cp = .01))
attrition_caret_w_JobRole_non_dummies$finalModel
#> n= 1470
#>
#> node), split, n, loss, yval, (yprob)
#> * denotes terminal node
#>
#> 1) root 1470 237 No (0.83877551 0.16122449)
#> 2) OverTime=No 1054 110 No (0.89563567 0.10436433) *
#> 3) OverTime=Yes 416 127 No (0.69471154 0.30528846)
#> 6) JobRole=Healthcare_Representative,Manager,Manufacturing_Director,Research_Director 126 11 No (0.91269841 0.08730159) *
#> 7) JobRole=Human_Resources,Laboratory_Technician,Research_Scientist,Sales_Executive,Sales_Representative 290 116 No (0.60000000 0.40000000)
#> 14) JobRole=Human_Resources,Research_Scientist,Sales_Executive 204 69 No (0.66176471 0.33823529) *
#> 15) JobRole=Laboratory_Technician,Sales_Representative 86 39 Yes (0.45348837 0.54651163) *
attrition_caret_w_JobRole <- train(Attrition ~ OverTime + JobRole, data = attrition_subset, method = "rpart", tuneGrid = expand.grid(cp = .01))
attrition_caret_w_JobRole$finalModel
#> n= 1470
#>
#> node), split, n, loss, yval, (yprob)
#> * denotes terminal node
#>
#> 1) root 1470 237 No (0.8387755 0.1612245)
#> 2) OverTimeYes< 0.5 1054 110 No (0.8956357 0.1043643) *
#> 3) OverTimeYes>=0.5 416 127 No (0.6947115 0.3052885)
#> 6) JobRoleSales_Representative< 0.5 392 111 No (0.7168367 0.2831633) *
#> 7) JobRoleSales_Representative>=0.5 24 8 Yes (0.3333333 0.6666667) *
attrition_caret_wo_JobRole <- train(Attrition ~ OverTime, data = attrition_subset, method = "rpart", tuneGrid = expand.grid(cp = .01))
attrition_caret_wo_JobRole$finalModel
#> n= 1470
#>
#> node), split, n, loss, yval, (yprob)
#> * denotes terminal node
#>
#> 1) root 1470 237 No (0.8387755 0.1612245) *
This makes perfect sense. There is a lot of code above, so I will repeat the essential part.
library(rsample)
library(rpart)
data(attrition)
rpart(Attrition ~ OverTime + JobRole, data=attrition)
n= 1470
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 1470 237 No (0.83877551 0.16122449)
2) OverTime=No 1054 110 No (0.89563567 0.10436433) *
3) OverTime=Yes 416 127 No (0.69471154 0.30528846)
6) JobRole=Healthcare_Representative,Manager,Manufacturing_Director,Research_Director 126 11 No (0.91269841 0.08730159) *
7) JobRole=Human_Resources,Laboratory_Technician,Research_Scientist,Sales_Executive,Sales_Representative 290 116 No (0.60000000 0.40000000)
14) JobRole=Human_Resources,Research_Scientist,Sales_Executive 204 69 No (0.66176471 0.33823529) *
15) JobRole=Laboratory_Technician,Sales_Representative 86 39 Yes (0.45348837 0.54651163) *
rpart(Attrition ~ OverTime, data=attrition)
n= 1470
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 1470 237 No (0.8387755 0.1612245) *
Look at the first model (with two variables). Just below the root we have:
1) root 1470 237 No (0.83877551 0.16122449)
2) OverTime=No 1054 110 No (0.89563567 0.10436433) *
3) OverTime=Yes 416 127 No (0.69471154 0.30528846)
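The model goes on to split node 3 (OverTime=Yes), but only using JobRole. Since the second model has no JobRole, rpart cannot make those further splits. But notice that in both node 2 and node 3, Attrition=No is the majority class. At node 3, 69.5% of the instances are No and 30.5% are Yes. So for both nodes 2 and 3 we would predict No. Since the prediction is the same on both sides of the split, the split does not reduce the number of misclassified instances (110 + 127 = 237, the same as at the root), so it is unnecessary and is pruned away. Only the root node is needed to predict that all instances are No. In the two-variable model the OverTime split survives because the JobRole splits beneath it end in a node (node 15) that predicts Yes, which lowers the total loss across the leaves to 110 + 11 + 69 + 39 = 229.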
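The pruning argument can be checked with a small sketch. It does not load the real attrition data; instead it rebuilds a stand-in data frame (the name `toy` is hypothetical) with the same OverTime by Attrition counts as the printout above. It then compares the misclassification loss before and after the split and refits the tree with cp = -1, a commonly used setting that should stop rpart from discarding splits whose complexity falls below the threshold.

```r
library(rpart)

# Stand-in data with the same OverTime x Attrition counts as the printed tree
# (an assumption: the real attrition data is not loaded here).
counts <- data.frame(
  OverTime  = c("No", "No", "Yes", "Yes"),
  Attrition = c("No", "Yes", "No", "Yes"),
  n         = c(944, 110, 289, 127)
)
toy <- counts[rep(seq_len(nrow(counts)), counts$n), c("OverTime", "Attrition")]
toy$OverTime  <- factor(toy$OverTime)
toy$Attrition <- factor(toy$Attrition)

# Misclassified cases when each node predicts its majority class.
loss_root  <- min(table(toy$Attrition))
loss_split <- sum(tapply(toy$Attrition, toy$OverTime, function(a) min(table(a))))
print(c(loss_root = loss_root, loss_split = loss_split))

# Same call as in the question: the split gives no gain in loss and is dropped.
fit_default <- rpart(Attrition ~ OverTime, data = toy, method = "class", cp = 0.01)

# A negative cp keeps the split even though it does not reduce the loss.
fit_unpruned <- rpart(Attrition ~ OverTime, data = toy, method = "class", cp = -1)

print(fit_default)
print(fit_unpruned)
```

The split is still chosen greedily by impurity; it is the complexity threshold, measured in misclassification loss, that removes it afterwards. Whether keeping the split is useful depends on whether you want class probabilities per node or only the predicted class.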