Why is rpart more accurate than Caret rpart in R?
This post mentions that Caret rpart is more accurate than rpart because of bootstrapping and cross-validation:

However, when I compare the two approaches I get an accuracy of 0.4879 for Caret rpart and 0.7347 for rpart (I have copied the code below).

In addition, the classification tree produced by Caret rpart has only a few nodes (splits) compared with the rpart tree.

Does anyone understand these differences?

Thanks!
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Loading libraries and the data
This is an R Markdown document. First we load the libraries and the data, and split the training data into a training set and a test set.
```{r section1, echo=TRUE}
# load libraries
library(knitr)
library(caret)
suppressMessages(library(rattle))
library(rpart.plot)
# set the URL for the download
wwwTrain <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
wwwTest <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
# download the datasets
training <- read.csv(url(wwwTrain))
testing <- read.csv(url(wwwTest))
# set seed for reproducibility (before partitioning, so the split itself is reproducible)
set.seed(12345)
# create a partition with the training dataset
inTrain <- createDataPartition(training$classe, p=0.05, list=FALSE)
TrainSet <- training[inTrain, ]
TestSet <- training[-inTrain, ]
dim(TrainSet)
```
## Cleaning the data
```{r section2, echo=TRUE}
# remove variables with Nearly Zero Variance
NZV <- nearZeroVar(TrainSet)
TrainSet <- TrainSet[, -NZV]
TestSet <- TestSet[, -NZV]
dim(TrainSet)
dim(TestSet)
# remove variables that are mostly NA
AllNA <- sapply(TrainSet, function(x) mean(is.na(x))) > 0.95
TrainSet <- TrainSet[, AllNA==FALSE]
TestSet <- TestSet[, AllNA==FALSE]
dim(TrainSet)
dim(TestSet)
# remove identification only variables (columns 1 to 5)
TrainSet <- TrainSet[, -(1:5)]
TestSet <- TestSet[, -(1:5)]
dim(TrainSet)
```
## Prediction modelling
First we build a classification model using Caret with the rpart method:
```{r section4, echo=TRUE}
mod_rpart <- train(classe ~ ., method = "rpart", data = TrainSet)
pred_rpart <- predict(mod_rpart, TestSet)
confusionMatrix(pred_rpart, TestSet$classe)
mod_rpart$finalModel
fancyRpartPlot(mod_rpart$finalModel)
```
Second we build a similar model using rpart:
```{r section7, echo=TRUE}
# model fit
set.seed(12345)
modFitDecTree <- rpart(classe ~ ., data=TrainSet, method="class")
fancyRpartPlot(modFitDecTree)
# prediction on Test dataset
predictDecTree <- predict(modFitDecTree, newdata=TestSet, type="class")
confMatDecTree <- confusionMatrix(predictDecTree, TestSet$classe)
confMatDecTree
```
A simple explanation is that you did not tune either model, and it is purely coincidental that rpart performs better with the default settings. When you use the same parameters you should expect the same performance.
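As a quick sanity check (a minimal sketch, not part of the original answer, assuming the `TrainSet`/`TestSet` objects created in the chunks above), you can force caret to use rpart's default complexity parameter (cp = 0.01) with no resampling; the resulting test-set accuracy should then roughly match the plain rpart fit, up to minor differences in how caret grows and prunes the final tree:

```r
# Sketch: fit caret's rpart with a fixed cp = 0.01 (rpart's default),
# so both approaches use the same tuning parameter.
library(caret)
mod_fixed_cp <- train(classe ~ .,
                      method    = "rpart",
                      data      = TrainSet,
                      tuneGrid  = data.frame(cp = 0.01),          # rpart's default cp
                      trControl = trainControl(method = "none"))  # no resampling
# Test-set accuracy; compare with the plain rpart(classe ~ ., method = "class") fit
confusionMatrix(predict(mod_fixed_cp, TestSet), TestSet$classe)$overall["Accuracy"]
```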
Let's do some tuning with caret:
```r
set.seed(1)
mod_rpart <- train(classe ~ .,
                   method = "rpart",
                   data = TrainSet,
                   tuneLength = 50,
                   metric = "Accuracy",
                   trControl = trainControl(method = "repeatedcv",
                                            number = 4,
                                            repeats = 5,
                                            summaryFunction = multiClassSummary,
                                            classProbs = TRUE))
pred_rpart <- predict(mod_rpart, TestSet)
confusionMatrix(pred_rpart, TestSet$classe)
```
Output:

```
Confusion Matrix and Statistics

          Reference
Prediction    A    B    C    D    E
         A 4359  243   92  135   38
         B  446 2489  299  161  276
         C  118  346 2477  300   92
         D  190  377  128 2240  368
         E  188  152  254  219 2652

Overall Statistics

               Accuracy : 0.7628
                 95% CI : (0.7566, 0.7688)
    No Information Rate : 0.2844
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.7009
 Mcnemar's Test P-Value : < 2.2e-16

Statistics by Class:

                     Class: A Class: B Class: C Class: D Class: E
Sensitivity            0.8223   0.6900   0.7622   0.7332   0.7741
Specificity            0.9619   0.9214   0.9444   0.9318   0.9466
Pos Pred Value         0.8956   0.6780   0.7432   0.6782   0.7654
Neg Pred Value         0.9316   0.9253   0.9495   0.9469   0.9490
Prevalence             0.2844   0.1935   0.1744   0.1639   0.1838
Detection Rate         0.2339   0.1335   0.1329   0.1202   0.1423
Detection Prevalence   0.2611   0.1970   0.1788   0.1772   0.1859
Balanced Accuracy      0.8921   0.8057   0.8533   0.8325   0.8603
```
That is a bit better than rpart with the default settings (cp = 0.01).

What if we set cp to the optimal value selected by caret:
```r
modFitDecTree <- rpart(classe ~ .,
                       data = TrainSet,
                       method = "class",
                       control = rpart.control(cp = mod_rpart$bestTune$cp))
predictDecTree <- predict(modFitDecTree, newdata = TestSet, type = "class")
confusionMatrix(predictDecTree, TestSet$classe)
```
Part of the output:

```
Accuracy : 0.7628
```
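To see which cp value the repeated cross-validation actually picked, and how accuracy varied over the tuning grid (a small illustrative snippet, assuming the tuned `mod_rpart` object from above):

```r
# Sketch: inspect the tuning results of the caret model fitted above
mod_rpart$bestTune                              # cp with the best cross-validated accuracy
head(mod_rpart$results[, c("cp", "Accuracy")])  # accuracy across the cp grid
plot(mod_rpart)                                 # accuracy vs. cp for the 50 candidate values
```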