Grid Search to find the best parameters for decision tree classification
I have a dataset whose target variable is Target. I split the dataset into training and test sets and applied decision tree classification:
library(rpart)
classifier = rpart(formula = Target ~ ., data = training_set)
I want to apply grid search to find the best parameters, so I wrote:
library(caret)
classifier = train(form = Target ~ ., data = training_set, method = 'ctree')
and obtained:
> classifier
Conditional Inference Tree

8792 samples
   8 predictor
   2 classes: '0', '1'

No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 8792, 8792, 8792, 8792, 8792, 8792, ...
Resampling results across tuning parameters:

  mincriterion  Accuracy   Kappa
  0.01          0.8881768  0.4373290
  0.50          0.8936227  0.4350515
  0.99          0.8927400  0.4102918

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mincriterion = 0.5.
and
> classifier$bestTune
  mincriterion
2          0.5
Now, how do I use this value to improve my model?
set.seed(123)
classifier = train(form = Target ~ .,
                   data = training_set,
                   method = 'ctree',
                   tuneGrid = data.frame(mincriterion = seq(0.01, 0.99, length.out = 100)),
                   trControl = trainControl(method = "boot",
                                            summaryFunction = defaultSummary,
                                            verboseIter = TRUE))
I added a very wide range to your tuning grid, but since the best model had mincriterion = 0.5, you may want to restrict the range. You could also replace tuneGrid = data.frame(...) with tuneLength = 100, in which case caret automatically picks a grid of 100 values without you having to specify the mincriterion values yourself. You could also change the summary function from defaultSummary, which reports Accuracy and Kappa, to twoClassSummary, which gives you metrics such as sensitivity, specificity, and ROC; if you do use twoClassSummary, set classProbs = TRUE in trainControl(). You could also change the resampling method from boot to cv with any number of folds; have a look at ?trainControl. Finally, set a seed before model tuning for reproducibility.
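Putting those suggestions together, here is one possible sketch (assuming your training_set and its Target column as above). Note one caret gotcha: classProbs = TRUE requires factor levels that are valid R variable names, so levels '0' and '1' need to be relabeled first; the labels "no"/"yes" below are just an illustrative choice.

```r
library(caret)

# classProbs = TRUE needs syntactically valid level names,
# so relabel the '0'/'1' levels of the outcome first
training_set$Target <- factor(training_set$Target,
                              levels = c("0", "1"),
                              labels = c("no", "yes"))

set.seed(123)  # seed set right before train() for reproducibility
classifier <- train(Target ~ .,
                    data = training_set,
                    method = "ctree",
                    metric = "ROC",    # select the model by ROC, not Accuracy
                    tuneGrid = data.frame(mincriterion = seq(0.3, 0.7, by = 0.05)),
                    trControl = trainControl(method = "cv",
                                             number = 10,
                                             classProbs = TRUE,
                                             summaryFunction = twoClassSummary))

classifier$bestTune   # the winning mincriterion under 10-fold CV
```

Once you are happy with the tuning, you don't need a separate refit: train() already refits the final model on the full training set at bestTune, so predict(classifier, newdata = test_set) uses the tuned model directly.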