mlr: Tune model parameters with validation set
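I have just switched my machine learning workflow to mlr. I am wondering whether it is possible to tune hyperparameters using a separate validation set. From my minimal understanding, makeResampleDesc and makeResampleInstance only accept resampling from the training data.
My goal is to tune parameters with a validation set and test the final model with a test set. This is to prevent overfitting and knowledge leakage.
Here is what I have done code-wise: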
## Create training, validation and test tasks
train_task <- makeClassifTask(data = train_data, target = "y", positive = 1)
validation_task <- makeClassifTask(data = validation_data, target = "y")
test_task <- makeClassifTask(data = test_data, target = "y")
## Attempt to tune parameters with separate validation data
tuned_params <- tuneParams(
  task = train_task,
  resampling = makeResampleInstance("Holdout", task = validation_task),
  ...
)
From the error message, it looks like the evaluation is still trying to resample from the training set:
00001: Error in resample.fun(learner2, task, resampling, measures =
measures, : Size of data set: 19454 and resampling instance:
1666333 differ!
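Does anyone know what I should do? Am I setting everything up the right way?
[Update on 2019/03/27]
Following @jakob-r's comment, and finally understanding @LarsKotthoff's suggestion, here is what I did: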
## Create combined training data
train_task_data <- rbind(train_data, validation_data)
## Create learner, training task, etc.
xgb_learner <- makeLearner("classif.xgboost", predict.type = "prob")
train_task <- makeClassifTask(data = train_task_data, target = "y", positive = 1)
## Tune hyperparameters
tune_wrapper <- makeTuneWrapper(
  learner = xgb_learner,
  resampling = makeResampleDesc("Holdout"),
  measures = ...,
  par.set = ...,
  control = ...
)
model_xgb <- train(tune_wrapper, train_task)
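One thing to keep in mind with this variant: makeResampleDesc("Holdout") draws a random split of the combined rows (by default about two thirds go to training), so the rows of validation_data are not kept together as a dedicated validation set. The base R sketch below only imitates that behaviour conceptually and does not call mlr; the row counts and variable names are made up for illustration.

```r
set.seed(42)
n_train <- 60       # hypothetical number of rows in train_data
n_validation <- 30  # hypothetical number of rows in validation_data
size <- n_train + n_validation

# In rbind(train_data, validation_data) the training rows come first,
# so the original validation rows sit at the end
original_validation <- seq.int(n_train + 1, size)

# Conceptual random holdout: sample 2/3 of all rows for training
random_train <- sample(size, round(size * 2 / 3))
random_validation <- setdiff(seq_len(size), random_train)

# Original validation rows that were drawn into the random training part
mixed_in <- intersect(random_train, original_validation)
length(mixed_in) > 0
```

If the validation rows must stay separate during tuning, a fixed split over the combined data is the closer match to the original goal.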
Here is what I did based on @LarsKotthoff's comment. Suppose you have two separate datasets for training (train_data) and validation (validation_data):
## Create combined training data
train_task_data <- rbind(train_data, validation_data)
size <- nrow(train_task_data)
train_ind <- seq_len(nrow(train_data))
validation_ind <- seq.int(max(train_ind) + 1, size)
## Create training task
train_task <- makeClassifTask(data = train_task_data, target = "y", positive = 1)
## Tune hyperparameters
tuned_params <- tuneParams(
  task = train_task,
  resampling = makeFixedHoldoutInstance(train_ind, validation_ind, size),
  ...
)
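The index bookkeeping above relies on rbind() keeping the training rows first and the validation rows last. A small base R check with toy data can confirm that before tuning; the two data frames below are hypothetical stand-ins, not the real datasets.

```r
# Hypothetical toy data frames standing in for train_data and validation_data
train_data <- data.frame(x = 1:5, y = c(0, 1, 0, 1, 1))
validation_data <- data.frame(x = 6:8, y = c(1, 0, 1))

train_task_data <- rbind(train_data, validation_data)
size <- nrow(train_task_data)
train_ind <- seq_len(nrow(train_data))
validation_ind <- seq.int(max(train_ind) + 1, size)

# The two index vectors should be disjoint and cover every row
stopifnot(length(intersect(train_ind, validation_ind)) == 0)
stopifnot(setequal(c(train_ind, validation_ind), seq_len(size)))

# The validation indices should select exactly the validation rows
stopifnot(identical(train_task_data$x[validation_ind], validation_data$x))
```

Note that this construction assumes validation_data has at least one row; with zero rows, seq.int(size + 1, size) would count downward instead of returning an empty vector.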
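After optimizing the hyperparameter set, you can build the final model and test it against your test dataset.
Note: I had to install the latest development version from GitHub (as of 2018/08/06). The current CRAN version (2.12.1) throws an error when I call makeFixedHoldoutInstance(), i.e.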
Assertion on 'discrete.names' failed: Must be of type 'logical flag',
not 'NULL'.
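For completeness, here is a minimal end-to-end sketch of the whole procedure, including the final model and the test set evaluation. Everything specific in it is an assumption for illustration: the toy data from the hypothetical make_data() helper, the classif.rpart learner (chosen only because rpart ships with R), the small search grid, and AUC as the measure. It also assumes an mlr version in which makeFixedHoldoutInstance() works.

```r
library(mlr)  # assumes mlr and rpart are installed

set.seed(1)
# Hypothetical helper producing toy binary classification data
make_data <- function(n) {
  x1 <- rnorm(n)
  x2 <- rnorm(n)
  y <- factor(as.integer(x1 + x2 + rnorm(n, sd = 0.5) > 0), levels = c(0, 1))
  data.frame(x1 = x1, x2 = x2, y = y)
}
train_data <- make_data(200)
validation_data <- make_data(100)
test_data <- make_data(100)

# Combined data with fixed train/validation indices, as in the answer
train_task_data <- rbind(train_data, validation_data)
size <- nrow(train_task_data)
train_ind <- seq_len(nrow(train_data))
validation_ind <- seq.int(max(train_ind) + 1, size)

train_task <- makeClassifTask(data = train_task_data, target = "y", positive = "1")
learner <- makeLearner("classif.rpart", predict.type = "prob")

# Hypothetical search space, evaluated exhaustively by grid search
par_set <- makeParamSet(
  makeDiscreteParam("cp", values = c(0.001, 0.01, 0.1)),
  makeDiscreteParam("maxdepth", values = c(2L, 4L, 6L))
)

tuned_params <- tuneParams(
  learner = learner,
  task = train_task,
  resampling = makeFixedHoldoutInstance(train_ind, validation_ind, size),
  measures = auc,
  par.set = par_set,
  control = makeTuneControlGrid(),
  show.info = FALSE
)

# Final model with the tuned values, then evaluation on the untouched test set
final_learner <- setHyperPars(learner, par.vals = tuned_params$x)
final_model <- train(final_learner, train_task)
test_pred <- predict(final_model, newdata = test_data)
test_auc <- performance(test_pred, measures = auc)
```

Here the final model is fitted on the combined training and validation rows, which is one possible choice; passing subset = train_ind to train() would fit it on the training rows only.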