How can a blocking factor be included in makeClassifTask() from the mlr package?
In some classification tasks using the mlr package, I need to deal with a data.frame similar to this one:
set.seed(pi)
# Dummy data frame
df <- data.frame(
  # Repeated values ID
  ID = sort(sample(c(0:20), 100, replace = TRUE)),
  # Some variables
  X1 = runif(10, 1, 10),
  # Some Label
  Label = sample(c(0,1), 100, replace = TRUE)
)
df
I need to cross-validate the model while keeping observations with the same ID together, and I know from the tutorial that:
https://mlr-org.github.io/mlr-tutorial/release/html/task/index.html#further-settings
We could include a blocking factor in the task. This would indicate that some observations "belong together" and should not be separated when splitting the data into training and test sets for resampling.
The question is: how do I include this blocking factor in makeClassifTask?
Unfortunately, I couldn't find any example.
Which version of mlr do you have? Blocking should be part of it. You can find it directly as an argument of makeClassifTask.
Here is an example with your data:
library(mlr)
df$ID = as.factor(df$ID)
# Keep ID out of the feature set; it is only used as the blocking factor
df2 = df
df2$ID = NULL
df2$Label = as.factor(df$Label)
tsk = makeClassifTask(data = df2, target = "Label", blocking = df$ID)
res = resample("classif.rpart", tsk, resampling = cv10)
# Check that blocking worked: no ID should appear in both the training and the test set of a fold
lapply(1:10, function(i) {
  blocks.training = df$ID[res$pred$instance$train.inds[[i]]]
  blocks.testing = df$ID[res$pred$instance$test.inds[[i]]]
  intersect(blocks.testing, blocks.training)
})
# All entries are empty, so blocking indeed works!
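@jakob-r's answer no longer works. My guess is that something changed in cv10.
A small edit is needed: use "blocking.cv = TRUE" in the resampling description.
Full working example: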
library(mlr)
set.seed(pi)
# Dummy data frame
df <- data.frame(
  # Repeated values ID
  ID = sort(sample(c(0:20), 100, replace = TRUE)),
  # Some variables
  X1 = runif(10, 1, 10),
  # Some Label
  Label = sample(c(0,1), 100, replace = TRUE)
)
df
df$ID = as.factor(df$ID)
# Keep ID out of the feature set; it is only used as the blocking factor
df2 = df
df2$ID = NULL
df2$Label = as.factor(df$Label)
resDesc <- makeResampleDesc("CV", iters = 10, blocking.cv = TRUE)
tsk = makeClassifTask(data = df2, target = "Label", blocking = df$ID)
res = resample("classif.rpart", tsk, resampling = resDesc)
# Check that blocking worked: every intersection should be empty
lapply(1:10, function(i) {
  blocks.training = df$ID[res$pred$instance$train.inds[[i]]]
  blocks.testing = df$ID[res$pred$instance$test.inds[[i]]]
  intersect(blocks.testing, blocks.training)
})
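For intuition about what that intersect() check is verifying, here is a minimal base-R sketch of blocked fold assignment. It is only an illustration of the idea, not how mlr implements blocking internally; the names id, k, block_fold, row_fold and overlap are made up for this example, and k = 5 is an arbitrary choice.

```r
# Base-R sketch of blocked fold assignment (illustrative only, not mlr's internals)
set.seed(1)

# Hypothetical grouping variable: 100 rows spread over at most 21 blocks
id <- factor(sort(sample(0:20, 100, replace = TRUE)))

k <- 5                      # number of folds (assumed for this sketch)
blocks <- levels(id)        # the distinct blocks present in the data

# Assign every block, not every row, to one of the k folds
block_fold <- sample(rep_len(seq_len(k), length(blocks)))
names(block_fold) <- blocks

# Each row inherits the fold of the block it belongs to
row_fold <- unname(block_fold[as.character(id)])

# For each fold, list the blocks that occur in both its test rows and its training rows
overlap <- lapply(seq_len(k), function(i) {
  intersect(as.character(id[row_fold == i]), as.character(id[row_fold != i]))
})

# With block-wise assignment no block is ever split across test and training rows
stopifnot(all(lengths(overlap) == 0))
```

Because folds are drawn per block, fold sizes in rows can be uneven when blocks differ in size; that is the price of never separating observations that belong together.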