How can a blocking factor be included in makeClassifTask() from the mlr package?

In some classification tasks using the mlr package, I need to deal with a data.frame similar to this one:

set.seed(pi)
# Dummy data frame
df <- data.frame(
   # Repeated values ID
   ID = sort(sample(c(0:20), 100, replace = TRUE)),
   # Some variables
   X1 = runif(10, 1, 10),
   # Some Label
   Label = sample(c(0,1), 100, replace = TRUE)
   )
df 

I need to cross-validate the model while keeping observations with the same ID together. I know from the tutorial:

https://mlr-org.github.io/mlr-tutorial/release/html/task/index.html#further-settings

We could include a blocking factor in the task. This would indicate that some observations "belong together" and should not be separated when splitting the data into training and test sets for resampling.

The question is: how can this blocking factor be included in makeClassifTask?

Unfortunately, I couldn't find any example.

Which version of mlr do you have? Blocking should be part of it. You can find it directly as an argument of makeClassifTask.

Here is an example for your data:

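library(mlr)

# Keep the ID as a factor; it is passed as the blocking factor, not as a feature
df$ID = as.factor(df$ID)
df2 = df
df2$ID = NULL
# Classification tasks need a factor target
df2$Label = as.factor(df$Label)
tsk = makeClassifTask(data = df2, target = "Label", blocking = df$ID)
res = resample("classif.rpart", tsk, resampling = cv10)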

# Check that blocking worked: no ID may appear in both the training and test set of a fold
lapply(1:10, function(i) {
  blocks.training = df$ID[res$pred$instance$train.inds[[i]]]
  blocks.testing = df$ID[res$pred$instance$test.inds[[i]]]
  intersect(blocks.testing, blocks.training)
})
# All entries are empty, so blocking indeed works!

The answer by @jakob-r no longer works. My guess is that something changed in cv10.

A small edit is needed: create the resample description yourself and set "blocking.cv = TRUE".

Complete working example:

set.seed(pi)
# Dummy data frame
df <- data.frame(
   # Repeated values ID
   ID = sort(sample(c(0:20), 100, replace = TRUE)),
   # Some variables
   X1 = runif(10, 1, 10),
   # Some Label
   Label = sample(c(0,1), 100, replace = TRUE)
   )
df 

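library(mlr)

# Keep the ID as a factor; it is passed as the blocking factor, not as a feature
df$ID = as.factor(df$ID)
df2 = df
df2$ID = NULL
# Classification tasks need a factor target
df2$Label = as.factor(df$Label)
# Build the CV description explicitly so that the blocking factor is respected
resDesc <- makeResampleDesc("CV", iters = 10, blocking.cv = TRUE)
tsk = makeClassifTask(data = df2, target = "Label", blocking = df$ID)
res = resample("classif.rpart", tsk, resampling = resDesc)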

# Check that blocking worked: every intersection should be empty
lapply(1:10, function(i) {
  blocks.training = df$ID[res$pred$instance$train.inds[[i]]]
  blocks.testing = df$ID[res$pred$instance$test.inds[[i]]]
  intersect(blocks.testing, blocks.training)
})
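
For intuition about what the blocking factor is supposed to guarantee, here is a minimal base R sketch that does not use mlr at all. It deals whole IDs out to folds and then runs the same kind of intersection check as above. The helper names make_blocked_folds and blocks_are_separated are hypothetical, made up for this illustration, and this is not how mlr implements blocking internally.

```r
# Base R sketch (no mlr needed): assign whole blocks to folds, then verify.
set.seed(1)

# Toy grouping variable with repeated values, mirroring the ID column above
id <- factor(sort(sample(0:20, 100, replace = TRUE)))

# Hypothetical helper: put every level of `block` into exactly one of k folds
make_blocked_folds <- function(block, k) {
  lev <- levels(droplevels(block))
  # Shuffle the block levels, then deal them out to the folds round-robin
  fold_of_level <- setNames(rep_len(seq_len(k), length(lev)), sample(lev))
  # Each observation inherits the fold of its block
  test_inds <- split(seq_along(block), fold_of_level[as.character(block)])
  lapply(test_inds, function(test) {
    list(train = setdiff(seq_along(block), test), test = test)
  })
}

# Hypothetical helper: TRUE if no block appears on both sides of any split
blocks_are_separated <- function(block, folds) {
  all(vapply(folds, function(f) {
    shared <- intersect(as.character(block[f$train]),
                        as.character(block[f$test]))
    length(shared) == 0
  }, logical(1)))
}

folds <- make_blocked_folds(id, k = 10)
blocks_are_separated(id, folds)
```

Because each ID is assigned to exactly one fold, the final call should return TRUE. The same check can be pointed at the train.inds and test.inds of a resample result to confirm that a given mlr version respects the blocking factor.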