How can I use a pipeline graph with upsampling when my task is ordered?

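I have a task in which the observations (rows) have a date order. I generated a custom resampling scheme that respects this order in all train/test splits.

I also want to address class imbalance by upsampling the minority class. Within the training set, the time order does not matter (the learner would not use it anyway).

Now I want to resample this combination of ordered task, graph learner (including upsampling), and time-sensitive custom resampling scheme. But this is problematic.

To show this, I produced the following code. I use a sample task to make it reproducible, and I augment this task with a date column to get an ordered task similar to my problem. The code runs only if I leave out the problematic lines marked in the code. But those lines create exactly what I have in my real-world problem: an order. So how can I solve this?

(For readability, I have omitted some of the output in the following reprex.)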

library(mlr3verse)
#> Warning: package 'mlr3verse' was built under R version 4.1.1
#> Loading required package: mlr3

library(tidyverse)

library(lubridate)


# load sample task 

task <- tsk("breast_cancer")


#### start of lines that generate a problem

# add a date column to produce an artificial sample problem with time order of rows specified by a date column
DateColumn <- seq(ymd('2000-04-07'),ymd('2021-03-22'), by = '1 week')
DateColumn <- DateColumn[1:task$nrow]
task$cbind(data.frame(Date = DateColumn[1:task$nrow])) # add date column
task$set_col_roles("Date", roles = "order")

#### end of lines that generate a problem


# Generate a "loo" growing window type resampling scheme, where learner is trained on "earlier" and tested on "later" data  (hopefully - or may it be that the original row order is not preserved?)
# first training window size is 10 weeks

length_first_window <- 10

resampling_grow_win = rsmp("custom")

train_sets = list(1:length_first_window)    
test_sets = list(length_first_window+1)        

for (testweek in ((length_first_window+2):task$nrow)) {
  
  train_sets <- append(train_sets, list(c(1:(testweek-1))))
  test_sets <- append(test_sets, list(c(testweek)))
  
}

resampling_grow_win$instantiate(task, train_sets, test_sets)
resampling_grow_win$id <- paste0("gw_for", task$id)



# now, I define a pipeline for a learner with preceding upsampling:

# oversample or undersample such that the number of cases is equal
# I assume here that the actual number is not really important and 
# use approximately the size of the total sample (which will be much larger than the sample size
# for the first growing window resamplings, but I am oversampling anyway)


NrSamplesEach <- 500

po_classbalance = po("classbalancing",
  id = "sample2equal", adjust = "all", shuffle = FALSE, ratio = NrSamplesEach, reference="one")


#create a learner
learner = lrn("classif.ranger", num.trees = 10)


# combine learner with pipeline graph
learner_balanced = as_learner(po_classbalance %>>% learner)

# setup benchmark
rr = resample(task, learner_balanced, resampling_grow_win, store_models = TRUE) 
#> INFO  [16:47:10.484] [mlr3] Applying learner 'sample2equal.classif.ranger' on task 'breast_cancer' (iter 111/673)
#> Error: Cannot rbind data to task 'breast_cancer', missing the following mandatory columns: Date
#> This happened PipeOp sample2equal's $train()

# show some results (I am aware that there is only one element in the test set in each iteration, but this is ok for this example)

scored_result <- rr$score(msr("classif.acc"))
#> Error in eval(expr, envir, enclos): object 'rr' not found
head(scored_result)
#> Error in head(scored_result): object 'scored_result' not found

Created on 2021-12-09 by the reprex package (v2.0.1)
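
For reference, the growing-window splits built by the loop in the reprex can also be generated without a loop. This is only a sketch in base R, assuming the rows are already in chronological order; `n_rows` is a made-up stand-in for `task$nrow`, and nothing here depends on mlr3.

```r
# Hypothetical size: n_rows stands in for task$nrow
n_rows <- 25
length_first_window <- 10

# Each iteration tests on one row and trains on all rows before it
test_rows <- seq(length_first_window + 1, n_rows)
test_sets <- as.list(test_rows)
train_sets <- lapply(test_rows, function(i) seq_len(i - 1))

# Sanity check: no training row is later than the corresponding test row
leak_free <- all(mapply(function(tr, te) max(tr) < te, train_sets, test_sets))
print(length(train_sets))
print(leak_free)
```

The resulting lists have the same shape as the ones passed to `resampling_grow_win$instantiate(task, train_sets, test_sets)` in the question.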

You can upsample the dataset first and then create the custom resampling splits.

library(mlr3)
library(mlr3pipelines) # provides po()
library(mlr3misc)
library(lubridate)

task = tsk("breast_cancer")

# set date column
DateColumn = seq(ymd('2000-04-07'),ymd('2021-03-22'), by = '1 week')
DateColumn = DateColumn[1:task$nrow]
task$cbind(data.frame(Date = DateColumn[1:task$nrow]))

# upsample task
# ratio = 500 corresponds to NrSamplesEach in the question
po_classbalance = po("classbalancing", id = "sample2equal", adjust = "all", shuffle = FALSE, ratio = 500, reference = "one")
task = po_classbalance$train(list(task))[[1]]

# add helper column to indicate position in unordered data table
task$cbind(data.frame(i = 1:task$nrow))

# set order
task$set_col_roles("Date", roles = "order")

# custom resampling
train_sets = map(seq(10, task$nrow - 1), seq)
test_sets = as.list(seq(11, task$nrow))

# map to position of unordered data table
data_ordered = task$data(ordered = TRUE)
train_sets = map(train_sets, function(x) data_ordered$i[x])
test_sets = map(test_sets, function(x) data_ordered$i[x])

# remove helper column
task$select(setdiff(task$feature_names, "i"))

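# instantiate the custom resampling with the mapped row positions
resampling_grow_win = rsmp("custom")
resampling_grow_win$instantiate(task, train_sets, test_sets)

learner = lrn("classif.rpart")
rr = resample(task, learner, resampling_grow_win, store_models = TRUE)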
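
The helper-column step above maps positions in the date-ordered view back to row positions in the unordered table. The same mapping can be seen in isolation with plain vectors. This is only a sketch in base R with made-up dates, using `order()` as a stand-in for the ordered task data.

```r
# Made-up dates stored in non-chronological row order
dates <- as.Date(c("2000-04-21", "2000-04-07", "2000-04-28", "2000-04-14"))

# ord[k] is the row position of the k-th earliest observation
ord <- order(dates)

# A split defined on chronological positions: train on the first 3, test on the 4th
train_pos <- 1:3
test_pos <- 4

# Translate chronological positions into row positions of the unordered table
train_rows <- ord[train_pos]
test_rows <- ord[test_pos]

print(train_rows)
print(test_rows)
```

Every date selected by `train_rows` is earlier than the date selected by `test_rows`, which is the property the custom resampling relies on.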

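Yes, sorry, I got that wrong. We need to fix the pipeop. However, you can sort the data first and skip the task$set_col_roles("Date", roles = "order") part. To be safe, check that task$data(rows = ...) returns your data in chronological order, e.g. task$data(1) returns the first time point.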

library(mlr3)
library(mlr3pipelines)
library(data.table)
library(mlr3misc)

task = tsk("breast_cancer")
learner = lrn("classif.rpart")
resampling = rsmp("holdout")

# extract data
data = task$data()

# fake date column
date = sample(seq(task$nrow))
data[, date := date]

# order data in chronological order
setorder(data, date)

# remove date column
data[, date := NULL]

# create task with ordered data
task = as_task_classif(data, target = "class")

# set custom resampling
train_sets = map(seq(10, task$nrow - 1), seq)
test_sets = as.list(seq(11, task$nrow))
resampling = rsmp("custom")
resampling$instantiate(task, train_sets, test_sets)

# learner with upsampling
po_classbalance = po("classbalancing", id = "sample2equal", adjust = "all", shuffle = FALSE, ratio = 500, reference="one")
learner_balanced = as_learner(po_classbalance %>>% learner)

# resample
rr = resample(task, learner_balanced, resampling, store_models = TRUE)
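
The check recommended above (that row 1 is the earliest time point) can be automated before the date column is dropped. A minimal sketch in base R, with a made-up data frame standing in for the real data:

```r
# Made-up data: three observations stored out of chronological order
df <- data.frame(
  date = as.Date(c("2021-03-15", "2021-03-01", "2021-03-08")),
  x = c(3, 1, 2)
)

# Sort chronologically and reset row names so row k is the k-th time point
df_sorted <- df[order(df$date), ]
rownames(df_sorted) <- NULL

# is.unsorted() returns FALSE when the dates are non-decreasing
is_chronological <- !is.unsorted(as.numeric(df_sorted$date))
print(is_chronological)

# Only now drop the date column, as in the answer above
df_sorted$date <- NULL
```

Running the check while the date column is still present avoids relying on the row order after the column has been removed.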