How can I use a pipeline graph with upsampling when my task is ordered?
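I have a task in which the observations (rows) have a date order. I generated a custom resampling scheme that respects this order in all train/test splits.
I would also like to address class imbalance by upsampling the minority class. Within a training set, the time order does not matter (the learner does not use it anyway).
Now I want to resample this combination of ordered task, graph learner (including upsampling), and time-aware custom resampling scheme. But this is problematic.
To demonstrate this, I produced the following code. I use a sample task to make it reproducible, and augment it with a date column to produce an ordered task similar to my problem. The code runs only if I omit the lines marked as problematic in the code. But those lines produce exactly what I have in my real-world problem: an ordered task. So how can I solve this?
(I have omitted some output in the following reprex for readability.)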
library(mlr3verse)
#> Warning: package 'mlr3verse' was built under R version 4.1.1
#> Loading required package: mlr3
library(tidyverse)
library(lubridate)
# load sample task
task <- tsk("breast_cancer")
#### start of lines that generate a problem
# add a date column to produce an artificial sample problem with time order of rows specified by a date column
DateColumn <- seq(ymd('2000-04-07'),ymd('2021-03-22'), by = '1 week')
DateColumn <- DateColumn[1:task$nrow]
task$cbind(data.frame(Date = DateColumn[1:task$nrow])) # add date column
task$set_col_roles("Date", roles = "order")
#### end of lines that generate a problem
# Generate a "loo" growing window type resampling scheme, where learner is trained on "earlier" and tested on "later" data (hopefully - or may it be that the original row order is not preserved?)
# first training window size is 10 weeks
length_first_window <- 10
resampling_grow_win = rsmp("custom")
train_sets = list(1:length_first_window)
test_sets = list(length_first_window+1)
for (testweek in ((length_first_window+2):task$nrow)) {
train_sets <- append(train_sets, list(c(1:(testweek-1))))
test_sets <- append(test_sets, list(c(testweek)))
}
resampling_grow_win$instantiate(task, train_sets, test_sets)
resampling_grow_win$id <- paste0("gw_for", task$id)
# now, I define a pipeline for a learner with preceding upsampling:
# oversample or undersample such that the number of cases is equal
# I assume here that the actual number is not really important and
# use approximately the size of the total sample (which will be much larger than the sample size
# for the first growing window resamplings, but I am oversampling anyway)
NrSamplesEach <- 500
po_classbalance = po("classbalancing",
id = "sample2equal", adjust = "all", shuffle = FALSE, ratio = NrSamplesEach, reference="one")
#create a learner
learner = lrn("classif.ranger", num.trees = 10)
# combine learner with pipeline graph
learner_balanced = as_learner(po_classbalance %>>% learner)
# setup benchmark
rr = resample(task, learner_balanced, resampling_grow_win, store_models = TRUE)
#> INFO [16:47:10.484] [mlr3] Applying learner 'sample2equal.classif.ranger' on task 'breast_cancer' (iter 111/673)
#> Error: Cannot rbind data to task 'breast_cancer', missing the following mandatory columns: Date
#> This happened PipeOp sample2equal's $train()
# show some results (I am aware that there is only one element in the test set in each iteration, but this is ok for this example)
scored_result <- rr$score(msr("classif.acc"))
#> Error in eval(expr, envir, enclos): object 'rr' not found
head(scored_result)
#> Error in head(scored_result): object 'scored_result' not found
Created on 2021-12-09 by the reprex package (v2.0.1)
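To make the intent of the question concrete, here is a minimal base R sketch of upsampling that is applied only inside one training window, so the later test row is never touched. This is an illustration under assumptions, not how `po("classbalancing")` is implemented: the toy labels, the `upsample()` helper and the index vectors are all hypothetical.

```r
set.seed(1)

# Hypothetical toy labels, already in chronological order
y <- factor(c("a", "a", "a", "b", "a", "a", "b", "a", "a", "a", "a", "b"))
train_idx <- 1:10  # earlier rows
test_idx <- 11     # later row

# Hypothetical helper: upsample every class in the training window
# to the size of the largest class by drawing rows with replacement
upsample <- function(idx, labels) {
  by_class <- split(idx, labels[idx], drop = TRUE)
  target <- max(lengths(by_class))
  unlist(lapply(by_class, function(rows) {
    # index via sample.int() to avoid the sample(x) pitfall for length-1 x
    extra <- rows[sample.int(length(rows), target - length(rows), replace = TRUE)]
    c(rows, extra)
  }), use.names = FALSE)
}

balanced_idx <- upsample(train_idx, y)
table(y[balanced_idx])
```

Because only rows from `train_idx` are duplicated, the chronological separation between training and test data is preserved.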
You can upsample the dataset first and then create the custom resampling splits.
library(mlr3)
library(mlr3pipelines) # provides po("classbalancing")
library(mlr3misc)
library(lubridate)
task = tsk("breast_cancer")
# set date column
DateColumn = seq(ymd('2000-04-07'), ymd('2021-03-22'), by = '1 week')
DateColumn = DateColumn[1:task$nrow]
task$cbind(data.frame(Date = DateColumn[1:task$nrow]))
# upsample task
NrSamplesEach = 500 # same value as in the question
po_classbalance = po("classbalancing", id = "sample2equal", adjust = "all", shuffle = FALSE, ratio = NrSamplesEach, reference = "one")
task = po_classbalance$train(list(task))[[1]]
# add helper column to indicate position in unordered data table
task$cbind(data.frame(i = 1:task$nrow))
# set order
task$set_col_roles("Date", roles = "order")
# custom resampling: train on the first k observations, test on observation k + 1
train_sets = map(seq(10, task$nrow - 1), seq)
test_sets = as.list(seq(11, task$nrow))
# map to position of unordered data table
data_ordered = task$data(order = TRUE)
train_sets = map(train_sets, function(x) data_ordered$i[x])
test_sets = map(test_sets, function(x) data_ordered$i[x])
# remove helper column
task$select(setdiff(task$feature_names, "i"))
# instantiate the custom resampling with the mapped sets
resampling_grow_win = rsmp("custom")
resampling_grow_win$instantiate(task, train_sets, test_sets)
learner = lrn("classif.rpart")
rr = resample(task, learner, resampling_grow_win, store_models = TRUE)
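The step "map to position of unordered data table" can be illustrated in isolation. The following is a small base R sketch with hypothetical toy dates; `chrono_to_row` is an illustrative name and plays the role of `data_ordered$i` above.

```r
# Hypothetical toy data: rows are stored in arbitrary (non-chronological) order
dates <- as.Date(c("2000-03-01", "2000-01-01", "2000-04-01", "2000-02-01"))
row_ids <- seq_along(dates)

# chrono_to_row[k] is the row id of the k-th observation in time
chrono_to_row <- row_ids[order(dates)]

# A split defined on chronological positions:
# train on the first three time points, test on the fourth
train_rows <- chrono_to_row[1:3]
test_rows <- chrono_to_row[4]

# every training date lies before the test date
stopifnot(all(dates[train_rows] < dates[test_rows]))
```

The splits are defined on chronological positions and then translated into the row ids that the resampling actually needs.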
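Yes, sorry, I was mistaken. We need to fix the PipeOp. In the meantime, you can sort the data first and skip the task$set_col_roles("Date", roles = "order") part. To be safe, check that task$data(rows) returns your data in chronological order, e.g. that task$data(1) returns the first time point.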
library(mlr3)
library(mlr3pipelines)
library(data.table)
library(mlr3misc)
task = tsk("breast_cancer")
learner = lrn("classif.rpart")
resampling = rsmp("holdout")
# extract data
data = task$data()
# fake date column
date = sample(seq(task$nrow))
data[, date := date]
# order data in chronological order
setorder(data, date)
# remove date column
data[, date := NULL]
# create task with ordered data
task = as_task_classif(data, target = "class")
# set custom resampling
train_sets = map(seq(10, task$nrow - 1), seq)
test_sets = as.list(seq(11, task$nrow))
resampling = rsmp("custom")
resampling$instantiate(task, train_sets, test_sets)
# learner with upsampling
po_classbalance = po("classbalancing", id = "sample2equal", adjust = "all", shuffle = FALSE, ratio = 500, reference="one")
learner_balanced = as_learner(po_classbalance %>>% learner)
# resample
rr = resample(task, learner_balanced, resampling, store_models = TRUE)
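The growing-window sets used above rely on `seq(k)` expanding to `1:k`. Below is a minimal base R sketch of that construction together with a sanity check that every training row precedes its test row. It assumes the rows are already sorted by date; `n` and `first_window` are hypothetical values chosen for illustration, and `lapply` stands in for `mlr3misc::map`.

```r
n <- 15            # hypothetical number of rows, already sorted by date
first_window <- 10 # size of the first training window

# seq(k) expands to 1:k, so this yields 1:10, 1:11, ..., 1:(n - 1)
train_sets <- lapply(seq(first_window, n - 1), seq)
# the single test row that follows each training window
test_sets <- as.list(seq(first_window + 1, n))

# check that no training row comes at or after its test row
precedes <- mapply(function(train, test) max(train) < test, train_sets, test_sets)
stopifnot(all(precedes), length(train_sets) == length(test_sets))
```

This check only says something about time order if row position and time order coincide, which is why the data is sorted before the task is created.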