How to use expand.grid values to run various model hyperparameter combinations for ranger in R

Question

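I have seen many posts on how to use expand.grid to select a model's independent variables and then build a formula from that selection. However, I prepare my input tables beforehand and store them in a list.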

library(ranger)
data(iris)
Input_list <- list(iris1 = iris, iris2 = iris)  # let's assume these are different input tables

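I would like to try every possible hyperparameter combination of a given algorithm (here: random forest using ranger) on my list of input tables. I set up the grid as follows: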

hyper_grid <- expand.grid(
  Input_table = names(Input_list),
  Trees = c(10, 20),
  Importance = c("none", "impurity"),
  Classification = TRUE,
  Repeats = 1:5,
  Target = "Species")

> head(hyper_grid)
  Input_table Trees Importance Classification Repeats  Target
1       iris1    10       none           TRUE       1 Species
2       iris2    10       none           TRUE       1 Species
3       iris1    20       none           TRUE       1 Species
4       iris2    20       none           TRUE       1 Species
5       iris1    10   impurity           TRUE       1 Species
6       iris2    10   impurity           TRUE       1 Species

My question is: what is the best way to pass these values to the model? Currently I am using a for loop:

for (i in 1:nrow(hyper_grid)) {
  RF_train <- ranger(
    dependent.variable.name = hyper_grid[i, "Target"], 
    data = Input_list[[hyper_grid[i, "Input_table"]]],  # referring to the named object in the list
    num.trees = hyper_grid[i, "Trees"], 
    importance = hyper_grid[i, "Importance"], 
    classification = hyper_grid[i, "Classification"])  # otherwise regression is performed
  print(RF_train)
}

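This iterates over each row of the grid. But first, I now have to tell the model whether it is a classification or a regression. I assume the factor Species is converted to numeric factor levels, so regression is performed by default. Is there a way to avoid this and use, for example, apply for this task? Iterating this way also leads to a messy function call: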

Call:
 ranger(dependent.variable.name = hyper_grid[i, "Target"], data = Input_list[[hyper_grid[i,      "Input_table"]]], num.trees = hyper_grid[i, "Trees"], importance = hyper_grid[i,      "Importance"], classification = hyper_grid[i, "Classification"])

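Second: in reality, the model output is of course not printed. Instead, I immediately capture the important results (mainly RF_train$confusion.matrix) and write them into an extended version of hyper_grid, in the same row as the input parameters. Is this sensible performance-wise, or is it expensive? If I store the ranger objects instead, I sometimes run into memory problems.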

Thanks!

Answer 1

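I think it is cleanest to wrap the training and the extraction of the desired values into a function. The dots (...) are needed so that the function works with purrr::pmap below.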

fit_and_extract_metrics <- function(Target, Input_table, Trees, Importance, Classification, ...) {
  RF_train <- ranger(
    dependent.variable.name = Target, 
    data = Input_list[[Input_table]],  # referring to the named object in the list
    num.trees = Trees, 
    importance = Importance, 
    classification = Classification)  # otherwise regression is performed

  data.frame(Prediction_error = RF_train$prediction.error,
             True_positive = RF_train$confusion.matrix[1])
}
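
To illustrate why the dots matter, here is a minimal sketch. The toy functions `with_dots` and `without_dots` are hypothetical stand-ins for the real wrapper. When purrr::pmap is given a data frame, it passes every column as a named argument, so a column the function does not name (here `Repeats`) has to be absorbed by `...`.

```r
library(purrr)

# Small grid with one column the toy functions do not use (Repeats)
grid <- expand.grid(Trees = c(10, 20), Repeats = 1:2)

# Hypothetical toy functions standing in for the real wrapper
with_dots    <- function(Trees, ...) Trees * 2
without_dots <- function(Trees) Trees * 2

# Works: the unused Repeats argument is swallowed by ...
res_ok <- pmap_dbl(grid, with_dots)

# Fails: Repeats is passed as an argument the function does not accept
res_err <- tryCatch(pmap_dbl(grid, without_dots),
                    error = function(e) "error")

print(res_ok)
print(res_err)
```

The same reasoning applies to fit_and_extract_metrics: hyper_grid contains a Repeats column that the function never names.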

You can then add the results as a column by mapping over the rows with purrr::pmap:

hyper_grid$res <- purrr::pmap(hyper_grid, fit_and_extract_metrics)

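Mapping this way applies the function row by row and keeps only the extracted metrics, not the fitted models, so you should not run into memory problems.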
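
If memory remains tight, one further option might be to tell ranger not to store the forest at all. The sketch below assumes that no predictions on new data are needed afterwards. It uses ranger's `write.forest` argument, and the helper `fit_small` is a hypothetical name introduced only for illustration.

```r
library(ranger)
data(iris)

# Hypothetical helper: fit one model and return only small summary values
fit_small <- function(n_trees) {
  rf <- ranger(
    dependent.variable.name = "Species",
    data = iris,
    num.trees = n_trees,
    write.forest = FALSE)  # assumption: the forest is not needed for prediction later

  # Only this one-row data frame survives; rf is discarded when the function returns
  data.frame(Trees = n_trees, Prediction_error = rf$prediction.error)
}

results <- do.call(rbind, lapply(c(10, 20), fit_small))
print(results)
```

Because each fitted object goes out of scope as soon as the function returns, only the one-row summaries accumulate.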

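The result of purrr::pmap is a list, which means the column res contains one list element per row. It can be unnested with tidyr::unnest to spread the elements of that list across the columns of your data frame.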

tidyr::unnest(hyper_grid, res)
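
One detail of the grid may be worth checking before mapping over it. By default, expand.grid converts character vectors to factors (stringsAsFactors = TRUE), and indexing a list with a factor uses its integer codes rather than its labels. The sketch below uses a hypothetical list `tables` as a stand-in for Input_list to show the difference. Setting stringsAsFactors = FALSE is a suggestion on my part, not something the original code does.

```r
# Hypothetical stand-in for Input_list
tables <- list(a = "first", b = "second")

# Default behaviour: character columns become factors
grid_fct <- expand.grid(Input_table = c("b", "a"))

# Character columns stay plain strings
grid_chr <- expand.grid(Input_table = c("b", "a"), stringsAsFactors = FALSE)

key_fct <- grid_fct$Input_table[1]  # a factor whose label is "b"
key_chr <- grid_chr$Input_table[1]  # the character string "b"

print(class(key_fct))
print(class(key_chr))

# Lookup by character key goes by name
print(tables[[key_chr]])

# Lookup by factor key goes by integer code, not by the label "b"
print(tables[[key_fct]])
```

In the question's grid the factor levels happen to line up with the order of Input_list, so the lookup returns the intended table, but plain character columns avoid relying on that coincidence.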

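I find this approach quite elegant, but it requires some tidyverse knowledge if you want to dig deeper. I highly recommend this book. Chapter 25 (Many models) describes an approach similar to the one I took here.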

r

grid-search
