Tidymodels: Unnest the RMSE and RSQ values for the best fitted model from a tibble after conducting 10-fold cross-validation in R

Question

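Overview

I have produced four models using the tidymodels package with the data frame FID (see the R code below):

General linear model (glm)
Bagged trees
Random forest
Boosted trees

The data frame contains three predictors:

Year (numeric)
Month (factor)
Days (numeric)

The dependent variable is Frequency (numeric).

Aim

My aim is to unnest the RMSE and RSQ metrics for the best fitted models (i.e. glm, bagged trees, random forest, boosted trees) from the tibble object generated by the function fit_resamples() after performing 10-fold cross-validation.

Example of a tibble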

# Resampling results
# 10-fold cross-validation 
# A tibble: 10 x 5
   splits         id     .metrics         .notes           .predictions    
   <list>         <chr>  <list>           <list>           <list>          
 1 <split [24/3]> Fold01 <tibble [2 × 3]> <tibble [0 × 1]> <tibble [3 × 3]>
 2 <split [24/3]> Fold02 <tibble [2 × 3]> <tibble [0 × 1]> <tibble [3 × 3]>
 3 <split [24/3]> Fold03 <tibble [2 × 3]> <tibble [0 × 1]> <tibble [3 × 3]>
 4 <split [24/3]> Fold04 <tibble [2 × 3]> <tibble [0 × 1]> <tibble [3 × 3]>
 5 <split [24/3]> Fold05 <tibble [2 × 3]> <tibble [0 × 1]> <tibble [3 × 3]>
 6 <split [24/3]> Fold06 <tibble [2 × 3]> <tibble [0 × 1]> <tibble [3 × 3]>
 7 <split [24/3]> Fold07 <tibble [2 × 3]> <tibble [0 × 1]> <tibble [3 × 3]>
 8 <split [25/2]> Fold08 <tibble [2 × 3]> <tibble [0 × 1]> <tibble [2 × 3]>
 9 <split [25/2]> Fold09 <tibble [2 × 3]> <tibble [0 × 1]> <tibble [2 × 3]>
10 <split [25/2]> Fold10 <tibble [2 × 3]> <tibble [0 × 1]> <tibble [2 × 3]>

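I would like to visualise the best models (i.e. glm, bagged trees, random forest, boosted trees) by producing a plot with the true values on the x-axis and the predicted values on the y-axis, as shown in the tutorial and plot below.

Tutorial

https://www.tmwr.org/performance.html

When I try to predict the test data from the fitted models using the function predict(), I keep getting these error messages in attempt 1 and attempt 2:

Error message - attempt 1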

 Error in UseMethod("predict") : 
  no applicable method for 'predict' applied to an object of class "c('resample_results', 'tune_results', 'tbl_df', 'tbl', 'data.frame')"

Error message - attempt 2

Error: `...` is not empty.

We detected these problematic arguments:
* `..1`

These dots only exist to allow future extensions and should be empty.
Did you misspecify an argument?

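Problem

I feel that I have to unnest the RMSE and RSQ metrics for all the fitted models (i.e. glm, bagged trees, random forest, boosted trees) before I can use the fitted models to predict the test data and so evaluate model validity, or else extract the best model from the range of models examined by the 10-fold cross-validation in the functions I created to fit the models.

If anyone can help me with predicting the test data from the fitted models using the function predict(), I would be deeply appreciative. Without binding the true and predicted values into one data frame for plotting with ggplot(), I cannot visualise the RMSE and RSQ metrics in a single plot.

Many thanks.

Plot

R code

Attempt 1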

##################################################
##Model Prediction
##################################################
##Open the tidymodels package
library(tidymodels)
library(glmnet)
library(parsnip)
library(rpart)
library(tidyverse) # manipulating data
library(skimr) # data visualization
library(baguette) # bagged trees
library(future) # parallel processing & decrease computation time
library(xgboost) # boosted trees
library(ranger)
library(yardstick)
library(purrr)
library(forcats)

###########################################################
#split this single dataset into two: a training set and a testing set
data_split <- initial_split(FID)
# Create data frames for the two sets:
train_data <- training(data_split)
test_data <- testing(data_split)

# resample the data with 10-fold cross-validation (10-fold by default)
cv <- vfold_cv(train_data, v=10)

###########################################################
##Produce the recipe
rec <- recipe(Frequency ~ ., data = FID) %>%
  step_nzv(all_predictors(), freq_cut = 0, unique_cut = 0) %>% # remove variables with zero variances
  step_novel(all_nominal()) %>% # prepares test data to handle previously unseen factor levels
  step_medianimpute(all_numeric(), -all_outcomes(), -has_role("id vars")) %>% # replaces missing numeric observations with the median
  step_dummy(all_nominal(), -has_role("id vars")) # dummy codes categorical variables

###########################################################
##Create Models
###########################################################

##########################################################
##General Linear Models
##########################################################
##glm
mod_glm <- linear_reg(mode="regression", penalty = 0.1, mixture = 1) %>%
  set_engine("glmnet")

##Create workflow
wflow_glm <- workflow() %>%
  add_recipe(rec) %>%
  add_model(mod_glm)

##Fit the model
plan(multisession)

fit_glm <- fit_resamples(
  wflow_glm,
  cv,
  metrics = metric_set(rmse, rsq),
  control = control_resamples(save_pred = TRUE,
                              extract = function(x) extract_model(x)))

##########################################################
##Bagged Trees
##########################################################
mod_bag <- bag_tree() %>%
  set_mode("regression") %>%
  set_engine("rpart", times = 10) #10 bootstrap resamples

##Create workflow
wflow_bag <- workflow() %>%
  add_recipe(rec) %>%
  add_model(mod_bag)

##Fit the model
plan(multisession)

fit_bag <- fit_resamples(
  wflow_bag,
  cv,
  metrics = metric_set(rmse, rsq),
  control = control_resamples(save_pred = TRUE,
                              extract = function(x) extract_model(x)))

###################################################
##Random forests
###################################################
mod_rf <- rand_forest(trees = 1e3) %>%
  set_engine("ranger",
             num.threads = parallel::detectCores(),
             importance = "permutation",
             verbose = TRUE) %>%
  set_mode("regression")

##Create Workflow
wflow_rf <- workflow() %>%
  add_model(mod_rf) %>%
  add_recipe(rec)

##Fit the model
plan(multisession)

fit_rf <- fit_resamples(
  wflow_rf,
  cv,
  metrics = metric_set(rmse, rsq),
  control = control_resamples(save_pred = TRUE,
                              extract = function(x) extract_model(x)))

############################################################
##Boosted Trees
############################################################
mod_boost <- boost_tree() %>%
  set_engine("xgboost", nthreads = parallel::detectCores()) %>%
  set_mode("regression")

##Create Workflow
wflow_boost <- workflow() %>%
  add_recipe(rec) %>%
  add_model(mod_boost)

##Fit model
plan(multisession)

fit_boost <- fit_resamples(
  wflow_boost,
  cv,
  metrics = metric_set(rmse, rsq),
  control = control_resamples(save_pred = TRUE,
                              extract = function(x) extract_model(x)))

Model prediction

###################################
##Model Prediction
###################################

##glm model
test_res <- predict(fit_glm, new_data = test_data %>% select(-Frequency))

##Error Message
Error in UseMethod("predict") :
  no applicable method for 'predict' applied to an object of class "c('resample_results', 'tune_results', 'tbl_df', 'tbl', 'data.frame')"

##Predicted numeric outcome from the regression model is named .pred. Let’s match
#the predicted values with their corresponding observed outcome values:
bind_test_res <- bind_cols(test_res, test_data %>% select(Frequency))

#Note that both the predicted and observed outcomes are in log10 units.
#It is best practice to analyze the predictions on the transformed scale
#(if one were used) even if the predictions are reported using the original units.

Plot the data using ggplot():

ggplot(bind_test_res, aes(x = Frequency, y = .pred)) +
  # Create a diagonal line:
  geom_abline(lty = 2) +
  geom_point(alpha = 0.5) +
  labs(y = "Predicted Frequency (log10)", x = "Frequency (log10)") +
  # Scale and size the x- and y-axis uniformly:
  coord_obs_pred()

Attempt 2

##split this single dataset into two: a training set and a testing set
data_split <- initial_split(FID)
# Create data frames for the two sets:
train_data <- training(data_split)
test_data <- testing(data_split)

##Produce the recipe
rec <- recipe(Frequency ~ ., data = FID) %>%
  step_nzv(all_predictors(), freq_cut = 0, unique_cut = 0) %>% # remove variables with zero variances
  step_novel(all_nominal()) %>% # prepares test data to handle previously unseen factor levels
  step_medianimpute(all_numeric(), -all_outcomes(), -has_role("id vars")) %>% # replaces missing numeric observations with the median
  step_dummy(all_nominal(), -has_role("id vars")) # dummy codes categorical variables

# resample the data with 10-fold cross-validation (10-fold by default)
cv <- vfold_cv(train_data, v=10)

##Run our models
# Extract our prepped training data
# and "bake" our testing data
prep <- prep(rec)
training_baked <- juice(prep)
testing_baked <- prep %>% bake(test_data)

##glm model
glm_model <- linear_reg(mode="regression", penalty = 0.1, mixture = 1) %>%
  set_engine("glmnet")

##Create workflow
wflow_glm <- workflow() %>%
  add_recipe(prep) %>%
  add_model(glm_model)

##fit the model
fit_glm <- wflow_glm %>%
  fit(Frequency~Year+Month+Days, data=FID)

##Error Message
Error: `...` is not empty.

We detected these problematic arguments:
* `..1`

These dots only exist to allow future extensions and should be empty.
Did you misspecify an argument?

Data frame - FID

structure(list(Year = c(2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017), Month = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L), .Label = c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"), class = "factor"), Frequency = c(36, 28, 39, 46, 5, 0, 0, 22, 10, 15, 8, 33, 33, 29, 31, 23, 8, 9, 7, 40, 41, 41, 30, 30, 44, 37, 41, 42, 20, 0, 7, 27, 35, 27, 43, 38), Days = c(31, 28, 31, 30, 6, 0, 0, 29, 15, 29, 29, 31, 31, 29, 30, 30, 7, 0, 7, 30, 30, 31, 30, 27, 31, 28, 30, 30, 21, 0, 7, 26, 29, 27, 29, 29)), row.names = c(NA, -36L), class = "data.frame")

Answer 1

This answer was inspired by Max Kuhn.

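# Packages used below (glmnet must also be installed for the "glmnet" engine)
library(tidymodels)
library(future) # provides plan(multisession)

#split this single dataset into two: a training set and a testing set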
data_split <- initial_split(FID)
# Create data frames for the two sets:
train_data <- training(data_split)
test_data  <- testing(data_split)

# resample the data with 10-fold cross-validation (10-fold by default)
cv <- vfold_cv(train_data, v=10)

###########################################################
##Produce the recipe

rec <- recipe(Frequency ~ ., data = FID) %>% 
          step_nzv(all_predictors(), freq_cut = 0, unique_cut = 0) %>% # remove variables with zero variances
          step_novel(all_nominal()) %>% # prepares test data to handle previously unseen factor levels 
          step_impute_median(all_numeric(), -all_outcomes(), -has_role("id vars"))  %>% # replaces missing numeric observations with the median (named step_medianimpute() in recipes < 0.1.16)
          step_dummy(all_nominal(), -has_role("id vars")) # dummy codes categorical variables

##########################################################
##Produce Models
##########################################################
##General Linear Models
##########################################################

##Produce the glm model
mod_glm<-linear_reg(mode="regression",
                       penalty = 0.1, 
                       mixture = 1) %>% 
                            set_engine("glmnet")

##Create workflow
wflow_glm <- workflow() %>% 
                add_recipe(rec) %>%
                      add_model(mod_glm)

#######################################################################
##MODEL EVALUATION
#######################################################################
##Estimate how well that model performs, let’s fit many times, 
##once to each of these resampled folds, and then evaluate on the heldout 
##part of each resampled fold.
##########################################################################
plan(multisession)

fit_glm <- fit_resamples(
                        wflow_glm,
                        cv,
                        metrics = metric_set(rmse, rsq),
                        control = control_resamples(save_pred = TRUE)
                        )

##Collect the held-out predictions from each fold for the outcome Frequency

Predictions<-fit_glm %>% 
                    collect_predictions()

##Produce a data frame of the Predictions model

Prediction<-as.data.frame(Predictions)

##Open a new plotting window
dev.new()

##Visualise the data by plotting the predicted vs true values
ggplot(Prediction, aes(x = Frequency, y = .pred)) + 
  # Create a diagonal line:
  geom_abline(lty = 2) + 
  geom_point(alpha = 0.5) + 
  labs(y = "Predicted Frequency", x = "Frequency") + # no log10 transform is applied to Frequency in this recipe
  # Scale and size the x- and y-axis uniformly:
  coord_obs_pred()
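
To get at the RMSE and RSQ values themselves, the nested .metrics column does not have to be unnested by hand. The sketch below is one possible way to do it, assuming a recent tidymodels release with glmnet installed. It rebuilds FID compactly from the data in the question; the seed, the simplified recipe (only step_dummy()) and the names metrics_summary and metrics_by_fold are illustrative assumptions, not part of the original code.

```r
library(tidymodels) # the glmnet package must also be installed

# FID rebuilt compactly from the data frame given in the question
FID <- data.frame(
  Year = rep(c(2015, 2016, 2017), each = 12),
  Month = factor(rep(month.name, times = 3), levels = month.name),
  Frequency = c(36, 28, 39, 46, 5, 0, 0, 22, 10, 15, 8, 33,
                33, 29, 31, 23, 8, 9, 7, 40, 41, 41, 30, 30,
                44, 37, 41, 42, 20, 0, 7, 27, 35, 27, 43, 38),
  Days = c(31, 28, 31, 30, 6, 0, 0, 29, 15, 29, 29, 31,
           31, 29, 30, 30, 7, 0, 7, 30, 30, 31, 30, 27,
           31, 28, 30, 30, 21, 0, 7, 26, 29, 27, 29, 29)
)

set.seed(123) # assumption: a seed, added only so the folds are reproducible
data_split <- initial_split(FID)
train_data <- training(data_split)
cv <- vfold_cv(train_data, v = 10)

# Simplified recipe (assumption): only dummy-code the Month factor
rec <- recipe(Frequency ~ ., data = train_data) %>%
  step_dummy(all_nominal_predictors())

mod_glm <- linear_reg(penalty = 0.1, mixture = 1) %>%
  set_engine("glmnet")

wflow_glm <- workflow() %>%
  add_recipe(rec) %>%
  add_model(mod_glm)

fit_glm <- fit_resamples(
  wflow_glm,
  cv,
  metrics = metric_set(rmse, rsq),
  control = control_resamples(save_pred = TRUE)
)

# One row per metric, averaged over the 10 folds
metrics_summary <- collect_metrics(fit_glm)

# One row per fold and metric, i.e. the .metrics column unnested
metrics_by_fold <- collect_metrics(fit_glm, summarize = FALSE)

print(metrics_summary)
print(metrics_by_fold)
```

The same two calls should also work on fit_bag, fit_rf and fit_boost, so the averaged RMSE and RSQ of the four models can be put side by side when choosing the best one.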

Plot
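
On the two errors in the question: fit_resamples() returns resampling results rather than a fitted model, so predict() has no method for it, and fit() on a workflow takes only the data because the formula already lives in the recipe. A hedged sketch of predicting the test set follows, assuming a recent tidymodels release with glmnet installed. The seed, the simplified recipe and the names final_fit, final_res, test_metrics and p are illustrative assumptions.

```r
library(tidymodels) # the glmnet package must also be installed

# FID rebuilt compactly from the data frame given in the question
FID <- data.frame(
  Year = rep(c(2015, 2016, 2017), each = 12),
  Month = factor(rep(month.name, times = 3), levels = month.name),
  Frequency = c(36, 28, 39, 46, 5, 0, 0, 22, 10, 15, 8, 33,
                33, 29, 31, 23, 8, 9, 7, 40, 41, 41, 30, 30,
                44, 37, 41, 42, 20, 0, 7, 27, 35, 27, 43, 38),
  Days = c(31, 28, 31, 30, 6, 0, 0, 29, 15, 29, 29, 31,
           31, 29, 30, 30, 7, 0, 7, 30, 30, 31, 30, 27,
           31, 28, 30, 30, 21, 0, 7, 26, 29, 27, 29, 29)
)

set.seed(123) # assumption: a seed, added only for reproducibility
data_split <- initial_split(FID)
train_data <- training(data_split)
test_data <- testing(data_split)

# Pass the unprepped recipe to the workflow; it is prepped during fit()
rec <- recipe(Frequency ~ ., data = train_data) %>%
  step_dummy(all_nominal_predictors())

wflow_glm <- workflow() %>%
  add_recipe(rec) %>%
  add_model(linear_reg(penalty = 0.1, mixture = 1) %>% set_engine("glmnet"))

# Option 1: fit the workflow on the training set, then predict the test set
final_fit <- fit(wflow_glm, data = train_data)
test_res <- bind_cols(
  predict(final_fit, new_data = test_data),
  test_data %>% select(Frequency)
)

# Option 2: last_fit() fits on the training set and evaluates on the test set
final_res <- last_fit(wflow_glm, data_split, metrics = metric_set(rmse, rsq))
test_metrics <- collect_metrics(final_res)
print(test_metrics)

# Observed vs predicted values for the test set
p <- ggplot(test_res, aes(x = Frequency, y = .pred)) +
  geom_abline(lty = 2) +
  geom_point(alpha = 0.5) +
  labs(y = "Predicted Frequency", x = "Frequency") +
  coord_obs_pred()
print(p)
```

Both options fit the model once on the full training set. The cross-validation results are used only to decide which workflow to carry forward to this step.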

regression

r

machine-learning

ggplot2

tidymodels