Variable Importance Tidymodels versus Caret with Interactions

Why do the variable importance plots differ between tidymodels and caret when interaction terms are included? I've demonstrated the issue with the Ames housing data below. I used the same alpha/mixture and lambda/penalty in both models. The only difference between the two fits is the cross-validation folds (I couldn't figure out how to use the tidymodels folds with caret's train(), though one possible approach is sketched after the vfold_cv() step below). Any ideas why this happens?

library(AmesHousing)
library(tidymodels)
library(caret)
library(vip)

df <- data.frame(ames_raw)
head(df)

# replace missing values in numeric columns with the column mean
# (guard added: mean() is not defined for ames_raw's character columns)
for (i in 1:ncol(df)) {
  if (is.numeric(df[, i])) {
    df[is.na(df[, i]), i] <- mean(df[, i], na.rm = TRUE)
  }
}



# Create a data split object
set.seed(1994)
home_split <- initial_split(df,
                            prop = 0.7,
                            strata = SalePrice)

home_train <- home_split %>%
  training()

home_test <- home_split %>%
  testing()


# pre-process recipe
recipe_home <- recipe(SalePrice ~ Yr.Sold + Fireplaces + Full.Bath + Half.Bath + Year.Built + Lot.Area,
                      data = home_train) %>%
  step_interact(terms = ~ Yr.Sold:Fireplaces:Full.Bath:Half.Bath:Year.Built:Lot.Area)



# model with hyperparameters
glmnet_model <- linear_reg(penalty = tune(),  # lambda
                           mixture = tune()) %>%  # alpha
  set_engine('glmnet') %>%
  set_mode('regression')

# model + recipe = workflow
wkfl <- workflow() %>%
  add_model(glmnet_model) %>%
  add_recipe(recipe_home)



# cv
set.seed(1994)
myfolds <- vfold_cv(home_train,
                    v = 10,
                    strata = SalePrice)
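
An aside on sharing the folds: rsample ships a converter, rsample2caret(), that turns an rset like myfolds into the index/indexOut lists that caret's trainControl() accepts. A minimal sketch, not used in the rest of this post:

# convert the rsample folds into caret-style row-index lists so that both
# frameworks could resample on exactly the same folds
caret_idx <- rsample2caret(myfolds)

shared_ctrl <- trainControl(method = "cv",
                            index = caret_idx$index,
                            indexOut = caret_idx$indexOut)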



# grid search with cv
set.seed(1994)
glmnet_tuning <- wkfl %>%
  tune_grid(resamples = myfolds,
            grid = 25,  # try 25 candidate penalty/mixture combinations
            metrics = metric_set(rmse))

glmnet_tuning





# select the best model
best_glmnet_model <- glmnet_tuning %>%
  select_best(metric = 'rmse')

best_glmnet_model

# finalize the workflow
final_glmnet_wkfl <- wkfl %>%
  finalize_workflow(best_glmnet_model)

# last_fit: fit on the full training set, evaluate on the test set
glmnet_final_fit <- final_glmnet_wkfl %>%
  last_fit(split = home_split)

# extract the final fitted workflow
final_glmnet <- extract_workflow(glmnet_final_fit)

# VIP final model
final_glmnet %>%
  extract_fit_parsnip() %>%
  vip(geom = "point", scale = TRUE)
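
If it helps, the numbers behind that plot can be inspected directly; for a glmnet fit, vip()'s model-based importance is the absolute value of the coefficients at the selected penalty:

# inspect the fitted coefficients that drive the importance ranking
final_glmnet %>%
  extract_fit_parsnip() %>%
  tidy() %>%
  arrange(desc(abs(estimate)))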

# caret fit with the hyperparameters selected above
set.seed(1994)
myGrid <- expand.grid(lambda = 0.00386,
                      alpha = 0.0874)

model_glmnet <- train(SalePrice ~ (Yr.Sold + Fireplaces + Full.Bath + Half.Bath + Year.Built + Lot.Area)^2,
                      data = home_train,
                      method = "glmnet",
                      tuneGrid = myGrid,  # train()'s argument is tuneGrid, not tune_grid
                      metric = "RMSE",
                      maximize = FALSE,
                      trControl = trainControl(method = "cv",
                                               number = 10))

# variable importance
vip(model_glmnet, geom = "point", scale = TRUE)

It looks like the two model specifications have very different feature sets, which is why you're seeing different importance plots.

In your example, recipe_home has a single interaction term that crosses the whole set of variables: Yr.Sold:Fireplaces:Full.Bath:Half.Bath:Year.Built:Lot.Area

recipe_home <- 
  recipe(SalePrice ~ Yr.Sold + Fireplaces + Full.Bath + Half.Bath + Year.Built + Lot.Area,
         data = home_train) %>%
  step_interact(terms = ~ Yr.Sold:Fireplaces:Full.Bath:Half.Bath:Year.Built:Lot.Area)
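
You can check what that step actually produces by prepping the recipe and listing the resulting columns. Besides SalePrice and the six main effects, there should be exactly one new column, the single six-way product (step_interact() joins the crossed names with _x_):

recipe_home %>%
  prep(training = home_train) %>%
  bake(new_data = NULL) %>%  # return the processed training set
  names()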

In your caret {glmnet} model, you create a whole bunch of interactions between pairs of variables by using ^2 in the formula (this gives a good definition of crossing).

model_glmnet <- 
  train(SalePrice ~ (Yr.Sold + Fireplaces + Full.Bath + Half.Bath + Year.Built + Lot.Area)^2,
        data = home_train,
        method = "glmnet",
        tuneGrid = myGrid,
        metric = "RMSE",
        maximize = FALSE,
        trControl = trainControl(method = "cv", number = 10))
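
You can see the expansion that ^2 triggers without fitting anything; base R crosses every pair, so choose(6, 2) = 15 interaction columns get added on top of the six main effects:

# list the design-matrix columns the ^2 formula generates
colnames(model.matrix(
  ~ (Yr.Sold + Fireplaces + Full.Bath + Half.Bath + Year.Built + Lot.Area)^2,
  data = home_train))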

So in the second plot, the ^2 interactions (e.g., Fireplaces:Full.Bath) create a bunch of important features that don't appear at all in the recipe_home model.
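
You can confirm that with caret's own importance summary, which for glmnet is likewise based on absolute coefficient values; the pairwise terms show up there but have no counterpart in the recipe_home fit:

varImp(model_glmnet)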

Depending on which interaction terms you actually want, you should be able to make the models match either by changing the recipe_home formula to drop step_interact() and add ^2, or by changing the formula in the glmnet model to drop ^2 and add the long interaction term.
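
For example, the caret side of that change might look like this (a sketch; model_glmnet_match is just an illustrative name):

# drop ^2 and spell out the same single six-way term that step_interact()
# builds in recipe_home
model_glmnet_match <- train(
  SalePrice ~ Yr.Sold + Fireplaces + Full.Bath + Half.Bath + Year.Built + Lot.Area +
    Yr.Sold:Fireplaces:Full.Bath:Half.Bath:Year.Built:Lot.Area,
  data = home_train,
  method = "glmnet",
  tuneGrid = myGrid,
  metric = "RMSE",
  maximize = FALSE,
  trControl = trainControl(method = "cv", number = 10))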