Variable Importance Tidymodels versus Caret with Interactions
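Why is the variable importance plot different between tidymodels and caret when interaction terms are included? I have demonstrated this below with the Ames housing data. I used the same alpha/mixture and lambda/penalty in both models. The only difference between the models is the cross-validation folds (I couldn't figure out how to use tidymodels' folds with caret's train). Any ideas as to why this is happening?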
library(AmesHousing)
library(tidymodels)
library(caret)
library(vip)

df <- data.frame(ames_raw)
head(df)

# replace any missing observation with the mean
for (i in 1:ncol(df)) {
  df[is.na(df[, i]), i] <- mean(df[, i], na.rm = TRUE)
}

# Create a data split object
set.seed(1994)
home_split <- initial_split(df,
                            prop = 0.7,
                            strata = SalePrice)
home_train <- home_split %>%
  training()
home_test <- home_split %>%
  testing()

# pre-process recipe
recipe_home <- recipe(SalePrice ~ Yr.Sold + Fireplaces + Full.Bath + Half.Bath + Year.Built + Lot.Area,
                      data = home_train) %>%
  step_interact(terms = ~ Yr.Sold:Fireplaces:Full.Bath:Half.Bath:Year.Built:Lot.Area)

# model with hyperparameters
glmnet_model <- linear_reg(penalty = tune(),    # lambda
                           mixture = tune()) %>% # alpha
  set_engine('glmnet') %>%
  set_mode('regression')

# model + recipe = workflow
wkfl <- workflow() %>%
  add_model(glmnet_model) %>%
  add_recipe(recipe_home)

# cv
set.seed(1994)
myfolds <- vfold_cv(home_train,
                    v = 10,
                    strata = SalePrice)

# grid search with cv
set.seed(1994)
glmnet_tuning <- wkfl %>%
  tune_grid(resamples = myfolds,
            grid = 25, # let the model find the best hyperparameters
            metrics = metric_set(rmse))
glmnet_tuning

# select the best model
best_glmnet_model <- glmnet_tuning %>%
  select_best(metric = 'rmse')
best_glmnet_model

# finalize the workflow
final_glmnet_wkfl <- wkfl %>%
  finalize_workflow(best_glmnet_model)

# last_fit:
glmnet_final_fit <- final_glmnet_wkfl %>%
  last_fit(split = home_split)

# extract the final model
final_glmnet <- extract_workflow(glmnet_final_fit)

# VIP final model
final_glmnet %>%
  extract_fit_parsnip() %>%
  vip(geom = "point", scale = TRUE)

# caret model with the penalty/mixture values found above
set.seed(1994)
myGrid <- expand.grid(lambda = 0.00386,
                      alpha = 0.0874)
model_glmnet <- train(SalePrice ~ (Yr.Sold + Fireplaces + Full.Bath + Half.Bath + Year.Built
                                   + Lot.Area)^2,
                      data = home_train,
                      method = "glmnet",
                      tuneGrid = myGrid, # caret's argument is tuneGrid, not tune_grid
                      metric = "RMSE",
                      maximize = FALSE,
                      trControl = trainControl(
                        method = "cv",
                        number = 10))

# variable importance
vip(model_glmnet, geom = "point", scale = TRUE)

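It looks like the two model specifications have very different features, which is why you're seeing different importance plots.

In your example, recipe_home has a single interaction term that spans the whole set of variables: Yr.Sold:Fireplaces:Full.Bath:Half.Bath:Year.Built:Lot.Area.
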
recipe_home <-
  recipe(SalePrice ~ Yr.Sold + Fireplaces + Full.Bath + Half.Bath + Year.Built + Lot.Area,
         data = home_train) %>%
  step_interact(terms = ~ Yr.Sold:Fireplaces:Full.Bath:Half.Bath:Year.Built:Lot.Area)

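In your {glmnet} model (the one fit through caret's train()), you create a whole bunch of two-way interactions by using ^2 in the formula. In an R formula, (a + b + c)^2 is the crossing operator: it expands to every main effect plus every pairwise interaction between the variables.
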
model_glmnet <-
  train(SalePrice ~ (Yr.Sold + Fireplaces + Full.Bath + Half.Bath + Year.Built + Lot.Area)^2,
        data = home_train,
        method = "glmnet",
        tuneGrid = myGrid, # caret's argument is tuneGrid, not tune_grid
        metric = "RMSE",
        maximize = FALSE,
        trControl = trainControl(method = "cv", number = 10))

So in the second plot, the ^2 interactions (e.g., Fireplaces:Full.Bath) create a bunch of important features that don't appear at all in the recipe_home model.
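
To make the mismatch concrete, here is a small base R sketch that lists the terms each specification asks for. It only inspects the formulas with terms(), so it needs neither the data nor the modelling packages. The object names (vars, f_recipe, f_caret) are made up for this illustration, and f_recipe is an assumed formula-only equivalent of what the recipe produces (six main effects plus the one long interaction).

```r
# The six predictors used in both specifications
vars <- c("Yr.Sold", "Fireplaces", "Full.Bath",
          "Half.Bath", "Year.Built", "Lot.Area")

# recipe_home: main effects plus ONE six-way interaction term
f_recipe <- reformulate(c(vars, paste(vars, collapse = ":")))

# caret model: main effects plus ALL pairwise interactions via ^2
f_caret <- as.formula(paste("~ (", paste(vars, collapse = " + "), ")^2"))

labels_recipe <- attr(terms(f_recipe), "term.labels")
labels_caret  <- attr(terms(f_caret), "term.labels")

length(labels_recipe) # 7 terms
length(labels_caret)  # 21 terms

# Terms that exist only in the caret specification (the pairwise interactions)
only_caret <- setdiff(labels_caret, labels_recipe)
# Terms that exist only in the recipe specification (the six-way interaction)
only_recipe <- setdiff(labels_recipe, labels_caret)

print(only_caret)
print(only_recipe)
```

The two models share only the six main effects, so the importance rankings are computed over different feature sets and cannot be expected to agree.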
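
Depending on which interaction terms you want, you should be able to match the models in one of two ways: either change recipe_home to remove the long step_interact() term and add ^2, or change the formula in the glmnet model to remove ^2 and add the long interaction term.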
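
Below is a minimal sketch of the first option. It assumes the pairwise expansion is written inside step_interact() rather than in the recipe() formula itself. It uses a small made-up data frame (toy) with three of the predictors so it runs without the Ames data; the names toy, rec_pairwise and baked are hypothetical.

```r
library(recipes)

# Made-up stand-in for home_train, for illustration only
set.seed(1994)
toy <- data.frame(
  SalePrice  = rnorm(20, mean = 180000, sd = 30000),
  Fireplaces = rpois(20, 1),
  Full.Bath  = rpois(20, 2),
  Lot.Area   = rnorm(20, mean = 10000, sd = 2000)
)

# Main effects come from the recipe formula; step_interact() adds
# one column per pairwise interaction requested by ^2
rec_pairwise <- step_interact(
  recipe(SalePrice ~ Fireplaces + Full.Bath + Lot.Area, data = toy),
  terms = ~ (Fireplaces + Full.Bath + Lot.Area)^2
)

# Estimate the recipe and return the processed training data
baked <- bake(prep(rec_pairwise), new_data = NULL)

# Interaction columns are named with the default "_x_" separator
interaction_cols <- grep("_x_", names(baked), value = TRUE)
print(interaction_cols)
```

With the full predictor list in place of the three used here, the recipe should produce the same kind of pairwise features that the ^2 formula gives caret, so the two importance plots are at least computed over comparable columns. The fitted coefficients can still differ somewhat because the folds, and possibly the tuned penalty and mixture, are not identical.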