Tidymodels Error: Can't rename variables in this context

Question

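After using R at school for a few months, I recently started using Tidymodels.

I'm trying to build my first model with the Titanic dataset from Kaggle, but I'm running into problems when fitting the model. Can someone help me?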

titanic_rec <- recipe(Survived ~ Sex + Age + Pclass + Embarked + Family_Size + Name, data = titanic_train) %>%
  step_impute_knn(all_predictors(), k = 3) %>% 
  step_dummy(Sex, Pclass, Embarked, Family_Size, Name) %>% 
  step_interact(~ Sex:Age + Sex:Pclass + Pclass:Age)
  
log_model <- logistic_reg() %>% 
              set_engine("glm") %>% 
              set_mode("classification")

fitted_log_model <- workflow() %>%
                      add_model(log_model) %>%
                      add_recipe(titanic_rec) %>% 
                      fit(data = titanic_train) %>% 
                      pull_workflow_fit() %>% 
                      tidy()

Every feature is a factor, except Age and Survived, which are doubles. The error seems to appear once I include fit(data = ...).

Error: Can't rename variables in this context. Run `rlang::last_error()` to see where the error occurred.
24. stop(fallback)
23. signal_abort(cnd)
22. abort("Can't rename variables in this context.")
21. eval_select_recipes(to_impute, training, info)
20. impute_var_lists(to_impute = x$terms, impute_using = x$impute_with, training = training, info = info)
19. prep.step_impute_knn(x$steps[[i]], training = training, info = x$term_info)
18. prep(x$steps[[i]], training = training, info = x$term_info)
17. prep.recipe(blueprint$recipe, training = data, fresh = blueprint$fresh)
16. recipes::prep(blueprint$recipe, training = data, fresh = blueprint$fresh)
15. blueprint$mold$process(blueprint = blueprint, data = data)
14. run_mold.recipe_blueprint(blueprint, data)
13. run_mold(blueprint, data)
12. mold.recipe(recipe, data, blueprint = blueprint)
11. hardhat::mold(recipe, data, blueprint = blueprint)
10. fit.action_recipe(action, workflow = workflow, data = data)
9. fit(action, workflow = workflow, data = data)
8. .fit_pre(workflow, data)
7. fit.workflow(., data = titanic_train)
6. fit(., data = titanic_train)
5. is_workflow(x)
4. validate_is_workflow(x)
3. pull_workflow_fit(.)
2. tidy(.)
1. workflow() %>% add_model(log_model) %>% add_recipe(titanic_rec) %>% fit(data = titanic_train) %>% pull_workflow_fit() %>% tidy()

Answer 1

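The error you posted comes from step_impute_knn(), where the number of neighbors should be specified with the neighbors argument rather than k. Second, I'd advise against using Name as a predictor, because it creates a separate dummy variable for each name, which will hurt the fit.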
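Both points can be checked with a small sketch, assuming the recipes package is installed. The toy data frame below is made up for illustration and is not the Kaggle file. My reading of the traceback (an assumption, not something the recipes documentation is quoted on here) is that since k is not a formal argument of step_impute_knn(), it falls into `...` and is treated as a named variable selection, which is what triggers the "rename" message.

```r
library(recipes)

# Part 1: which arguments does step_impute_knn() accept by name?
arg_names <- names(formals(step_impute_knn))
has_neighbors <- "neighbors" %in% arg_names
has_k <- "k" %in% arg_names
print(c(neighbors = has_neighbors, k = has_k))

# Part 2: a made-up data frame in which every row has a distinct name,
# similar in spirit to the Name column of the Titanic data.
toy <- data.frame(
  y = c(0, 1, 0, 1, 1),
  Name = factor(c("Ann", "Bob", "Cara", "Dev", "Eli"))
)

# model.matrix() uses treatment contrasts by default for unordered factors:
# one indicator column per level, except the reference level.
name_dummies <- model.matrix(~ Name, data = toy)[, -1, drop = FALSE]
n_dummy_cols <- ncol(name_dummies)
print(n_dummy_cols)
```

With five rows and five distinct names, the dummy encoding alone yields four columns, so a predictor like Name gives the model almost as many parameters as observations.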

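The last error comes from step_interact(). You can't use step_interact(~ Sex:Age) after step_dummy(Sex), because once step_dummy() has run there is no longer a column named Sex. Instead there will be Sex_male (female is absorbed into the intercept). One way to capture all of the created dummy variables is to use starts_with() inside step_interact().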

library(tidymodels)

titanic_train <- readr::read_csv("your/path/to/data/train.csv")

titanic_train <- titanic_train %>%
  mutate(Survived = factor(Survived),
         Pclass = factor(Pclass),
         Family_Size = SibSp + Parch + 1)

titanic_rec <- recipe(Survived ~ Sex + Age + Pclass + Embarked + Family_Size, 
                      data = titanic_train) %>%
  step_impute_knn(all_predictors(), neighbors = 3) %>% 
  step_dummy(Sex, Pclass, Embarked) %>% 
  step_interact(~ starts_with("Sex_"):Age + 
                  starts_with("Sex_"):starts_with("Pclass_") + 
                  starts_with("Pclass_"):Age)
  
log_model <- logistic_reg() %>% 
              set_engine("glm") %>% 
              set_mode("classification")

fitted_log_model <- workflow() %>%
                      add_model(log_model) %>%
                      add_recipe(titanic_rec) %>% 
                      fit(data = titanic_train) %>% 
                      pull_workflow_fit() %>% 
                      tidy()

fitted_log_model
#> # A tibble: 13 x 5
#>    term                 estimate std.error statistic   p.value
#>    <chr>                   <dbl>     <dbl>     <dbl>     <dbl>
#>  1 (Intercept)            3.85      0.921      4.18  0.0000289
#>  2 Age                    0.0117    0.0226     0.516 0.606    
#>  3 Family_Size           -0.226     0.0671    -3.36  0.000769 
#>  4 Sex_male              -2.22      0.886     -2.50  0.0124   
#>  5 Pclass_X2              1.53      1.16       1.31  0.189    
#>  6 Pclass_X3             -2.42      0.884     -2.74  0.00615  
#>  7 Embarked_Q            -0.0461    0.368     -0.125 0.900    
#>  8 Embarked_S            -0.548     0.243     -2.26  0.0241   
#>  9 Sex_male_x_Age        -0.0488    0.0199    -2.46  0.0140   
#> 10 Sex_male_x_Pclass_X2  -1.28      0.879     -1.46  0.144    
#> 11 Sex_male_x_Pclass_X3   1.48      0.699      2.11  0.0347   
#> 12 Age_x_Pclass_X2       -0.0708    0.0263    -2.69  0.00714  
#> 13 Age_x_Pclass_X3       -0.0341    0.0209    -1.63  0.103

Created on 2021-07-01 by the reprex package (v2.0.0)
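To see why the interaction has to refer to Sex_male rather than Sex, it can help to prep and bake a stripped-down recipe and look at the resulting column names. This is only a sketch on a made-up six-row data frame (not the Kaggle data), assuming the recipes package is installed.

```r
library(recipes)

# Toy data, invented for illustration only
toy <- data.frame(
  Survived = factor(c(0, 1, 1, 0, 1, 0)),
  Sex = factor(c("male", "female", "female", "male", "female", "male")),
  Age = c(22, 38, 26, 35, 27, 54)
)

rec <- recipe(Survived ~ Sex + Age, data = toy) %>%
  step_dummy(Sex) %>%
  # Select the dummy columns created above by their shared prefix
  step_interact(~ starts_with("Sex_"):Age)

# Estimate the recipe on the toy data and return the processed training set
baked <- bake(prep(rec), new_data = NULL)
baked_names <- names(baked)
print(baked_names)
```

The printed names should no longer include Sex. They contain Sex_male and the interaction column Sex_male_x_Age instead, matching the term names in the tidy() output above.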