R: using a lmer model in fit_resamples() fails with "Error: Assigned data `factor(lvl[1], levels = lvl)` must be compatible with existing data."

Question

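I am trying to build a linear mixed model using the tidymodels packages. It looks like I am specifying the formula correctly, since calling fit() on the workflow works. However, when I try to fit the same workflow over resamples of my data with fit_resamples(), I get an error related to missing factor levels.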

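I am not sure whether I am doing something wrong or whether the packages "multilevelmod" and "tune" are incompatible in this respect, so I would appreciate any advice.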

I have included a reprex using the "mpg" dataset.

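Edit: After digging deeper into the problem, I found out how to make an lmer model predict for new combinations of values by passing allow.new.levels = TRUE to predict(). My question is then: how can I specify that argument in workflow() or fit_resamples()?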

I guessed that adding step_novel() to the recipe and passing default_recipe_blueprint(allow_novel_levels = TRUE) in the add_recipe() call would be enough, but it still does not seem to work.

library(tidyverse)
library(tidymodels)
library(multilevelmod)

set.seed(1243)

data(mpg, package = "ggplot2")

training = mpg %>%
  initial_split() %>%
  training() %>% 
  mutate(manufacturer = manufacturer %>% as_factor(),
         model = model %>% as_factor())

training_folds = training %>%
  validation_split()

lmm_model = linear_reg() %>% 
  set_engine("lmer")

lmm_recipe = recipe(cty ~ year + manufacturer + model, data = training) %>%
  step_novel(manufacturer, model)

lmm_formula = cty ~ year + (1|manufacturer/model)

lmm_workflow = workflow() %>% 
  # see: 
  add_recipe(lmm_recipe, blueprint = hardhat::default_recipe_blueprint(allow_novel_levels = TRUE)) %>% 
  add_model(lmm_model, formula = lmm_formula)

# A simple fit works
fit(lmm_workflow, training)
#> == Workflow [trained] ==========================================================
#> Preprocessor: Recipe
#> Model: linear_reg()
#> 
#> -- Preprocessor ----------------------------------------------------------------
#> 1 Recipe Step
#> 
#> * step_novel()
#> 
#> -- Model -----------------------------------------------------------------------
#> Linear mixed model fit by REML ['lmerMod']
#> Formula: cty ~ year + (1 | manufacturer/model)
#>    Data: data
#> REML criterion at convergence: 864.9986
#> Random effects:
#>  Groups             Name        Std.Dev.
#>  model:manufacturer (Intercept) 2.881   
#>  manufacturer       (Intercept) 2.675   
#>  Residual                       2.181   
#> Number of obs: 175, groups:  model:manufacturer, 38; manufacturer, 15
#> Fixed Effects:
#> (Intercept)         year  
#>    -8.06202      0.01228

# A fit with resamplings doesn't work:
fit_resamples(lmm_workflow, resamples = training_folds)
#> Warning: package 'lme4' was built under R version 4.1.1
#> x validation: preprocessor 1/1, model 1/1 (predictions): Error:
#> ! Assigned data `facto...
#> Warning: All models failed. See the `.notes` column.
#> # Resampling results
#> # Validation Set Split (0.75/0.25)  
#> # A tibble: 1 x 4
#>   splits           id         .metrics .notes          
#>   <list>           <chr>      <list>   <list>          
#> 1 <split [131/44]> validation <NULL>   <tibble [1 x 3]>
#> 
#> There were issues with some computations:
#> 
#>   - Error(s) x1:  ! Assigned data `factor(lvl[1], levels = lvl)` must be compatibl...
#> 
#> Use `collect_notes(object)` for more information.

# It seems that the problem is that the combinations of factor levels differ
# between analysis and assessment set
analysis_set = analysis(training_folds$splits[[1]])
assessment_set = assessment(training_folds$splits[[1]])

identical(
  analysis_set %>% distinct(manufacturer, model),
  assessment_set %>% distinct(manufacturer, model)
)
#> [1] FALSE

# directly fitting the model on the analysis set
# lme4 is called with an explicit namespace in case it is not attached
analysis_fit = lme4::lmer(formula = lmm_formula, data = analysis_set)

# predicting the values for the missing combinations of levels is actually possible
# see: 
assessment_predict = analysis_fit %>%
  predict(training_folds$splits[[1]] %>% assessment(),
          allow.new.levels = TRUE)

Created on 2022-05-24 by the reprex package (v2.0.1)

Answer 1

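I think the problem is that you end up with different factor levels in the training and testing sets (called the analysis and assessment sets for resamples in tidymodels):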

library(tidymodels)
data(mpg, package = "ggplot2")

training <- mpg %>%
  initial_split() %>%
  training()

training_folds <- training %>%
  validation_split()

training %>% count(manufacturer, model)
#> # A tibble: 38 × 3
#>    manufacturer model                  n
#>    <chr>        <chr>              <int>
#>  1 audi         a4                     5
#>  2 audi         a4 quattro             5
#>  3 audi         a6 quattro             3
#>  4 chevrolet    c1500 suburban 2wd     4
#>  5 chevrolet    corvette               4
#>  6 chevrolet    k1500 tahoe 4wd        3
#>  7 chevrolet    malibu                 5
#>  8 dodge        caravan 2wd            6
#>  9 dodge        dakota pickup 4wd      8
#> 10 dodge        durango 4wd            5
#> # … with 28 more rows
analysis(training_folds$splits[[1]]) %>% count(manufacturer, model)
#> # A tibble: 37 × 3
#>    manufacturer model                  n
#>    <chr>        <chr>              <int>
#>  1 audi         a4                     1
#>  2 audi         a4 quattro             4
#>  3 audi         a6 quattro             3
#>  4 chevrolet    c1500 suburban 2wd     4
#>  5 chevrolet    corvette               3
#>  6 chevrolet    k1500 tahoe 4wd        2
#>  7 chevrolet    malibu                 4
#>  8 dodge        caravan 2wd            5
#>  9 dodge        dakota pickup 4wd      5
#> 10 dodge        durango 4wd            4
#> # … with 27 more rows
assessment(training_folds$splits[[1]]) %>% count(manufacturer, model)
#> # A tibble: 28 × 3
#>    manufacturer model                   n
#>    <chr>        <chr>               <int>
#>  1 audi         a4                      4
#>  2 audi         a4 quattro              1
#>  3 chevrolet    corvette                1
#>  4 chevrolet    k1500 tahoe 4wd         1
#>  5 chevrolet    malibu                  1
#>  6 dodge        caravan 2wd             1
#>  7 dodge        dakota pickup 4wd       3
#>  8 dodge        durango 4wd             1
#>  9 dodge        ram 1500 pickup 4wd     1
#> 10 ford         expedition 2wd          3
#> # … with 18 more rows

Created on 2022-05-23 by the reprex package (v2.0.1)

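When you fit the model once on all the training data, you have 38 combinations of manufacturer and model. When you fit using the validation split, there are 37 combinations in the training/analysis set and 28 in the testing/assessment set. Some of these do not overlap, so the model cannot predict for observations whose combination appears in the assessment set but not in the analysis set.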
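To see exactly which combinations cause trouble, one option is to list the manufacturer/model pairs that occur in the assessment set but not in the analysis set. The following is a minimal base R sketch on made-up toy data; the two small data frames and the helper combo_key() are hypothetical stand-ins for illustration, not part of tidymodels or of the reprex above.

```r
# Toy stand-ins for the analysis and assessment sets (hypothetical data).
analysis_set <- data.frame(
  manufacturer = c("audi", "audi", "dodge"),
  model        = c("a4", "a4 quattro", "caravan 2wd")
)
assessment_set <- data.frame(
  manufacturer = c("audi", "dodge", "dodge"),
  model        = c("a4", "caravan 2wd", "ram 1500 pickup 4wd")
)

# Hypothetical helper: one string key per distinct manufacturer/model pair.
combo_key <- function(df) {
  unique(paste(df$manufacturer, df$model, sep = " / "))
}

# Pairs the model would have to predict for without having seen them.
unseen <- setdiff(combo_key(assessment_set), combo_key(analysis_set))
unseen
#> [1] "dodge / ram 1500 pickup 4wd"
```

With the real split, the same comparison could be run on analysis(training_folds$splits[[1]]) and assessment(training_folds$splits[[1]]); any non-empty result marks rows for which the prediction step hits an unseen grouping level.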