R: using a lmer model in fit_resamples() fails with "Error: Assigned data `factor(lvl[1], levels = lvl)` must be compatible with existing data."
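I am trying to build a linear mixed model using the tidymodels packages. I seem to be specifying the formula correctly, because fit() runs without problems on the workflow.
However, when I try to run the model on resamples of my data with fit_resamples(), I get an error related to missing factor levels.
It is not clear to me whether I am doing something wrong or whether the packages "multilevelmod" and "tune" are incompatible in this way, so I would appreciate any advice.
I have included a reprex using the "mpg" dataset.
EDIT: After digging further into the problem, especially this question: I found out how to make an lmer model predict for new combinations of values by passing allow.new.levels = TRUE to the predict() function.
My question, then, is: how can I specify this in workflow() or fit_resamples()?
I guessed that adding step_novel() to the recipe and default_recipe_blueprint(allow_novel_levels = TRUE) to the add_recipe() call would be enough (as in ), but it still doesn't seem to work.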
library(tidyverse)
library(tidymodels)
library(multilevelmod)
set.seed(1243)
data(mpg, package = "ggplot2")
training = mpg %>%
initial_split() %>%
training() %>%
mutate(manufacturer = manufacturer %>% as_factor(),
model = model %>% as_factor())
training_folds = training %>%
validation_split()
lmm_model = linear_reg() %>%
set_engine("lmer")
lmm_recipe = recipe(cty ~ year + manufacturer + model, data = training) %>%
step_novel(manufacturer, model)
lmm_formula = cty ~ year + (1|manufacturer/model)
lmm_workflow = workflow() %>%
# see:
add_recipe(lmm_recipe, blueprint = hardhat::default_recipe_blueprint(allow_novel_levels = TRUE)) %>%
add_model(lmm_model, formula = lmm_formula)
# A simple fit works
fit(lmm_workflow, training)
#> == Workflow [trained] ==========================================================
#> Preprocessor: Recipe
#> Model: linear_reg()
#>
#> -- Preprocessor ----------------------------------------------------------------
#> 1 Recipe Step
#>
#> * step_novel()
#>
#> -- Model -----------------------------------------------------------------------
#> Linear mixed model fit by REML ['lmerMod']
#> Formula: cty ~ year + (1 | manufacturer/model)
#> Data: data
#> REML criterion at convergence: 864.9986
#> Random effects:
#> Groups Name Std.Dev.
#> model:manufacturer (Intercept) 2.881
#> manufacturer (Intercept) 2.675
#> Residual 2.181
#> Number of obs: 175, groups: model:manufacturer, 38; manufacturer, 15
#> Fixed Effects:
#> (Intercept) year
#> -8.06202 0.01228
# A fit with resamplings doesn't work:
fit_resamples(lmm_workflow, resamples = training_folds)
#> Warning: package 'lme4' was built under R version 4.1.1
#> x validation: preprocessor 1/1, model 1/1 (predictions): Error:
#> ! Assigned data `facto...
#> Warning: All models failed. See the `.notes` column.
#> # Resampling results
#> # Validation Set Split (0.75/0.25)
#> # A tibble: 1 x 4
#> splits id .metrics .notes
#> <list> <chr> <list> <list>
#> 1 <split [131/44]> validation <NULL> <tibble [1 x 3]>
#>
#> There were issues with some computations:
#>
#> - Error(s) x1: ! Assigned data `factor(lvl[1], levels = lvl)` must be compatibl...
#>
#> Use `collect_notes(object)` for more information.
# It seems that the problem is that the combinations of factor levels differ
# between analysis and assessment set
analysis_set = analysis(training_folds$splits[[1]])
assessment_set = assessment(training_folds$splits[[1]])
identical(
analysis_set %>% distinct(manufacturer, model),
assessment_set %>% distinct(manufacturer, model)
)
#> [1] FALSE
# directly fitting the model on the analysis set
analysis_fit = lme4::lmer(formula = lmm_formula, data = analysis_set)
# predicting the values for the missing combinations of levels is actually possible
# see:
assessment_predict = analysis_fit %>%
predict(training_folds$splits[[1]] %>% assessment(),
allow.new.levels = TRUE)
Created on 2022-05-24 by the reprex package (v2.0.1)
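The behavior of allow.new.levels mentioned in the question can be checked in isolation, outside of any workflow. The following is only a minimal sketch using lme4 directly and its bundled sleepstudy data (the data set and the held-out subject are illustrative assumptions, not part of the original reprex). It shows that predict() fails by default when newdata contains a grouping level the model never saw, and that allow.new.levels = TRUE falls back to the population-level (fixed effects only) prediction for such rows.

```r
library(lme4)

# sleepstudy ships with lme4; hold out one subject so that it is
# a grouping level the fitted model has never seen (illustrative choice)
train <- subset(sleepstudy, Subject != "308")
test  <- subset(sleepstudy, Subject == "308")

fit <- lmer(Reaction ~ Days + (1 | Subject), data = train)

# Default behavior: predict() stops because of the unseen level
res <- try(predict(fit, newdata = test), silent = TRUE)
inherits(res, "try-error")

# With allow.new.levels = TRUE the random effect for the unseen level
# is treated as zero, so only the fixed effects contribute
pred <- predict(fit, newdata = test, allow.new.levels = TRUE)
fixed_only <- fixef(fit)[["(Intercept)"]] + fixef(fit)[["Days"]] * test$Days
all.equal(unname(pred), fixed_only)
```

This only demonstrates what lme4 itself does; whether and how the argument can be forwarded through workflow() or fit_resamples() is the open question here.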
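I think the problem is that you end up with different factor levels in your training and testing sets (called analysis and assessment sets for resamples in tidymodels):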
library(tidymodels)
data(mpg, package = "ggplot2")
training <- mpg %>%
initial_split() %>%
training()
training_folds <- training %>%
validation_split()
training %>% count(manufacturer, model)
#> # A tibble: 38 × 3
#> manufacturer model n
#> <chr> <chr> <int>
#> 1 audi a4 5
#> 2 audi a4 quattro 5
#> 3 audi a6 quattro 3
#> 4 chevrolet c1500 suburban 2wd 4
#> 5 chevrolet corvette 4
#> 6 chevrolet k1500 tahoe 4wd 3
#> 7 chevrolet malibu 5
#> 8 dodge caravan 2wd 6
#> 9 dodge dakota pickup 4wd 8
#> 10 dodge durango 4wd 5
#> # … with 28 more rows
analysis(training_folds$splits[[1]]) %>% count(manufacturer, model)
#> # A tibble: 37 × 3
#> manufacturer model n
#> <chr> <chr> <int>
#> 1 audi a4 1
#> 2 audi a4 quattro 4
#> 3 audi a6 quattro 3
#> 4 chevrolet c1500 suburban 2wd 4
#> 5 chevrolet corvette 3
#> 6 chevrolet k1500 tahoe 4wd 2
#> 7 chevrolet malibu 4
#> 8 dodge caravan 2wd 5
#> 9 dodge dakota pickup 4wd 5
#> 10 dodge durango 4wd 4
#> # … with 27 more rows
assessment(training_folds$splits[[1]]) %>% count(manufacturer, model)
#> # A tibble: 28 × 3
#> manufacturer model n
#> <chr> <chr> <int>
#> 1 audi a4 4
#> 2 audi a4 quattro 1
#> 3 chevrolet corvette 1
#> 4 chevrolet k1500 tahoe 4wd 1
#> 5 chevrolet malibu 1
#> 6 dodge caravan 2wd 1
#> 7 dodge dakota pickup 4wd 3
#> 8 dodge durango 4wd 1
#> 9 dodge ram 1500 pickup 4wd 1
#> 10 ford expedition 2wd 3
#> # … with 18 more rows
Created on 2022-05-23 by the reprex package (v2.0.1)
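When you fit the model once on all the training data, you have 38 combinations of manufacturer and model. When you fit using the validation split, there are 37 combinations in the training/analysis set and 28 in the testing/assessment set. Some of these do not overlap, so the model cannot predict for observations that appear in the assessment set but not in the analysis set.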
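To see exactly which combinations are causing trouble, one option is an anti-join of the assessment set against the analysis set. The sketch below uses two small hand-made tibbles as stand-ins for the two sets (the rows are hypothetical and chosen only for illustration), so that it runs on its own.

```r
library(dplyr)

# Hypothetical stand-ins for the analysis and assessment sets
analysis_set <- tibble(
  manufacturer = c("audi", "audi", "dodge"),
  model        = c("a4", "a4 quattro", "durango 4wd")
)
assessment_set <- tibble(
  manufacturer = c("audi", "dodge", "ford"),
  model        = c("a4", "ram 1500 pickup 4wd", "expedition 2wd")
)

# Combinations that occur in the assessment set but never in the analysis set;
# these are the rows the fitted model has no random-effect level for
unseen <- assessment_set %>%
  distinct(manufacturer, model) %>%
  anti_join(analysis_set, by = c("manufacturer", "model"))

unseen
```

With the real split, the toy tibbles would be replaced by analysis(training_folds$splits[[1]]) and assessment(training_folds$splits[[1]]); a non-empty result confirms that the assessment set contains groups the model was not fit on.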