当使用 purrr::map() 将模型拟合到新的列表列时,return 指示字符串如果拟合错误
When using purrr::map() to fit a model onto a new list-column, return indicative strings if fitting errors
我有数据要使用 purrr::map()
按组建模。有时,一个(或多个)子组的模型拟合失败。例如,错误可能是 contrasts can be applied only to factors with 2 or more levels
,或 glm.fit: algorithm did not converge
之类的警告,或者其他。
因为错误导致整个代码失败,所以我想创建一个条件:如果子组出现拟合错误,则该子组 return "string-of-choice"
;但对于确实产生模型的子组,对他们来说 return 模型对象。即使有关于收敛的警告,我更喜欢 "string-of-choice-2"
而不是非收敛模型。
虽然我的问题很笼统,但我提供了一些玩具数据示例以供演示。
例子
这是一个生成数据的函数。在此数据中,3 列对应人们回答的 3 个问题:
- 他们是否喜欢椰子(是或否,编码为
0
/1
。)
- 他们是否喜欢茄子
- 他们是否喜欢西红柿
此外,我们还有 id
列和 gender
列。
在这个数据的后续两个版本中,关于爱吃西红柿的栏目可能全是NA
,也可能全是0
。
generate_data <- function(x) {
data.frame(id = 1:2000,
do_u_love_coconut = sample(c(0, 1, NA), 2000, replace = TRUE, prob = c(0.2, 0.4, 0.4)),
do_u_love_eggplant = sample(c(0, 1, NA), 2000, replace = TRUE, prob = c(0.1, 0.5, 0.4)),
do_u_love_tomatoes = rep(x, 2000),
gender = sample(c("male", "female"), 2000, replace = TRUE))
}
## generate the data
set.seed(2021)
df_tomatoes_is_NA <- generate_data(NA)
df_tomatoes_is_zero <- generate_data(0)
## preview the data
library(tibble)
as_tibble(df_tomatoes_is_NA)
## # A tibble: 2,000 x 5
## id do_u_love_coconut do_u_love_eggplant do_u_love_tomatoes gender
## <int> <dbl> <dbl> <lgl> <chr>
## 1 1 NA NA NA male
## 2 2 NA NA NA male
## 3 3 NA NA NA male
## 4 4 1 1 NA female
## 5 5 NA 1 NA female
## 6 6 NA NA NA male
## 7 7 NA NA NA female
## 8 8 1 1 NA male
## 9 9 0 1 NA female
## 10 10 0 1 NA female
## # ... with 1,990 more rows
as_tibble(df_tomatoes_is_zero)
## # A tibble: 2,000 x 5
## id do_u_love_coconut do_u_love_eggplant do_u_love_tomatoes gender
## <int> <dbl> <dbl> <dbl> <chr>
## 1 1 NA 0 0 male
## 2 2 NA NA 0 male
## 3 3 1 NA 0 female
## 4 4 0 1 0 female
## 5 5 1 0 0 male
## 6 6 NA 0 0 female
## 7 7 1 1 0 male
## 8 8 1 NA 0 male
## 9 9 1 NA 0 male
## 10 10 0 1 0 female
## # ... with 1,990 more rows
拟合模型
所以现在我想按性别拟合每个 coconut/eggplant/tomatoes 的模型。
- 数据版本 1
library(tidyr)
library(purrr)
library(dplyr)
df_tomatoes_is_NA %>%
pivot_longer(starts_with("do_u")) %>%
group_by(name) %>%
nest() %>%
mutate(fit_and_predict = map(data,
~ glm(formula = value ~ gender, data = .x, family = "binomial")))
Error: Problem with mutate()
input fit_and_predict
.
x contrasts can be applied only to factors with 2 or more levels
i Input fit_and_predict
is map(data, ~glm(formula = value ~ gender, data = .x, family = "binomial"))
.
i The error occurred in group 3: name = "do_u_love_tomatoes".
- 版本 2 的数据
df_tomatoes_is_zero %>%
pivot_longer(starts_with("do_u")) %>%
group_by(name) %>%
nest() %>%
mutate(fit_and_predict = map(data,
~ glm(formula = value ~ gender, data = .x, family = "binomial")))
# A tibble: 3 x 3
# Groups: name [3]
name data fit_and_predict
<chr> <list> <list>
1 do_u_love_coconut <tibble [2,000 x 3]> <glm>
2 do_u_love_eggplant <tibble [2,000 x 3]> <glm>
3 do_u_love_tomatoes <tibble [2,000 x 3]> <glm>
Warning message:
Problem with mutate()
input fit_and_predict
.
i glm.fit: algorithm did not converge
i Input fit_and_predict
is map(data, ~glm(formula = value ~ gender, data = .x, family = "binomial"))
.
i The error occurred in group 3: name = "do_u_love_tomatoes".
我的问题
我想考虑潜在的拟合错误并决定此类失败中的 return 值应该是多少。例如,对于诸如 contrasts can be applied only to factors with 2 or more levels
之类的错误,我希望将 "contrasts_error"
作为 return 值。预期输出例如:
df_tomatoes_is_NA %>%
pivot_longer(starts_with("do_u")) %>%
group_by(name) %>%
nest() %>%
mutate(fit_and_predict = map(data,
~ glm(formula = value ~ gender, data = .x, family = "binomial")))
## # A tibble: 3 x 3
## # Groups: name [3]
## name data fit_and_predict
## <chr> <list> <list>
## 1 do_u_love_coconut <tibble [2,000 x 3]> <glm>
## 2 do_u_love_eggplant <tibble [2,000 x 3]> <glm>
## 3 do_u_love_tomatoes <tibble [2,000 x 3]> <chr[1]> <-- "contrasts_error"
如果有收敛警告,例如 glm.fit: algorithm did not converge
,我希望:
df_tomatoes_is_zero %>%
pivot_longer(starts_with("do_u")) %>%
group_by(name) %>%
nest() %>%
mutate(fit_and_predict = map(data,
~ glm(formula = value ~ gender, data = .x, family = "binomial")))
## # A tibble: 3 x 3
## # Groups: name [3]
## name data fit_and_predict
## <chr> <list> <list>
## 1 do_u_love_coconut <tibble [2,000 x 3]> <glm>
## 2 do_u_love_eggplant <tibble [2,000 x 3]> <glm>
## 3 do_u_love_tomatoes <tibble [2,000 x 3]> <chr[1]> <-- "convergence_warning"
编辑
澄清一下,由于存在许多潜在的拟合错误和警告,解决方案总是需要我在代码中指定每个潜在的 error/warning 及其各自的字符串。我在上面给出了两个例子(对比错误和收敛警告)。
您可以使用 tryCatch
处理此问题并捕获所有警告和错误以及它们的 return 相应输出。
apply_glm <- function(data, formula) {
tryCatch(glm(formula = formula, data = data, family = "binomial"), error = function(e) {
if(e$message == "contrasts can be applied only to factors with 2 or more levels")
return('contrasts error')
}, warning = function(w) {
if(w$message == "glm.fit: algorithm did not converge")
return('convergence warning')
})
}
您可以使用 if
/else if
或 case_when
语句扩展对错误和警告消息的处理。
为数据集应用函数 df_tomatoes_is_NA
:
library(tidyverse)
df_tomatoes_is_NA %>%
pivot_longer(starts_with("do_u")) %>%
group_by(name) %>%
nest() %>%
mutate(fit_and_predict = map(data, apply_glm, value ~ gender)) -> result1
result1
# name data fit_and_predict
# <chr> <list> <list>
#1 do_u_love_coconut <tibble [2,000 × 3]> <glm>
#2 do_u_love_eggplant <tibble [2,000 × 3]> <glm>
#3 do_u_love_tomatoes <tibble [2,000 × 3]> <chr [1]>
result1$fit_and_predict
[[1]]
Call: glm(formula = value ~ gender, family = "binomial", data = data)
Coefficients:
(Intercept) gendermale
0.68837 -0.08838
Degrees of Freedom: 1214 Total (i.e. Null); 1213 Residual
(785 observations deleted due to missingness)
Null Deviance: 1564
Residual Deviance: 1564 AIC: 1568
[[2]]
Call: glm(formula = value ~ gender, family = "binomial", data = data)
Coefficients:
(Intercept) gendermale
1.61343 -0.01754
Degrees of Freedom: 1218 Total (i.e. Null); 1217 Residual
(781 observations deleted due to missingness)
Null Deviance: 1101
Residual Deviance: 1101 AIC: 1105
[[3]]
[1] "contrasts error"
对于数据集df_tomatoes_is_zero
:
df_tomatoes_is_zero %>%
pivot_longer(starts_with("do_u")) %>%
group_by(name) %>%
nest() %>%
mutate(fit_and_predict = map(data,apply_glm)) -> result2
result2
# name data fit_and_predict
# <chr> <list> <list>
#1 do_u_love_coconut <tibble [2,000 × 3]> <glm>
#2 do_u_love_eggplant <tibble [2,000 × 3]> <glm>
#3 do_u_love_tomatoes <tibble [2,000 × 3]> <chr [1]>
result2$fit_and_predict
[[1]]
Call: glm(formula = value ~ gender, family = "binomial", data = data)
Coefficients:
(Intercept) gendermale
0.49372 0.07442
Degrees of Freedom: 1190 Total (i.e. Null); 1189 Residual
(809 observations deleted due to missingness)
Null Deviance: 1570
Residual Deviance: 1570 AIC: 1574
[[2]]
Call: glm(formula = value ~ gender, family = "binomial", data = data)
Coefficients:
(Intercept) gendermale
1.60539 -0.03636
Degrees of Freedom: 1177 Total (i.e. Null); 1176 Residual
(822 observations deleted due to missingness)
Null Deviance: 1073
Residual Deviance: 1073 AIC: 1077
[[3]]
[1] "convergence warning"
我有数据要使用 purrr::map()
按组建模。有时,一个(或多个)子组的模型拟合失败。例如,错误可能是 contrasts can be applied only to factors with 2 or more levels
,或 glm.fit: algorithm did not converge
之类的警告,或者其他。
因为错误导致整个代码失败,所以我想创建一个条件:如果子组出现拟合错误,则该子组 return "string-of-choice"
;但对于确实产生模型的子组,对他们来说 return 模型对象。即使有关于收敛的警告,我更喜欢 "string-of-choice-2"
而不是非收敛模型。
虽然我的问题很笼统,但我提供了一些玩具数据示例以供演示。
例子
这是一个生成数据的函数。在此数据中,3 列对应人们回答的 3 个问题:
- 他们是否喜欢椰子(是或否,编码为
0
/1
。) - 他们是否喜欢茄子
- 他们是否喜欢西红柿
此外,我们还有 id
列和 gender
列。
在这个数据的后续两个版本中,关于爱吃西红柿的栏目可能全是NA
,也可能全是0
。
generate_data <- function(x) {
data.frame(id = 1:2000,
do_u_love_coconut = sample(c(0, 1, NA), 2000, replace = TRUE, prob = c(0.2, 0.4, 0.4)),
do_u_love_eggplant = sample(c(0, 1, NA), 2000, replace = TRUE, prob = c(0.1, 0.5, 0.4)),
do_u_love_tomatoes = rep(x, 2000),
gender = sample(c("male", "female"), 2000, replace = TRUE))
}
## generate the data
set.seed(2021)
df_tomatoes_is_NA <- generate_data(NA)
df_tomatoes_is_zero <- generate_data(0)
## preview the data
library(tibble)
as_tibble(df_tomatoes_is_NA)
## # A tibble: 2,000 x 5
## id do_u_love_coconut do_u_love_eggplant do_u_love_tomatoes gender
## <int> <dbl> <dbl> <lgl> <chr>
## 1 1 NA NA NA male
## 2 2 NA NA NA male
## 3 3 NA NA NA male
## 4 4 1 1 NA female
## 5 5 NA 1 NA female
## 6 6 NA NA NA male
## 7 7 NA NA NA female
## 8 8 1 1 NA male
## 9 9 0 1 NA female
## 10 10 0 1 NA female
## # ... with 1,990 more rows
as_tibble(df_tomatoes_is_zero)
## # A tibble: 2,000 x 5
## id do_u_love_coconut do_u_love_eggplant do_u_love_tomatoes gender
## <int> <dbl> <dbl> <dbl> <chr>
## 1 1 NA 0 0 male
## 2 2 NA NA 0 male
## 3 3 1 NA 0 female
## 4 4 0 1 0 female
## 5 5 1 0 0 male
## 6 6 NA 0 0 female
## 7 7 1 1 0 male
## 8 8 1 NA 0 male
## 9 9 1 NA 0 male
## 10 10 0 1 0 female
## # ... with 1,990 more rows
拟合模型
所以现在我想按性别拟合每个 coconut/eggplant/tomatoes 的模型。
- 数据版本 1
library(tidyr)
library(purrr)
library(dplyr)
df_tomatoes_is_NA %>%
pivot_longer(starts_with("do_u")) %>%
group_by(name) %>%
nest() %>%
mutate(fit_and_predict = map(data,
~ glm(formula = value ~ gender, data = .x, family = "binomial")))
Error: Problem with
mutate()
inputfit_and_predict
.
x contrasts can be applied only to factors with 2 or more levels
i Inputfit_and_predict
ismap(data, ~glm(formula = value ~ gender, data = .x, family = "binomial"))
.
i The error occurred in group 3: name = "do_u_love_tomatoes".
- 版本 2 的数据
df_tomatoes_is_zero %>%
pivot_longer(starts_with("do_u")) %>%
group_by(name) %>%
nest() %>%
mutate(fit_and_predict = map(data,
~ glm(formula = value ~ gender, data = .x, family = "binomial")))
# A tibble: 3 x 3
# Groups: name [3]
name data fit_and_predict
<chr> <list> <list>
1 do_u_love_coconut <tibble [2,000 x 3]> <glm>
2 do_u_love_eggplant <tibble [2,000 x 3]> <glm>
3 do_u_love_tomatoes <tibble [2,000 x 3]> <glm>
Warning message:
Problem withmutate()
inputfit_and_predict
.
i glm.fit: algorithm did not converge
i Inputfit_and_predict
ismap(data, ~glm(formula = value ~ gender, data = .x, family = "binomial"))
.
i The error occurred in group 3: name = "do_u_love_tomatoes".
我的问题
我想考虑潜在的拟合错误并决定此类失败中的 return 值应该是多少。例如,对于诸如 contrasts can be applied only to factors with 2 or more levels
之类的错误,我希望将 "contrasts_error"
作为 return 值。预期输出例如:
df_tomatoes_is_NA %>%
pivot_longer(starts_with("do_u")) %>%
group_by(name) %>%
nest() %>%
mutate(fit_and_predict = map(data,
~ glm(formula = value ~ gender, data = .x, family = "binomial")))
## # A tibble: 3 x 3
## # Groups: name [3]
## name data fit_and_predict
## <chr> <list> <list>
## 1 do_u_love_coconut <tibble [2,000 x 3]> <glm>
## 2 do_u_love_eggplant <tibble [2,000 x 3]> <glm>
## 3 do_u_love_tomatoes <tibble [2,000 x 3]> <chr[1]> <-- "contrasts_error"
如果有收敛警告,例如 glm.fit: algorithm did not converge
,我希望:
df_tomatoes_is_zero %>%
pivot_longer(starts_with("do_u")) %>%
group_by(name) %>%
nest() %>%
mutate(fit_and_predict = map(data,
~ glm(formula = value ~ gender, data = .x, family = "binomial")))
## # A tibble: 3 x 3
## # Groups: name [3]
## name data fit_and_predict
## <chr> <list> <list>
## 1 do_u_love_coconut <tibble [2,000 x 3]> <glm>
## 2 do_u_love_eggplant <tibble [2,000 x 3]> <glm>
## 3 do_u_love_tomatoes <tibble [2,000 x 3]> <chr[1]> <-- "convergence_warning"
编辑
澄清一下,由于存在许多潜在的拟合错误和警告,解决方案总是需要我在代码中指定每个潜在的 error/warning 及其各自的字符串。我在上面给出了两个例子(对比错误和收敛警告)。
您可以使用 tryCatch
处理此问题并捕获所有警告和错误以及它们的 return 相应输出。
apply_glm <- function(data, formula) {
tryCatch(glm(formula = formula, data = data, family = "binomial"), error = function(e) {
if(e$message == "contrasts can be applied only to factors with 2 or more levels")
return('contrasts error')
}, warning = function(w) {
if(w$message == "glm.fit: algorithm did not converge")
return('convergence warning')
})
}
您可以使用 if
/else if
或 case_when
语句扩展对错误和警告消息的处理。
为数据集应用函数 df_tomatoes_is_NA
:
library(tidyverse)
df_tomatoes_is_NA %>%
pivot_longer(starts_with("do_u")) %>%
group_by(name) %>%
nest() %>%
mutate(fit_and_predict = map(data, apply_glm, value ~ gender)) -> result1
result1
# name data fit_and_predict
# <chr> <list> <list>
#1 do_u_love_coconut <tibble [2,000 × 3]> <glm>
#2 do_u_love_eggplant <tibble [2,000 × 3]> <glm>
#3 do_u_love_tomatoes <tibble [2,000 × 3]> <chr [1]>
result1$fit_and_predict
[[1]]
Call: glm(formula = value ~ gender, family = "binomial", data = data)
Coefficients:
(Intercept) gendermale
0.68837 -0.08838
Degrees of Freedom: 1214 Total (i.e. Null); 1213 Residual
(785 observations deleted due to missingness)
Null Deviance: 1564
Residual Deviance: 1564 AIC: 1568
[[2]]
Call: glm(formula = value ~ gender, family = "binomial", data = data)
Coefficients:
(Intercept) gendermale
1.61343 -0.01754
Degrees of Freedom: 1218 Total (i.e. Null); 1217 Residual
(781 observations deleted due to missingness)
Null Deviance: 1101
Residual Deviance: 1101 AIC: 1105
[[3]]
[1] "contrasts error"
对于数据集df_tomatoes_is_zero
:
df_tomatoes_is_zero %>%
pivot_longer(starts_with("do_u")) %>%
group_by(name) %>%
nest() %>%
mutate(fit_and_predict = map(data,apply_glm)) -> result2
result2
# name data fit_and_predict
# <chr> <list> <list>
#1 do_u_love_coconut <tibble [2,000 × 3]> <glm>
#2 do_u_love_eggplant <tibble [2,000 × 3]> <glm>
#3 do_u_love_tomatoes <tibble [2,000 × 3]> <chr [1]>
result2$fit_and_predict
[[1]]
Call: glm(formula = value ~ gender, family = "binomial", data = data)
Coefficients:
(Intercept) gendermale
0.49372 0.07442
Degrees of Freedom: 1190 Total (i.e. Null); 1189 Residual
(809 observations deleted due to missingness)
Null Deviance: 1570
Residual Deviance: 1570 AIC: 1574
[[2]]
Call: glm(formula = value ~ gender, family = "binomial", data = data)
Coefficients:
(Intercept) gendermale
1.60539 -0.03636
Degrees of Freedom: 1177 Total (i.e. Null); 1176 Residual
(822 observations deleted due to missingness)
Null Deviance: 1073
Residual Deviance: 1073 AIC: 1077
[[3]]
[1] "convergence warning"