How come releveling a factor variable gives wrong output (predictions table) in ggeffects::ggemmeans()?
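I am using ggeffects::ggemmeans() to get predictions from a model, but I can't tell whether I have found a bug or am doing something wrong. When a factor variable is used as a predictor in the model, the output of ggemmeans() gets mixed up after the factor is releveled.
Example
Below are two scenarios, a and b. In each one I convert a data column to a factor, fit a model with lm(), and finally get predictions with ggemmeans().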
library(ggplot2)
library(dplyr)
library(emmeans)
library(ggeffects)
# scenario a
# step a1 -- convert manufacturer col to factor
my_mpg_manuf_as_fac_a <-
mpg %>%
mutate(across(manufacturer, factor))
levels(my_mpg_manuf_as_fac_a$manufacturer) ## the levels are ordered alphabetically
#> [1] "audi" "chevrolet" "dodge" "ford" "honda"
#> [6] "hyundai" "jeep" "land rover" "lincoln" "mercury"
#> [11] "nissan" "pontiac" "subaru" "toyota" "volkswagen"
# step a2 -- model and get predictions
pred_a <-
my_mpg_manuf_as_fac_a %>%
lm(cty ~ manufacturer, data = .) %>%
ggemmeans(terms = "manufacturer")
pred_a
#> # Predicted values of cty
#> # x = manufacturer
#>
#> x | Predicted | 95% CI
#> ---------------------------------------
#> audi | 17.61 | [16.25, 18.97]
#> dodge | 13.14 | [12.19, 14.08]
#> ford | 14.00 | [12.85, 15.15]
#> hyundai | 18.64 | [17.10, 20.18]
#> land rover | 11.50 | [ 8.62, 14.38]
#> mercury | 13.25 | [10.37, 16.13]
#> pontiac | 17.00 | [14.42, 19.58]
#> volkswagen | 20.93 | [19.82, 22.04]
# scenario b
# step b1 -- convert manufacturer col to factor (same as step a1)
my_mpg_manuf_as_fac_b <-
mpg %>%
mutate(across(manufacturer, factor))
# step b2 -- change the order of levels in manufacturer
levels(my_mpg_manuf_as_fac_b$manufacturer) <- sort(levels(my_mpg_manuf_as_fac_b$manufacturer), decreasing = TRUE)
levels(my_mpg_manuf_as_fac_b$manufacturer) ## order of levels is now reversed
#> [1] "volkswagen" "toyota" "subaru" "pontiac" "nissan"
#> [6] "mercury" "lincoln" "land rover" "jeep" "hyundai"
#> [11] "honda" "ford" "dodge" "chevrolet" "audi"
# step b3 -- model and get predictions
pred_b <-
my_mpg_manuf_as_fac_b %>%
lm(cty ~ manufacturer, data = .) %>%
ggemmeans(terms = "manufacturer")
pred_b
#> # Predicted values of cty
#> # x = manufacturer
#>
#> x | Predicted | 95% CI
#> ---------------------------------------
#> volkswagen | 17.61 | [16.25, 18.97]
#> subaru | 13.14 | [12.19, 14.08]
#> pontiac | 14.00 | [12.85, 15.15]
#> mercury | 18.64 | [17.10, 20.18]
#> land rover | 11.50 | [ 8.62, 14.38]
#> hyundai | 13.25 | [10.37, 16.13]
#> ford | 17.00 | [14.42, 19.58]
#> audi | 20.93 | [19.82, 22.04]
Created on 2021-05-03 by the reprex package (v0.3.0)
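When we compare pred_a and pred_b, it is easy to see that the values in the Predicted and 95% CI columns stay the same even though the order of the names in the x column has changed.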
pred_a
## # Predicted values of cty
## # x = manufacturer
## x | Predicted | 95% CI
## ---------------------------------------
## audi | 17.61 | [16.25, 18.97]
## dodge | 13.14 | [12.19, 14.08]
## ford | 14.00 | [12.85, 15.15]
## hyundai | 18.64 | [17.10, 20.18]
## land rover | 11.50 | [ 8.62, 14.38]
## mercury | 13.25 | [10.37, 16.13]
## pontiac | 17.00 | [14.42, 19.58]
## volkswagen | 20.93 | [19.82, 22.04]
pred_b
## # Predicted values of cty
## # x = manufacturer
## x | Predicted | 95% CI
## ---------------------------------------
## volkswagen | 17.61 | [16.25, 18.97]
## subaru | 13.14 | [12.19, 14.08]
## pontiac | 14.00 | [12.85, 15.15]
## mercury | 18.64 | [17.10, 20.18]
## land rover | 11.50 | [ 8.62, 14.38]
## hyundai | 13.25 | [10.37, 16.13]
## ford | 17.00 | [14.42, 19.58]
## audi | 20.93 | [19.82, 22.04]
Is this a bug, or am I doing something wrong?
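You should relevel with the factor() function instead, because assigning to levels() doesn't really look at the underlying data; it only replaces the labels. When you assign to levels(), your whole data changes: audi becomes volkswagen, and so on. By passing the original vector to factor() with the new level order, you keep the values themselves.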
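One way to see why the two approaches differ is to look at how a factor is stored: integer codes plus a vector of labels. The following is a minimal sketch on a small made-up vector (the names x, relabelled and reordered are illustrative only, not from the question):

```r
# A factor is stored as integer codes that index into its levels.
x <- factor(c("audi", "audi", "subaru", "volkswagen"))
as.integer(x)

# Assigning to levels() swaps the labels but leaves the codes untouched,
# so every observation ends up carrying a different label.
relabelled <- x
levels(relabelled) <- rev(levels(relabelled))

# factor(..., levels = ) keeps each observation's label and recomputes
# the codes to match the new level order.
reordered <- factor(x, levels = rev(levels(x)))

identical(as.integer(relabelled), as.integer(x))      # codes are unchanged
identical(as.character(relabelled), as.character(x))  # but the values differ
identical(as.character(reordered), as.character(x))   # values are preserved
```

Because lm() only sees the codes and labels it is given, a model fitted after the levels() assignment is fitted to relabelled data, which would explain identical predictions appearing under different names.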
Data:
manufacturers=c("audi","chevrolet","subaru","toyota","volkswagen")
df = data.frame(mpg = runif(length(manufacturers)*2, 30, 50), manufacturer = rep(manufacturers, each = 2), stringsAsFactors = TRUE)
Before:
> df$manufacturer
[1] audi audi chevrolet chevrolet subaru subaru toyota toyota volkswagen volkswagen
Levels: audi chevrolet subaru toyota volkswagen
After:
df$manufacturer = factor(df$manufacturer, levels = sort(levels(df$manufacturer),decreasing = T))
> df$manufacturer
[1] audi audi chevrolet chevrolet subaru subaru toyota toyota volkswagen volkswagen
Levels: volkswagen toyota subaru chevrolet audi
Compare this with the following:
df = data.frame(mpg = runif(length(manufacturers)*2, 30, 50), manufacturer = rep(manufacturers, each = 2), stringsAsFactors = TRUE)
levels(df$manufacturer) = sort(levels(df$manufacturer),decreasing = T)
> df$manufacturer
[1] volkswagen volkswagen toyota toyota subaru subaru chevrolet chevrolet audi audi
Levels: volkswagen toyota subaru chevrolet audi
This renamed the entire vector.
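To check that releveling with factor() leaves model predictions attached to the right groups, something like the sketch below could be used. It is only an illustration under assumptions: it uses the built-in iris data as a stand-in for mpg so that it runs without extra packages, and base predict() rather than ggemmeans().

```r
# Stand-in data (assumption): iris instead of mpg, Species instead of manufacturer.
dat <- iris

# Reverse the level order while keeping every observation's value.
dat$Species <- factor(dat$Species, levels = rev(levels(dat$Species)))

fit <- lm(Sepal.Length ~ Species, data = dat)

# One prediction per level, in the new level order.
new_data <- data.frame(
  Species = factor(levels(dat$Species), levels = levels(dat$Species))
)
pred <- predict(fit, newdata = new_data)
names(pred) <- levels(dat$Species)

# Raw group means from the untouched data, for comparison.
raw_means <- tapply(iris$Sepal.Length, iris$Species, mean)

# Each prediction should match the raw mean of the group with the same name.
all.equal(as.numeric(pred), as.numeric(raw_means[names(pred)]))
```

If the same check is run after assigning to levels() instead, the predictions should no longer line up with the raw means by name, which is the symptom described in the question.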