Why does releveling a factor variable give wrong output (predictions table) in ggeffects::ggemmeans()?

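I am using ggeffects::ggemmeans() to get predictions from a model, and I can't tell whether I have found a bug or am doing something wrong. When a factor variable is used as a predictor in the model, the output of ggemmeans() gets scrambled once the factor is releveled.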

Example

Below are two scenarios, a and b. In each one I convert a data column to a factor, fit a model with lm(), and finally get predictions with ggemmeans().

library(ggplot2)
library(dplyr)
library(emmeans)
library(ggeffects)

# scenario a
# step a1 -- convert manufacturer col to factor
my_mpg_manuf_as_fac_a <- 
  mpg %>%
  mutate(across(manufacturer, factor))

levels(my_mpg_manuf_as_fac_a$manufacturer) ## the levels are ordered alphabetically

#>  [1] "audi"       "chevrolet"  "dodge"      "ford"       "honda"     
#>  [6] "hyundai"    "jeep"       "land rover" "lincoln"    "mercury"   
#> [11] "nissan"     "pontiac"    "subaru"     "toyota"     "volkswagen"

# step a2 -- model and get predictions
pred_a <-
  my_mpg_manuf_as_fac_a %>%
  lm(cty ~ manufacturer, data = .) %>%
  ggemmeans(terms = "manufacturer")

pred_a

#> # Predicted values of cty
#> # x = manufacturer
#> 
#> x          | Predicted |         95% CI
#> ---------------------------------------
#> audi       |     17.61 | [16.25, 18.97]
#> dodge      |     13.14 | [12.19, 14.08]
#> ford       |     14.00 | [12.85, 15.15]
#> hyundai    |     18.64 | [17.10, 20.18]
#> land rover |     11.50 | [ 8.62, 14.38]
#> mercury    |     13.25 | [10.37, 16.13]
#> pontiac    |     17.00 | [14.42, 19.58]
#> volkswagen |     20.93 | [19.82, 22.04]


# scenario b
# step b1 -- convert manufacturer col to factor (same as step a1)
my_mpg_manuf_as_fac_b <- 
  mpg %>%
  mutate(across(manufacturer, factor))

# step b2 -- change the order of levels in manufacturer
levels(my_mpg_manuf_as_fac_b$manufacturer) <- sort(levels(my_mpg_manuf_as_fac_b$manufacturer), decreasing = TRUE)

levels(my_mpg_manuf_as_fac_b$manufacturer) ## order of levels is now reversed

#>  [1] "volkswagen" "toyota"     "subaru"     "pontiac"    "nissan"    
#>  [6] "mercury"    "lincoln"    "land rover" "jeep"       "hyundai"   
#> [11] "honda"      "ford"       "dodge"      "chevrolet"  "audi"

# step b3 -- model and get predictions
pred_b <-
  my_mpg_manuf_as_fac_b %>%
  lm(cty ~ manufacturer, data = .) %>%
  ggemmeans(terms = "manufacturer")

pred_b

#> # Predicted values of cty
#> # x = manufacturer
#> 
#> x          | Predicted |         95% CI
#> ---------------------------------------
#> volkswagen |     17.61 | [16.25, 18.97]
#> subaru     |     13.14 | [12.19, 14.08]
#> pontiac    |     14.00 | [12.85, 15.15]
#> mercury    |     18.64 | [17.10, 20.18]
#> land rover |     11.50 | [ 8.62, 14.38]
#> hyundai    |     13.25 | [10.37, 16.13]
#> ford       |     17.00 | [14.42, 19.58]
#> audi       |     20.93 | [19.82, 22.04]

Created on 2021-05-03 by the reprex package (v0.3.0)

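When we compare pred_a and pred_b, it is easy to see that the values in the Predicted and 95% CI columns stay the same, even though the order of the names in the x column has changed.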

pred_a

## # Predicted values of cty
## # x = manufacturer

## x          | Predicted |         95% CI
## ---------------------------------------
## audi       |     17.61 | [16.25, 18.97]
## dodge      |     13.14 | [12.19, 14.08]
## ford       |     14.00 | [12.85, 15.15]
## hyundai    |     18.64 | [17.10, 20.18]
## land rover |     11.50 | [ 8.62, 14.38]
## mercury    |     13.25 | [10.37, 16.13]
## pontiac    |     17.00 | [14.42, 19.58]
## volkswagen |     20.93 | [19.82, 22.04]


pred_b

## # Predicted values of cty
## # x = manufacturer

## x          | Predicted |         95% CI
## ---------------------------------------
## volkswagen |     17.61 | [16.25, 18.97]
## subaru     |     13.14 | [12.19, 14.08]
## pontiac    |     14.00 | [12.85, 15.15]
## mercury    |     18.64 | [17.10, 20.18]
## land rover |     11.50 | [ 8.62, 14.38]
## hyundai    |     13.25 | [10.37, 16.13]
## ford       |     17.00 | [14.42, 19.58]
## audi       |     20.93 | [19.82, 22.04]

Is this a bug, or am I doing something wrong?

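You should relevel with the factor() function instead, because assigning to levels() never looks at the underlying data: it only replaces the labels, position by position. When you use levels(), your entire data changes: every audi becomes volkswagen, and so on. By passing the original vector to factor() together with the desired levels, you change the order of the levels while preserving the values themselves.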
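To see why the two approaches differ, it can help to look at how a factor is stored: as integer codes plus a vector of labels. The sketch below uses a small made-up vector (not the mpg data) and only base R; the variable names are illustrative.

```r
# Toy vector; the values are made up for illustration
x <- factor(c("audi", "audi", "toyota", "volkswagen"))

as.integer(x)   # underlying integer codes: 1 1 2 3
levels(x)       # labels the codes point to: "audi" "toyota" "volkswagen"

# Assigning to levels() swaps the labels but leaves the codes alone,
# so every observation silently changes its value
relabelled <- x
levels(relabelled) <- rev(levels(relabelled))
as.integer(relabelled)    # codes unchanged: 1 1 2 3
as.character(relabelled)  # "volkswagen" "volkswagen" "toyota" "audi"

# factor(x, levels = ...) reorders the levels and recomputes the codes,
# so every observation keeps its value
reordered <- factor(x, levels = rev(levels(x)))
as.integer(reordered)     # 3 3 2 1
as.character(reordered)   # "audi" "audi" "toyota" "volkswagen"
```

In other words, the model in scenario b was fitted correctly, but to data in which the manufacturer names had already been swapped.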

Data:

manufacturers=c("audi","chevrolet","subaru","toyota","volkswagen")
df = data.frame(mpg = runif(length(manufacturers)*2, 30, 50), manufacturer = rep(manufacturers, each = 2), stringsAsFactors = TRUE)

Before:

> df$manufacturer
[1] audi       audi       chevrolet  chevrolet  subaru     subaru     toyota     toyota     volkswagen volkswagen
Levels: audi chevrolet subaru toyota volkswagen

After:

df$manufacturer = factor(df$manufacturer, levels = sort(levels(df$manufacturer), decreasing = TRUE))
> df$manufacturer
[1] audi       audi       chevrolet  chevrolet  subaru     subaru     toyota     toyota     volkswagen volkswagen
Levels: volkswagen toyota subaru chevrolet audi

Compare this with the following:

df = data.frame(mpg = runif(length(manufacturers)*2, 30, 50), manufacturer = rep(manufacturers, each = 2), stringsAsFactors = TRUE)
levels(df$manufacturer) = sort(levels(df$manufacturer), decreasing = TRUE)

> df$manufacturer
[1] volkswagen volkswagen toyota     toyota     subaru     subaru     chevrolet  chevrolet  audi       audi      
Levels: volkswagen toyota subaru chevrolet audi

The entire vector has been renamed.
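A quick way to check whether a releveling step has corrupted the data is to compare the model's predictions with raw group means computed from the untouched character column. The following is a minimal sketch under stated assumptions: the numbers are made up, the helper pred_by_level() is hypothetical, and base predict() stands in for ggemmeans() so that the example needs no extra packages.

```r
# Toy data (made-up numbers); manufacturer is kept as character for reference
df <- data.frame(
  cty = c(18, 17, 15, 14, 21, 20),
  manufacturer = c("audi", "audi", "ford", "ford", "volkswagen", "volkswagen"),
  stringsAsFactors = FALSE
)

# Reference: raw group means, computed without any factor manipulation
raw_means <- sapply(split(df$cty, df$manufacturer), mean)

# Safe reordering: pass the desired order to factor()
manuf_ok <- factor(df$manufacturer,
                   levels = sort(unique(df$manufacturer), decreasing = TRUE))

# Unsafe "reordering": overwrite the labels of an existing factor
manuf_bad <- factor(df$manufacturer)
levels(manuf_bad) <- sort(levels(manuf_bad), decreasing = TRUE)

# Hypothetical helper: predicted mean per level from a one-factor linear model
pred_by_level <- function(f) {
  fit <- lm(cty ~ f, data = data.frame(cty = df$cty, f = f))
  nd <- data.frame(f = factor(levels(f), levels = levels(f)))
  setNames(predict(fit, newdata = nd), levels(f))
}

pred_ok  <- pred_by_level(manuf_ok)
pred_bad <- pred_by_level(manuf_bad)

# Compare label by label against the raw means
ok_matches  <- isTRUE(all.equal(unname(pred_ok[names(raw_means)]),
                                unname(raw_means)))
bad_matches <- isTRUE(all.equal(unname(pred_bad[names(raw_means)]),
                                unname(raw_means)))
ok_matches   # TRUE: each label keeps its own mean
bad_matches  # FALSE: audi and volkswagen have swapped their means
```

For scenario b in the question, this would amount to replacing the levels(...) <- assignment with a call such as factor(manufacturer, levels = sort(levels(manufacturer), decreasing = TRUE)); forcats::fct_rev() is another option if that package is available.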