重新调整因子和带有效果编码的 glm

Question

我无法理解 glm 的效果编码。例如：

data('mpg')
mpg$trans = as.factor(mpg$trans)
levels(mpg$trans)
[1] "auto(av)"   "auto(l3)"   "auto(l4)"   "auto(l5)"   "auto(l6)"   "auto(s4)"   "auto(s5)"   "auto(s6)"   "manual(m5)" "manual(m6)"

对于效果（或虚拟）编码，glm 将第一级作为参考级，因此在本例中它将是 "auto(av)"。

mpg_glm = glm(hwy ~ trans, data = mpg, contrasts = list(trans = contr.sum))
summary(mpg_glm)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  24.4173     0.7318  33.366  < 2e-16 ***
trans1        3.3827     2.3592   1.434 0.153017    
trans2        2.5827     3.6210   0.713 0.476426    
trans3       -2.4534     0.9157  -2.679 0.007928 ** 
trans4       -3.6993     1.0865  -3.405 0.000784 ***
trans5       -4.4173     2.1743  -2.032 0.043375 *  
trans6        1.2494     2.9866   0.418 0.676105    
trans7        0.9160     2.9866   0.307 0.759341    
trans8        0.7702     1.4517   0.531 0.596262    
trans9        1.8758     0.9845   1.905 0.058011 .

所以我现在认为trans1实际上是second级别（"auto(l3)"），因为第一个是参考级别。为了测试这一点，我重新调整了因子，这样我就会看到第一级的系数 ("auto(av)"):

mpg$trans <- relevel(mpg$trans, ref="auto(l3)")
levels(mpg$trans)
"auto(l3)"   "auto(av)"   "auto(l4)"   "auto(l5)"   "auto(l6)"   "auto(s4)"   "auto(s5)"   "auto(s6)"   "manual(m5)" "manual(m6)"

现在我期待看到第一级的系数，第二级的系数不见了（因为那现在是参考级）：

mpg_glm = glm(hwy ~ trans, data = mpg, contrasts = list(trans = contr.sum))
summary(mpg_glm)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  24.4173     0.7318  33.366  < 2e-16 ***
trans1        2.5827     3.6210   0.713 0.476426    
trans2        3.3827     2.3592   1.434 0.153017    
trans3       -2.4534     0.9157  -2.679 0.007928 ** 
trans4       -3.6993     1.0865  -3.405 0.000784 ***
trans5       -4.4173     2.1743  -2.032 0.043375 *  
trans6        1.2494     2.9866   0.418 0.676105    
trans7        0.9160     2.9866   0.307 0.759341    
trans8        0.7702     1.4517   0.531 0.596262    
trans9        1.8758     0.9845   1.905 0.058011 .

我看到唯一改变的是前2个系数被调换了，所以参考哪个等级？？我在这里错过了什么？

Answer 1

您正在使用 contr.sum，其中所有级别都与最后一个级别进行比较，并且添加了所有系数（截距除外）总和为零的约束。

你可以在里面查看mpg_glm:

mpg_glm = glm(hwy ~ trans, data = mpg, contrasts = list(trans = contr.sum))
mpg_glm$contrasts

           [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
auto(av)      1    0    0    0    0    0    0    0
auto(l3)      0    1    0    0    0    0    0    0
auto(l4)      0    0    1    0    0    0    0    0
auto(l5)      0    0    0    1    0    0    0    0
auto(l6)      0    0    0    0    1    0    0    0
auto(s4)      0    0    0    0    0    1    0    0
auto(s5)      0    0    0    0    0    0    1    0
auto(s6)      0    0    0    0    0    0    0    1
manual(m5)    0    0    0    0    0    0    0    0
manual(m6)   -1   -1   -1   -1   -1   -1   -1   -1
           [,9]
auto(av)      0
auto(l3)      0
auto(l4)      0
auto(l5)      0
auto(l6)      0
auto(s4)      0
auto(s5)      0
auto(s6)      0
manual(m5)    1
manual(m6)   -1

这里有 9 列，它们是模型中的 9 个非截距系数。由于总和为零的约束，它们或多或少是每个级别减去最后一个级别的平均值。最后一层是多余的，这里没有显示：

var_means = with(mpg,tapply(hwy,trans,mean))
cbind(delta_mean = var_means[-length(var_means)]-var_means[length(var_means)],
coef = coef(mpg_glm)[-1])

           delta_mean       coef
auto(av)    3.5894737  3.3827066
auto(l3)    2.7894737  2.5827066
auto(l4)   -2.2466709 -2.4534380
auto(l5)   -3.4925776 -3.6993447
auto(l6)   -4.2105263 -4.4172934
auto(s4)    1.4561404  1.2493733
auto(s5)    1.1228070  0.9160399
auto(s6)    0.9769737  0.7702066
manual(m5)  2.0825771  1.8758101

因此，当您更改第一个级别时，您只是更改了它们出现的顺序，但值并没有改变。您可以轻松查看对比度：

mpg$trans <- relevel(mpg$trans, ref="auto(l3)")
new_glm = glm(hwy ~ trans, data = mpg, contrasts = list(trans = contr.sum))    
new_glm$contrasts
$trans
           [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
auto(l3)      1    0    0    0    0    0    0    0
auto(av)      0    1    0    0    0    0    0    0
auto(l4)      0    0    1    0    0    0    0    0
auto(l5)      0    0    0    1    0    0    0    0
auto(l6)      0    0    0    0    1    0    0    0
auto(s4)      0    0    0    0    0    1    0    0
auto(s5)      0    0    0    0    0    0    1    0
auto(s6)      0    0    0    0    0    0    0    1
manual(m5)    0    0    0    0    0    0    0    0
manual(m6)   -1   -1   -1   -1   -1   -1   -1   -1
           [,9]
auto(l3)      0
auto(av)      0
auto(l4)      0
auto(l5)      0
auto(l6)      0
auto(s4)      0
auto(s5)      0
auto(s6)      0
manual(m5)    1
manual(m6)   -1

对于您所描述的，作为更改参考并具有与该参考相关的其他级别，它应该是 contr.treatment。

重新调整因子和带有效果编码的 glm

Relevel factor and glm with effect coding

r

glm

coefficients

dummy-variable