Unexpected behavior of emmeans with respect to model specification and confidence intervals
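My data are integers with many zeros. I want to model the zeros separately using a binomial generalized linear model. In the model statement I specified Y>0 on the left-hand side of the tilde, which gives me a binary (TRUE, FALSE) vector. I then analyzed the data further with the emmeans package, specifying type = "response". At that point I realized (on my actual data) that the confidence intervals did not look right. While troubleshooting, I created a new variable in my data frame that holds the TRUE and FALSE values separately. That solved the problem. Why is this the case?
Below is code that reproduces the behavior (although the effect is less pronounced than in my original dataset):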
require(emmeans)
# example data
d <- structure(list(X = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L
), .Label = c("A", "B", "C", "D"), class = "factor"), Y = c(0L,
4L, 4L, 5L, 6L, 5L, 6L, 7L, 8L, 9L, 0L, 0L, 3L, 4L, 1L, 5L, 2L,
3L, 2L, 1L, 0L, 0L, 0L, 0L, 0L, 12L, 11L, 6L, 8L, 11L, 0L, 0L,
0L, 0L, 0L, 12L, 13L, 11L, 12L, 16L)), class = "data.frame", row.names = c(NA,
-40L))
# add additional variable - set every value > 0 to TRUE, otherwise FALSE
d$no0 <- d$Y>0
Here is the first model, which uses the relational operator > in the model formula:
# binomial GLM using `Y>0` on the left side
m1 <- glm(Y>0 ~ X, family = binomial(), d)
summary(m1)
Call:
glm(formula = Y > 0 ~ X, family = binomial(), data = d)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.1460 -1.1774 0.4590 0.7954 1.1774
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.1972 1.0540 2.085 0.0371 *
XB -0.8109 1.3175 -0.615 0.5382
XC -2.1972 1.2292 -1.788 0.0739 .
XD -2.1972 1.2292 -1.788 0.0739 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 50.446 on 39 degrees of freedom
Residual deviance: 44.236 on 36 degrees of freedom
AIC: 52.236
Number of Fisher Scoring iterations: 4
Here is the second model, which uses the new variable:
# binomial GLM using variable no0
m2 <- glm(no0 ~ X, family = binomial(), d)
summary(m2)
Call:
glm(formula = no0 ~ X, family = binomial(), data = d)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.1460 -1.1774 0.4590 0.7954 1.1774
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.1972 1.0540 2.085 0.0371 *
XB -0.8109 1.3175 -0.615 0.5382
XC -2.1972 1.2292 -1.788 0.0739 .
XD -2.1972 1.2292 -1.788 0.0739 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 50.446 on 39 degrees of freedom
Residual deviance: 44.236 on 36 degrees of freedom
AIC: 52.236
Number of Fisher Scoring iterations: 4
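The equality of the two fits can also be checked programmatically rather than by reading the summaries. The following is a minimal sketch using only base R; it rebuilds the example data in a compact form (same values as the dput() output above, stored as doubles instead of integers) and the names same_coef, same_vcov, lhs1 and lhs2 are introduced here purely for illustration.

```r
# Rebuild the example data compactly (same values as the dput() output above)
d <- data.frame(
  X = factor(rep(c("A", "B", "C", "D"), each = 10)),
  Y = c(0, 4, 4, 5, 6, 5, 6, 7, 8, 9,
        0, 0, 3, 4, 1, 5, 2, 3, 2, 1,
        0, 0, 0, 0, 0, 12, 11, 6, 8, 11,
        0, 0, 0, 0, 0, 12, 13, 11, 12, 16)
)
d$no0 <- d$Y > 0

# Fit both specifications
m1 <- glm(Y > 0 ~ X, family = binomial(), data = d)
m2 <- glm(no0 ~ X, family = binomial(), data = d)

# Coefficients and their covariance matrix agree between the two fits
same_coef <- isTRUE(all.equal(coef(m1), coef(m2)))
same_vcov <- isTRUE(all.equal(vcov(m1), vcov(m2)))

# The only difference is how the response is written in the stored formula
lhs1 <- deparse(formula(m1)[[2]])
lhs2 <- deparse(formula(m2)[[2]])
```

The fitted models are numerically the same; only the left-hand side recorded in the formula differs, and that is the piece of information downstream tools can react to.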
So far the outputs are identical. I then ran emmeans() on model 1 and model 2 without the type = "response" argument:
(em1 <- emmeans(m1, ~ X))
X emmean SE df asymp.LCL asymp.UCL
A 2.20 1.054 Inf 0.131 4.26
B 1.39 0.791 Inf -0.163 2.94
C 0.00 0.632 Inf -1.240 1.24
D 0.00 0.632 Inf -1.240 1.24
Results are given on the logit (not the response) scale.
Confidence level used: 0.95
(em2 <- emmeans(m2, ~ X))
X emmean SE df asymp.LCL asymp.UCL
A 2.20 1.054 Inf 0.131 4.26
B 1.39 0.791 Inf -0.163 2.94
C 0.00 0.632 Inf -1.240 1.24
D 0.00 0.632 Inf -1.240 1.24
Results are given on the logit (not the response) scale.
Confidence level used: 0.95
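Everything is fine. But when I add the type = "response" argument, everything looks right except that the confidence intervals differ (compare the two outputs below):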
(em3 <- emmeans(m1, ~ X, type = "response"))
X response SE df asymp.LCL asymp.UCL
A 0.9 0.0949 Inf 0.714 1.09
B 0.8 0.1265 Inf 0.552 1.05
C 0.5 0.1581 Inf 0.190 0.81
D 0.5 0.1581 Inf 0.190 0.81
Unknown transformation ">": no transformation done
Confidence level used: 0.95
(em4 <- emmeans(m2, ~ X, type = "response"))
X prob SE df asymp.LCL asymp.UCL
A 0.9 0.0949 Inf 0.533 0.986
B 0.8 0.1265 Inf 0.459 0.950
C 0.5 0.1581 Inf 0.225 0.775
D 0.5 0.1581 Inf 0.225 0.775
Confidence level used: 0.95
Intervals are back-transformed from the logit scale
I see the message in the first output (Unknown transformation ">": no transformation done), but why does it affect only the confidence intervals?
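Another interesting observation: when I plot the emmeans objects without the comparisons = T argument in plot(), the plots match the em3 and em4 output above, with their differing confidence intervals: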
library(ggplot2)  # needed for scale_x_continuous() and ggtitle()
p1 <- plot(em3, comparisons = F) + scale_x_continuous(limits = c(0,1.1)) + ggtitle("Y>0 ~.; and comparisons = F")
p2 <- plot(em4, comparisons = F) + scale_x_continuous(limits = c(0,1.1)) + ggtitle("no0 ~.; and comparisons = F")
gridExtra::grid.arrange(p1, p2, nrow = 2)
But when I add the comparisons = T argument, the confidence intervals become identical; however, both now match the intervals from the Y>0 model specification (see m1 and em3):
p3 <- plot(em3, comparisons = T) + scale_x_continuous(limits = c(0,1.1)) + ggtitle("Y>0 ~.; and comparisons = T")
p4 <- plot(em4, comparisons = T) + scale_x_continuous(limits = c(0,1.1))+ ggtitle("no0 ~.; and comparisons = T")
gridExtra::grid.arrange(p3, p4, nrow = 2)
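This is a bit lengthy, but my question boils down to:
Can I use the Y>0 ~ X model specification together with emmeans, or should I create a separate variable for it first?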
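What is happening is that emmeans allows for situations where there is both a response transformation and a link function. This can be handy, for example, when you fit a model with a gamma family, an inverse link, and a square-root response transformation. In this case, however, > is being treated as a response transformation: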
> emm1 <- emmeans(m1, "X")
> str(emm1)
'emmGrid' object with variables:
X = A, B, C, D
Transformation: “logit”
Additional response transformation: “>”
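To see why the two specifications look different to a package even though the fits are identical, it can help to inspect the formulas themselves. The sketch below uses only base R and shows what is visible to any code that examines the left-hand side; how emmeans performs its own detection internally is not shown here and should be treated as an assumption. The names f1, f2, lhs1 and lhs2 are illustrative.

```r
# Formulas are unevaluated, so Y, no0 and X need not exist here
f1 <- Y > 0 ~ X   # parses as (Y > 0) ~ X because ~ binds more loosely than >
f2 <- no0 ~ X

lhs1 <- f1[[2]]   # left-hand side of the first formula
lhs2 <- f2[[2]]   # left-hand side of the second formula

# In f1 the response is a function call whose function is ">"
lhs1_is_call <- is.call(lhs1)
lhs1_fun <- as.character(lhs1[[1]])

# In f2 the response is a bare variable name, not a call
lhs2_is_call <- is.call(lhs2)
lhs2_is_name <- is.name(lhs2)
```

Structurally, Y > 0 on the left-hand side has the same shape as sqrt(Y) or log(Y): a function applied to the response. A bare variable such as no0 carries no such call, so there is nothing that could be mistaken for a response transformation.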
When you specify type = "response", summary.emmGrid() tries to undo both transformations, i.e., it tries to put the results on the scale of Y. You can undo just the link function like this:
> confint(emm1, type = "unlink")
X response SE df asymp.LCL asymp.UCL
A 0.9 0.0949 Inf 0.533 0.986
B 0.8 0.1265 Inf 0.459 0.950
C 0.5 0.1581 Inf 0.225 0.775
D 0.5 0.1581 Inf 0.225 0.775
Confidence level used: 0.95
Intervals are back-transformed from the logit scale
...or remove the second transformation:
> emm1a <- update(emm1, tran2 = NULL)
> confint(emm1a, type = "response")
X response SE df asymp.LCL asymp.UCL
A 0.9 0.0949 Inf 0.533 0.986
B 0.8 0.1265 Inf 0.459 0.950
C 0.5 0.1581 Inf 0.225 0.775
D 0.5 0.1581 Inf 0.225 0.775
Confidence level used: 0.95
Intervals are back-transformed from the logit scale
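In both of these cases, the confidence intervals are computed on the link scale and then back-transformed. The other confidence limits you saw are obtained by reversing those steps: the estimates are back-transformed first, and the intervals are then computed from the standard errors of the back-transformed results: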
> confint(regrid(emm1, transform = "unlink"))
X response SE df asymp.LCL asymp.UCL
A 0.9 0.0949 Inf 0.714 1.09
B 0.8 0.1265 Inf 0.552 1.05
C 0.5 0.1581 Inf 0.190 0.81
D 0.5 0.1581 Inf 0.190 0.81
Results are given on the > (not the response) scale.
Confidence level used: 0.95
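The two sets of limits can be reproduced by hand for group A, which makes the difference between the two orders of operations concrete. This is a minimal base-R sketch, assuming the logit-scale estimate (2.1972) and standard error (1.0540) printed above; the variable names are illustrative, and the second interval uses the usual delta-method standard error p(1 - p) times the logit-scale SE.

```r
eta    <- 2.1972          # logit-scale estimate for group A (from the output above)
se_eta <- 1.0540          # its standard error on the logit scale
z      <- qnorm(0.975)    # normal quantile for a 95% interval (df = Inf)

# (1) Interval on the link scale, then back-transform the endpoints
ci_link <- plogis(eta + c(-1, 1) * z * se_eta)

# (2) Back-transform first, then build a symmetric interval
#     using the delta-method SE on the probability scale
p    <- plogis(eta)
se_p <- p * (1 - p) * se_eta
ci_delta <- p + c(-1, 1) * z * se_p

round(ci_link, 3)    # 0.533 0.986
round(ci_delta, 3)   # 0.714 1.086
```

Approach (1) always stays inside (0, 1) because the endpoints pass through the inverse logit. Approach (2) is symmetric around the estimate and can exceed 1, which is what the 1.09 upper limit in the output reflects.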
I'll consider whether a change can be made to reliably detect when a response transformation is clearly not intended.