如何从 R 中具有分类处理的回归模型中获取均值比率的标准误差?
How to get standard errors for ratios of means from a regression model with categorical treatments in R?
我有一些来自间作实验的示例数据,其中包括两种作物 A 和 B 的单作产量和相互间作的产量。在这种情况下,单一栽培的作物表示为与自身的间作作物,即
作物 A 与作物 A = 单一栽培;作物 B 与作物 B = 单一栽培;作物 A 与作物 B = 间作;作物 B 与作物 B = 间作。
我想获得土地当量比的平均值和标准误差,定义为:
LER =(作物A与B间作的产量)/(单作作物A的产量)+(作物A的产量
与B间作)/(单作作物A的产量)
有人建议我使用 SE 的 delta 方法,所以我想我想使用 msm
中的 deltamethod()
函数(但不太确定)
这是数据(更多物种的更大实验的子集):
structure(list(site = c("Two", "Two", "Two", "Two", "Two", "Two",
"Two", "Two", "Two", "Two", "Two", "Two", "One", "One", "One",
"One", "One", "One", "One", "One", "One", "One", "One", "One",
"Three", "Three", "Three", "Three", "Three", "Three", "Three",
"Three", "Three", "Three", "Three", "Three"), siteblock = c("Two_2",
"Two_1", "Two_3", "Two_1", "Two_3", "Two_2", "Two_1", "Two_3",
"Two_2", "Two_2", "Two_3", "Two_1", "One_2", "One_1", "One_3",
"One_3", "One_1", "One_2", "One_3", "One_1", "One_2", "One_2",
"One_3", "One_1", "Three_1", "Three_2", "Three_3", "Three_1",
"Three_2", "Three_3", "Three_1", "Three_2", "Three_3", "Three_1",
"Three_2", "Three_3"), species = c("A", "A", "A", "A", "A", "A",
"B", "B", "B", "B", "B", "B", "A", "A", "A", "A", "A", "A", "B",
"B", "B", "B", "B", "B", "A", "A", "A", "A", "A", "A", "B", "B",
"B", "B", "B", "B"), pair = c("A", "A", "A", "B", "B", "B", "A",
"A", "A", "B", "B", "B", "A", "A", "A", "B", "B", "B", "A", "A",
"A", "B", "B", "B", "A", "A", "A", "B", "B", "B", "A", "A", "A",
"B", "B", "B"), yield.pp = c(118.6648257, 80.29858998, 90.53153963,
48.64188012, 140.3934164, 103.0518444, 18.23102298, 7.104490919,
22.81524354, 21.07202039, 7.529486987, 18.63650567, 242.2567602,
202.1431331, 185.5609192, 283.144789, 241.1690115, 241.5258056,
23.78876862, 35.87524173, 41.1028137, 18.55380809, 22.46060419,
18.46323056, 242.9551749, 231.387521, 455.9878777, 288.2237713,
156.3390735, 207.4286019, 7.167311238, 34.66607289, 22.41394604,
42.22510313, 38.70415176, 57.86653817)), class = "data.frame", row.names = c(NA,
-36L))
该实验在三个地点的三个区块中重复进行,但为了简单起见,我们暂时忽略它。我创建了一个模型来估计每个组合中每个物种的产量:
mod<-lm(yield.pp~species*pair,data=df)
summary(mod)
然后就可以直接得到每种组合中每种作物的估计均值:
est_grid<-data.frame(expand.grid(species=levels(as.factor(df$species)),pair=levels(as.factor(df$pair))))
est_means<-predict(mod,newdata=est_grid)
est_means
## 1 2 3 4
##205.53182 23.68499 189.99091 27.27905
并计算平均土地当量比:
(est_means[3]/est_means[1])+(est_means[2]/est_means[4])
## 3
##1.792635
所以我的问题是如何获得符合平均比率的SE。我目前被 deltamethod
难住了,因为它似乎只为估计模型参数的比率提供 SE,而不是估计均值的比率 ...
使用来自 emmeans 的 ref_grid 创建参考网格,在这种情况下,它将 return 一个 emmGrid 对象并且具有 vcov 方法,这是 运行 来自车包.
library(car)
library(emmeans)
r <- ref_grid(mod); r
## 'emmGrid' object with variables:
## species = A, B
## pair = A, B
s <- summary(r); s
## species pair prediction SE df
## A A 205.5 23.7 32
## B A 23.7 23.7 32
## A B 190.0 23.7 32
## B B 27.3 23.7 32
m <- with(s, setNames(prediction, paste0(species, pair)))
## AA BA AB BB
## 205.53182 23.68499 189.99091 27.27905
deltaMethod(m, "AB / AA + BA / BB", vcov(r))
## Estimate SE 2.5 % 97.5 %
## AB/AA + BA/BB 1.79264 1.15910 -0.47916 4.0644
如果您只需要比率,emmeans 包可以直接提供。关键是将预测重新网格化为对数尺度:
> rlog <- ref_grid(mod, transform = "log")
或者只使用其他答案中的参考网格:rlog <- regrid(r, "log")
。
无论哪种方式,我们都已转换为参考网格,就好像应用了对数变换一样。使用type = "response"
撤销对数转换:
> confint(rlog, type = "response")
species pair response SE df lower.CL upper.CL
A A 205.5 23.7 32 162.58 260
B A 23.7 23.7 32 3.10 181
A B 190.0 23.7 32 147.43 245
B B 27.3 23.7 32 4.66 160
Confidence level used: 0.95
Intervals are back-transformed from the log scale
这些正是@G.Grothendieck的回答得到的结果。现在做两两比较;由于日志的差异是比率的日志,反向转换过程产生比率:
> pairs(rlog, type = "response", infer = c(TRUE, TRUE), adjust = "none")
contrast ratio SE df lower.CL upper.CL null t.ratio p.value
A A / B A 8.678 8.725 32 1.1194 67.268 1 2.149 0.0393
A A / A B 1.082 0.183 32 0.7659 1.528 1 0.464 0.6460
A A / B B 7.534 6.591 32 1.2682 44.763 1 2.309 0.0276
B A / A B 0.125 0.125 32 0.0160 0.969 1 -2.069 0.0467
B A / B B 0.868 1.148 32 0.0587 12.846 1 -0.107 0.9156
A B / B B 6.965 6.102 32 1.1692 41.487 1 2.215 0.0340
Confidence level used: 0.95
Intervals are back-transformed from the log scale
Tests are performed on the log scale
(我添加了 infer
参数以获取 CI 和测试。)
经过一些花哨的操作,甚至可以回答原始问题:
> con = contrast(rlog, list(`AB/AA` = c(-1,0,1,0), `BA/BB` = c(0,1,0,-1)))
> confint(con, type = "r")
contrast ratio SE df lower.CL upper.CL
AB/AA 0.924 0.157 32 0.6544 1.31
BA/BB 0.868 1.148 32 0.0587 12.85
Confidence level used: 0.95
Intervals are back-transformed from the log scale
> confint(contrast(regrid(con), list(`AB/AA + BA/BB` = c(1,1))))
contrast estimate SE df lower.CL upper.CL
AB/AA + BA/BB 1.79 1.16 32 -0.568 4.15
Confidence level used: 0.95
置信限略有不同,因为我们使用 32 d.f。而 deltaMethod
产生渐近结果(无限 d.f。)
我有一些来自间作实验的示例数据,其中包括两种作物 A 和 B 的单作产量和相互间作的产量。在这种情况下,单一栽培的作物表示为与自身的间作作物,即 作物 A 与作物 A = 单一栽培;作物 B 与作物 B = 单一栽培;作物 A 与作物 B = 间作;作物 B 与作物 B = 间作。
我想获得土地当量比的平均值和标准误差,定义为:
LER =(作物A与B间作的产量)/(单作作物A的产量)+(作物A的产量 与B间作)/(单作作物A的产量)
有人建议我使用 SE 的 delta 方法,所以我想我想使用 msm
中的 deltamethod()
函数(但不太确定)
这是数据(更多物种的更大实验的子集):
structure(list(site = c("Two", "Two", "Two", "Two", "Two", "Two",
"Two", "Two", "Two", "Two", "Two", "Two", "One", "One", "One",
"One", "One", "One", "One", "One", "One", "One", "One", "One",
"Three", "Three", "Three", "Three", "Three", "Three", "Three",
"Three", "Three", "Three", "Three", "Three"), siteblock = c("Two_2",
"Two_1", "Two_3", "Two_1", "Two_3", "Two_2", "Two_1", "Two_3",
"Two_2", "Two_2", "Two_3", "Two_1", "One_2", "One_1", "One_3",
"One_3", "One_1", "One_2", "One_3", "One_1", "One_2", "One_2",
"One_3", "One_1", "Three_1", "Three_2", "Three_3", "Three_1",
"Three_2", "Three_3", "Three_1", "Three_2", "Three_3", "Three_1",
"Three_2", "Three_3"), species = c("A", "A", "A", "A", "A", "A",
"B", "B", "B", "B", "B", "B", "A", "A", "A", "A", "A", "A", "B",
"B", "B", "B", "B", "B", "A", "A", "A", "A", "A", "A", "B", "B",
"B", "B", "B", "B"), pair = c("A", "A", "A", "B", "B", "B", "A",
"A", "A", "B", "B", "B", "A", "A", "A", "B", "B", "B", "A", "A",
"A", "B", "B", "B", "A", "A", "A", "B", "B", "B", "A", "A", "A",
"B", "B", "B"), yield.pp = c(118.6648257, 80.29858998, 90.53153963,
48.64188012, 140.3934164, 103.0518444, 18.23102298, 7.104490919,
22.81524354, 21.07202039, 7.529486987, 18.63650567, 242.2567602,
202.1431331, 185.5609192, 283.144789, 241.1690115, 241.5258056,
23.78876862, 35.87524173, 41.1028137, 18.55380809, 22.46060419,
18.46323056, 242.9551749, 231.387521, 455.9878777, 288.2237713,
156.3390735, 207.4286019, 7.167311238, 34.66607289, 22.41394604,
42.22510313, 38.70415176, 57.86653817)), class = "data.frame", row.names = c(NA,
-36L))
该实验在三个地点的三个区块中重复进行,但为了简单起见,我们暂时忽略它。我创建了一个模型来估计每个组合中每个物种的产量:
mod<-lm(yield.pp~species*pair,data=df)
summary(mod)
然后就可以直接得到每种组合中每种作物的估计均值:
est_grid<-data.frame(expand.grid(species=levels(as.factor(df$species)),pair=levels(as.factor(df$pair))))
est_means<-predict(mod,newdata=est_grid)
est_means
## 1 2 3 4
##205.53182 23.68499 189.99091 27.27905
并计算平均土地当量比:
(est_means[3]/est_means[1])+(est_means[2]/est_means[4])
## 3
##1.792635
所以我的问题是如何获得符合平均比率的SE。我目前被 deltamethod
难住了,因为它似乎只为估计模型参数的比率提供 SE,而不是估计均值的比率 ...
使用来自 emmeans 的 ref_grid 创建参考网格,在这种情况下,它将 return 一个 emmGrid 对象并且具有 vcov 方法,这是 运行 来自车包.
library(car)
library(emmeans)
r <- ref_grid(mod); r
## 'emmGrid' object with variables:
## species = A, B
## pair = A, B
s <- summary(r); s
## species pair prediction SE df
## A A 205.5 23.7 32
## B A 23.7 23.7 32
## A B 190.0 23.7 32
## B B 27.3 23.7 32
m <- with(s, setNames(prediction, paste0(species, pair)))
## AA BA AB BB
## 205.53182 23.68499 189.99091 27.27905
deltaMethod(m, "AB / AA + BA / BB", vcov(r))
## Estimate SE 2.5 % 97.5 %
## AB/AA + BA/BB 1.79264 1.15910 -0.47916 4.0644
如果您只需要比率,emmeans 包可以直接提供。关键是将预测重新网格化为对数尺度:
> rlog <- ref_grid(mod, transform = "log")
或者只使用其他答案中的参考网格:rlog <- regrid(r, "log")
。
无论哪种方式,我们都已转换为参考网格,就好像应用了对数变换一样。使用type = "response"
撤销对数转换:
> confint(rlog, type = "response")
species pair response SE df lower.CL upper.CL
A A 205.5 23.7 32 162.58 260
B A 23.7 23.7 32 3.10 181
A B 190.0 23.7 32 147.43 245
B B 27.3 23.7 32 4.66 160
Confidence level used: 0.95
Intervals are back-transformed from the log scale
这些正是@G.Grothendieck的回答得到的结果。现在做两两比较;由于日志的差异是比率的日志,反向转换过程产生比率:
> pairs(rlog, type = "response", infer = c(TRUE, TRUE), adjust = "none")
contrast ratio SE df lower.CL upper.CL null t.ratio p.value
A A / B A 8.678 8.725 32 1.1194 67.268 1 2.149 0.0393
A A / A B 1.082 0.183 32 0.7659 1.528 1 0.464 0.6460
A A / B B 7.534 6.591 32 1.2682 44.763 1 2.309 0.0276
B A / A B 0.125 0.125 32 0.0160 0.969 1 -2.069 0.0467
B A / B B 0.868 1.148 32 0.0587 12.846 1 -0.107 0.9156
A B / B B 6.965 6.102 32 1.1692 41.487 1 2.215 0.0340
Confidence level used: 0.95
Intervals are back-transformed from the log scale
Tests are performed on the log scale
(我添加了 infer
参数以获取 CI 和测试。)
经过一些花哨的操作,甚至可以回答原始问题:
> con = contrast(rlog, list(`AB/AA` = c(-1,0,1,0), `BA/BB` = c(0,1,0,-1)))
> confint(con, type = "r")
contrast ratio SE df lower.CL upper.CL
AB/AA 0.924 0.157 32 0.6544 1.31
BA/BB 0.868 1.148 32 0.0587 12.85
Confidence level used: 0.95
Intervals are back-transformed from the log scale
> confint(contrast(regrid(con), list(`AB/AA + BA/BB` = c(1,1))))
contrast estimate SE df lower.CL upper.CL
AB/AA + BA/BB 1.79 1.16 32 -0.568 4.15
Confidence level used: 0.95
置信限略有不同,因为我们使用 32 d.f。而 deltaMethod
产生渐近结果(无限 d.f。)