带有“lm”的线性模型:如何获得预测值总和的预测方差
linear model with `lm`: how to get prediction variance of sum of predicted values
我正在对具有多个预测变量的线性模型的预测值求和,如下例所示,我想计算该总和的组合方差、标准误差和可能的置信区间。
lm.tree <- lm(Volume ~ poly(Girth,2), data = trees)
假设我有一组Girths
:
newdat <- list(Girth = c(10,12,14,16)
我要预测的总数 Volume
:
pr <- predict(lm.tree, newdat, se.fit = TRUE)
total <- sum(pr$fit)
# [1] 111.512
如何获得 total
的方差?
类似的问题是 here (for GAMs),但我不确定如何继续 vcov(lm.trees)
。如果能提供方法参考,我将不胜感激。
你需要得到完整的方差-协方差矩阵,然后对它的所有元素求和。这里是小证明:
这里的证明使用了另一个定理,你可以从Covariance-wikipedia:
中找到
具体来说,我们做的线性变换是一个全1的列矩阵。计算得到的二次形式 as following,所有 x_i
和 x_j
均为 1。
设置
## your model
lm.tree <- lm(Volume ~ poly(Girth, 2), data = trees)
## newdata (a data frame)
newdat <- data.frame(Girth = c(10, 12, 14, 16))
重新实现predict.lm
以计算方差-协方差矩阵
请参阅 了解 predict.lm
的工作原理。下面的小函数 lm_predict
模仿它的作用,除了
- 它不构造置信度或预测区间(但构造非常简单,如该问答中所述);
- 如果
diag = FALSE
; 可以计算预测值的完整方差-协方差矩阵
- 它returns方差(对于预测值和残差),不是标准误差;
- 做不到
type = "terms"
;它只预测响应变量。
lm_predict <- function (lmObject, newdata, diag = TRUE) {
## input checking
if (!inherits(lmObject, "lm")) stop("'lmObject' is not a valid 'lm' object!")
## extract "terms" object from the fitted model, but delete response variable
tm <- delete.response(terms(lmObject))
## linear predictor matrix
Xp <- model.matrix(tm, newdata)
## predicted values by direct matrix-vector multiplication
pred <- c(Xp %*% coef(lmObject))
## efficiently form the complete variance-covariance matrix
QR <- lmObject$qr ## qr object of fitted model
piv <- QR$pivot ## pivoting index
r <- QR$rank ## model rank / numeric rank
if (is.unsorted(piv)) {
## pivoting has been done
B <- forwardsolve(t(QR$qr), t(Xp[, piv]), r)
} else {
## no pivoting is done
B <- forwardsolve(t(QR$qr), t(Xp), r)
}
## residual variance
sig2 <- c(crossprod(residuals(lmObject))) / df.residual(lmObject)
if (diag) {
## return point-wise prediction variance
VCOV <- colSums(B ^ 2) * sig2
} else {
## return full variance-covariance matrix of predicted values
VCOV <- crossprod(B) * sig2
}
list(fit = pred, var.fit = VCOV, df = lmObject$df.residual, residual.var = sig2)
}
我们可以将其输出与predict.lm
的输出进行比较:
predict.lm(lm.tree, newdat, se.fit = TRUE)
#$fit
# 1 2 3 4
#15.31863 22.33400 31.38568 42.47365
#
#$se.fit
# 1 2 3 4
#0.9435197 0.7327569 0.8550646 0.8852284
#
#$df
#[1] 28
#
#$residual.scale
#[1] 3.334785
lm_predict(lm.tree, newdat)
#$fit
#[1] 15.31863 22.33400 31.38568 42.47365
#
#$var.fit ## the square of `se.fit`
#[1] 0.8902294 0.5369327 0.7311355 0.7836294
#
#$df
#[1] 28
#
#$residual.var ## the square of `residual.scale`
#[1] 11.12079
特别是:
oo <- lm_predict(lm.tree, newdat, FALSE)
oo
#$fit
#[1] 15.31863 22.33400 31.38568 42.47365
#
#$var.fit
# [,1] [,2] [,3] [,4]
#[1,] 0.89022938 0.3846809 0.04967582 -0.1147858
#[2,] 0.38468089 0.5369327 0.52828797 0.3587467
#[3,] 0.04967582 0.5282880 0.73113553 0.6582185
#[4,] -0.11478583 0.3587467 0.65821848 0.7836294
#
#$df
#[1] 28
#
#$residual.var
#[1] 11.12079
请注意,方差-协方差矩阵不是以朴素的方式计算的:Xp %*% vcov(lmObject) % t(Xp)
,这很慢。
聚合(总和)
在您的例子中,聚合操作是 oo$fit
中所有值的总和。这个聚合的均值和方差是
sum_mean <- sum(oo$fit) ## mean of the sum
# 111.512
sum_variance <- sum(oo$var.fit) ## variance of the sum
# 6.671575
您可以通过使用模型中的 t 分布和残差自由度进一步构造此聚合值的置信区间 (CI)。
alpha <- 0.95
Qt <- c(-1, 1) * qt((1 - alpha) / 2, lm.tree$df.residual, lower.tail = FALSE)
#[1] -2.048407 2.048407
## %95 CI
sum_mean + Qt * sqrt(sum_variance)
#[1] 106.2210 116.8029
构建预测区间 (PI) 需要进一步考虑残差。
## adjusted variance-covariance matrix
VCOV_adj <- with(oo, var.fit + diag(residual.var, nrow(var.fit)))
## adjusted variance for the aggregation
sum_variance_adj <- sum(VCOV_adj) ## adjusted variance of the sum
## 95% PI
sum_mean + Qt * sqrt(sum_variance_adj)
#[1] 96.86122 126.16268
聚合(一般)
一般的聚合操作可以是oo$fit
:
的线性组合
w[1] * fit[1] + w[2] * fit[2] + w[3] * fit[3] + ...
例如求和运算的权重全部为1;均值操作的所有权重均为 0.25(在 4 个数据的情况下)。这是一个函数,它采用权重向量、显着性水平以及 lm_predict
返回的内容来生成聚合统计信息。
agg_pred <- function (w, predObject, alpha = 0.95) {
## input checing
if (length(w) != length(predObject$fit)) stop("'w' has wrong length!")
if (!is.matrix(predObject$var.fit)) stop("'predObject' has no variance-covariance matrix!")
## mean of the aggregation
agg_mean <- c(crossprod(predObject$fit, w))
## variance of the aggregation
agg_variance <- c(crossprod(w, predObject$var.fit %*% w))
## adjusted variance-covariance matrix
VCOV_adj <- with(predObject, var.fit + diag(residual.var, nrow(var.fit)))
## adjusted variance of the aggregation
agg_variance_adj <- c(crossprod(w, VCOV_adj %*% w))
## t-distribution quantiles
Qt <- c(-1, 1) * qt((1 - alpha) / 2, predObject$df, lower.tail = FALSE)
## names of CI and PI
NAME <- c("lower", "upper")
## CI
CI <- setNames(agg_mean + Qt * sqrt(agg_variance), NAME)
## PI
PI <- setNames(agg_mean + Qt * sqrt(agg_variance_adj), NAME)
## return
list(mean = agg_mean, var = agg_variance, CI = CI, PI = PI)
}
快速测试前面的求和运算:
agg_pred(rep(1, length(oo$fit)), oo)
#$mean
#[1] 111.512
#
#$var
#[1] 6.671575
#
#$CI
# lower upper
#106.2210 116.8029
#
#$PI
# lower upper
# 96.86122 126.16268
以及平均操作的快速测试:
agg_pred(rep(1, length(oo$fit)) / length(oo$fit), oo)
#$mean
#[1] 27.87799
#
#$var
#[1] 0.4169734
#
#$CI
# lower upper
#26.55526 29.20072
#
#$PI
# lower upper
#24.21531 31.54067
备注
此答案已改进,可为 .
提供易于使用的功能
升级(大数据)
This is great! Thank you so much! There is one thing I forgot to mention: in my actual application I need to sum ~300,000 predictions which would create a full variance-covariance matrix which is about ~700GB in size. Do you have any idea if there is a computationally more efficient way to directly get to the sum of the variance-covariance matrix?
感谢 的 OP 提供的非常有用的评论。是的,这是可能的,而且它在计算上也(显着)便宜。目前,lm_predict
形成这样的方差-协方差:
agg_pred
将预测方差(用于构造 CI)计算为二次形式:w'(B'B)w
,并将预测方差(用于构造 PI)计算为另一个二次形式 w'(B'B + D)w
,其中 D
是残差方差的对角矩阵。显然,如果我们融合这两个函数,我们就有了更好的计算策略:
避免B
和B'B
的计算;我们已将所有矩阵-矩阵乘法替换为矩阵-向量乘法。 B
和B'B
没有内存存储;仅适用于 u
这只是一个向量。这是融合的实现。
## this function requires neither `lm_predict` nor `agg_pred`
fast_agg_pred <- function (w, lmObject, newdata, alpha = 0.95) {
## input checking
if (!inherits(lmObject, "lm")) stop("'lmObject' is not a valid 'lm' object!")
if (!is.data.frame(newdata)) newdata <- as.data.frame(newdata)
if (length(w) != nrow(newdata)) stop("length(w) does not match nrow(newdata)")
## extract "terms" object from the fitted model, but delete response variable
tm <- delete.response(terms(lmObject))
## linear predictor matrix
Xp <- model.matrix(tm, newdata)
## predicted values by direct matrix-vector multiplication
pred <- c(Xp %*% coef(lmObject))
## mean of the aggregation
agg_mean <- c(crossprod(pred, w))
## residual variance
sig2 <- c(crossprod(residuals(lmObject))) / df.residual(lmObject)
## efficiently compute variance of the aggregation without matrix-matrix computations
QR <- lmObject$qr ## qr object of fitted model
piv <- QR$pivot ## pivoting index
r <- QR$rank ## model rank / numeric rank
u <- forwardsolve(t(QR$qr), c(crossprod(Xp, w))[piv], r)
agg_variance <- c(crossprod(u)) * sig2
## adjusted variance of the aggregation
agg_variance_adj <- agg_variance + c(crossprod(w)) * sig2
## t-distribution quantiles
Qt <- c(-1, 1) * qt((1 - alpha) / 2, lmObject$df.residual, lower.tail = FALSE)
## names of CI and PI
NAME <- c("lower", "upper")
## CI
CI <- setNames(agg_mean + Qt * sqrt(agg_variance), NAME)
## PI
PI <- setNames(agg_mean + Qt * sqrt(agg_variance_adj), NAME)
## return
list(mean = agg_mean, var = agg_variance, CI = CI, PI = PI)
}
我们来快速测试一下。
## sum opeartion
fast_agg_pred(rep(1, nrow(newdat)), lm.tree, newdat)
#$mean
#[1] 111.512
#
#$var
#[1] 6.671575
#
#$CI
# lower upper
#106.2210 116.8029
#
#$PI
# lower upper
# 96.86122 126.16268
## average operation
fast_agg_pred(rep(1, nrow(newdat)) / nrow(newdat), lm.tree, newdat)
#$mean
#[1] 27.87799
#
#$var
#[1] 0.4169734
#
#$CI
# lower upper
#26.55526 29.20072
#
#$PI
# lower upper
#24.21531 31.54067
是的,答案正确!
我正在对具有多个预测变量的线性模型的预测值求和,如下例所示,我想计算该总和的组合方差、标准误差和可能的置信区间。
lm.tree <- lm(Volume ~ poly(Girth,2), data = trees)
假设我有一组Girths
:
newdat <- list(Girth = c(10,12,14,16)
我要预测的总数 Volume
:
pr <- predict(lm.tree, newdat, se.fit = TRUE)
total <- sum(pr$fit)
# [1] 111.512
如何获得 total
的方差?
类似的问题是 here (for GAMs),但我不确定如何继续 vcov(lm.trees)
。如果能提供方法参考,我将不胜感激。
你需要得到完整的方差-协方差矩阵,然后对它的所有元素求和。这里是小证明:
这里的证明使用了另一个定理,你可以从Covariance-wikipedia:
中找到具体来说,我们做的线性变换是一个全1的列矩阵。计算得到的二次形式 as following,所有 x_i
和 x_j
均为 1。
设置
## your model
lm.tree <- lm(Volume ~ poly(Girth, 2), data = trees)
## newdata (a data frame)
newdat <- data.frame(Girth = c(10, 12, 14, 16))
重新实现predict.lm
以计算方差-协方差矩阵
请参阅 predict.lm
的工作原理。下面的小函数 lm_predict
模仿它的作用,除了
- 它不构造置信度或预测区间(但构造非常简单,如该问答中所述);
- 如果
diag = FALSE
; 可以计算预测值的完整方差-协方差矩阵
- 它returns方差(对于预测值和残差),不是标准误差;
- 做不到
type = "terms"
;它只预测响应变量。
lm_predict <- function (lmObject, newdata, diag = TRUE) {
## input checking
if (!inherits(lmObject, "lm")) stop("'lmObject' is not a valid 'lm' object!")
## extract "terms" object from the fitted model, but delete response variable
tm <- delete.response(terms(lmObject))
## linear predictor matrix
Xp <- model.matrix(tm, newdata)
## predicted values by direct matrix-vector multiplication
pred <- c(Xp %*% coef(lmObject))
## efficiently form the complete variance-covariance matrix
QR <- lmObject$qr ## qr object of fitted model
piv <- QR$pivot ## pivoting index
r <- QR$rank ## model rank / numeric rank
if (is.unsorted(piv)) {
## pivoting has been done
B <- forwardsolve(t(QR$qr), t(Xp[, piv]), r)
} else {
## no pivoting is done
B <- forwardsolve(t(QR$qr), t(Xp), r)
}
## residual variance
sig2 <- c(crossprod(residuals(lmObject))) / df.residual(lmObject)
if (diag) {
## return point-wise prediction variance
VCOV <- colSums(B ^ 2) * sig2
} else {
## return full variance-covariance matrix of predicted values
VCOV <- crossprod(B) * sig2
}
list(fit = pred, var.fit = VCOV, df = lmObject$df.residual, residual.var = sig2)
}
我们可以将其输出与predict.lm
的输出进行比较:
predict.lm(lm.tree, newdat, se.fit = TRUE)
#$fit
# 1 2 3 4
#15.31863 22.33400 31.38568 42.47365
#
#$se.fit
# 1 2 3 4
#0.9435197 0.7327569 0.8550646 0.8852284
#
#$df
#[1] 28
#
#$residual.scale
#[1] 3.334785
lm_predict(lm.tree, newdat)
#$fit
#[1] 15.31863 22.33400 31.38568 42.47365
#
#$var.fit ## the square of `se.fit`
#[1] 0.8902294 0.5369327 0.7311355 0.7836294
#
#$df
#[1] 28
#
#$residual.var ## the square of `residual.scale`
#[1] 11.12079
特别是:
oo <- lm_predict(lm.tree, newdat, FALSE)
oo
#$fit
#[1] 15.31863 22.33400 31.38568 42.47365
#
#$var.fit
# [,1] [,2] [,3] [,4]
#[1,] 0.89022938 0.3846809 0.04967582 -0.1147858
#[2,] 0.38468089 0.5369327 0.52828797 0.3587467
#[3,] 0.04967582 0.5282880 0.73113553 0.6582185
#[4,] -0.11478583 0.3587467 0.65821848 0.7836294
#
#$df
#[1] 28
#
#$residual.var
#[1] 11.12079
请注意,方差-协方差矩阵不是以朴素的方式计算的:Xp %*% vcov(lmObject) % t(Xp)
,这很慢。
聚合(总和)
在您的例子中,聚合操作是 oo$fit
中所有值的总和。这个聚合的均值和方差是
sum_mean <- sum(oo$fit) ## mean of the sum
# 111.512
sum_variance <- sum(oo$var.fit) ## variance of the sum
# 6.671575
您可以通过使用模型中的 t 分布和残差自由度进一步构造此聚合值的置信区间 (CI)。
alpha <- 0.95
Qt <- c(-1, 1) * qt((1 - alpha) / 2, lm.tree$df.residual, lower.tail = FALSE)
#[1] -2.048407 2.048407
## %95 CI
sum_mean + Qt * sqrt(sum_variance)
#[1] 106.2210 116.8029
构建预测区间 (PI) 需要进一步考虑残差。
## adjusted variance-covariance matrix
VCOV_adj <- with(oo, var.fit + diag(residual.var, nrow(var.fit)))
## adjusted variance for the aggregation
sum_variance_adj <- sum(VCOV_adj) ## adjusted variance of the sum
## 95% PI
sum_mean + Qt * sqrt(sum_variance_adj)
#[1] 96.86122 126.16268
聚合(一般)
一般的聚合操作可以是oo$fit
:
w[1] * fit[1] + w[2] * fit[2] + w[3] * fit[3] + ...
例如求和运算的权重全部为1;均值操作的所有权重均为 0.25(在 4 个数据的情况下)。这是一个函数,它采用权重向量、显着性水平以及 lm_predict
返回的内容来生成聚合统计信息。
agg_pred <- function (w, predObject, alpha = 0.95) {
## input checing
if (length(w) != length(predObject$fit)) stop("'w' has wrong length!")
if (!is.matrix(predObject$var.fit)) stop("'predObject' has no variance-covariance matrix!")
## mean of the aggregation
agg_mean <- c(crossprod(predObject$fit, w))
## variance of the aggregation
agg_variance <- c(crossprod(w, predObject$var.fit %*% w))
## adjusted variance-covariance matrix
VCOV_adj <- with(predObject, var.fit + diag(residual.var, nrow(var.fit)))
## adjusted variance of the aggregation
agg_variance_adj <- c(crossprod(w, VCOV_adj %*% w))
## t-distribution quantiles
Qt <- c(-1, 1) * qt((1 - alpha) / 2, predObject$df, lower.tail = FALSE)
## names of CI and PI
NAME <- c("lower", "upper")
## CI
CI <- setNames(agg_mean + Qt * sqrt(agg_variance), NAME)
## PI
PI <- setNames(agg_mean + Qt * sqrt(agg_variance_adj), NAME)
## return
list(mean = agg_mean, var = agg_variance, CI = CI, PI = PI)
}
快速测试前面的求和运算:
agg_pred(rep(1, length(oo$fit)), oo)
#$mean
#[1] 111.512
#
#$var
#[1] 6.671575
#
#$CI
# lower upper
#106.2210 116.8029
#
#$PI
# lower upper
# 96.86122 126.16268
以及平均操作的快速测试:
agg_pred(rep(1, length(oo$fit)) / length(oo$fit), oo)
#$mean
#[1] 27.87799
#
#$var
#[1] 0.4169734
#
#$CI
# lower upper
#26.55526 29.20072
#
#$PI
# lower upper
#24.21531 31.54067
备注
此答案已改进,可为
升级(大数据)
This is great! Thank you so much! There is one thing I forgot to mention: in my actual application I need to sum ~300,000 predictions which would create a full variance-covariance matrix which is about ~700GB in size. Do you have any idea if there is a computationally more efficient way to directly get to the sum of the variance-covariance matrix?
感谢 lm_predict
形成这样的方差-协方差:
agg_pred
将预测方差(用于构造 CI)计算为二次形式:w'(B'B)w
,并将预测方差(用于构造 PI)计算为另一个二次形式 w'(B'B + D)w
,其中 D
是残差方差的对角矩阵。显然,如果我们融合这两个函数,我们就有了更好的计算策略:
避免B
和B'B
的计算;我们已将所有矩阵-矩阵乘法替换为矩阵-向量乘法。 B
和B'B
没有内存存储;仅适用于 u
这只是一个向量。这是融合的实现。
## this function requires neither `lm_predict` nor `agg_pred`
fast_agg_pred <- function (w, lmObject, newdata, alpha = 0.95) {
## input checking
if (!inherits(lmObject, "lm")) stop("'lmObject' is not a valid 'lm' object!")
if (!is.data.frame(newdata)) newdata <- as.data.frame(newdata)
if (length(w) != nrow(newdata)) stop("length(w) does not match nrow(newdata)")
## extract "terms" object from the fitted model, but delete response variable
tm <- delete.response(terms(lmObject))
## linear predictor matrix
Xp <- model.matrix(tm, newdata)
## predicted values by direct matrix-vector multiplication
pred <- c(Xp %*% coef(lmObject))
## mean of the aggregation
agg_mean <- c(crossprod(pred, w))
## residual variance
sig2 <- c(crossprod(residuals(lmObject))) / df.residual(lmObject)
## efficiently compute variance of the aggregation without matrix-matrix computations
QR <- lmObject$qr ## qr object of fitted model
piv <- QR$pivot ## pivoting index
r <- QR$rank ## model rank / numeric rank
u <- forwardsolve(t(QR$qr), c(crossprod(Xp, w))[piv], r)
agg_variance <- c(crossprod(u)) * sig2
## adjusted variance of the aggregation
agg_variance_adj <- agg_variance + c(crossprod(w)) * sig2
## t-distribution quantiles
Qt <- c(-1, 1) * qt((1 - alpha) / 2, lmObject$df.residual, lower.tail = FALSE)
## names of CI and PI
NAME <- c("lower", "upper")
## CI
CI <- setNames(agg_mean + Qt * sqrt(agg_variance), NAME)
## PI
PI <- setNames(agg_mean + Qt * sqrt(agg_variance_adj), NAME)
## return
list(mean = agg_mean, var = agg_variance, CI = CI, PI = PI)
}
我们来快速测试一下。
## sum opeartion
fast_agg_pred(rep(1, nrow(newdat)), lm.tree, newdat)
#$mean
#[1] 111.512
#
#$var
#[1] 6.671575
#
#$CI
# lower upper
#106.2210 116.8029
#
#$PI
# lower upper
# 96.86122 126.16268
## average operation
fast_agg_pred(rep(1, nrow(newdat)) / nrow(newdat), lm.tree, newdat)
#$mean
#[1] 27.87799
#
#$var
#[1] 0.4169734
#
#$CI
# lower upper
#26.55526 29.20072
#
#$PI
# lower upper
#24.21531 31.54067
是的,答案正确!