mgcv：获得给定新数据的响应的预测分布（负二项式示例）

Question

在 GAM（和 GLM，就此而言）中，我们正在拟合条件似然模型。因此，在拟合模型后，对于新输入 x 和响应 y，我应该能够计算给定 x 的特定值 y 的预测概率或密度.例如，我可能想这样做以比较各种模型对验证数据的拟合度。在 mgcv 中使用合适的 GAM 是否有方便的方法来做到这一点？否则，我如何找出所使用的密度的确切形式，以便我可以适当地插入参数？

作为具体示例，考虑负二项式 GAM：

## From ?negbin
library(mgcv)
set.seed(3)
n<-400
dat <- gamSim(1,n=n)
g <- exp(dat$f/5)

## negative binomial data... 
dat$y <- rnbinom(g,size=3,mu=g)

## fit with theta estimation...
b <- gam(y~s(x0)+s(x1)+s(x2)+s(x3),family=nb(),data=dat)

现在我想计算 y=7 的预测概率，给定 x=(.1,.2,.3,.4)。

Answer 1

是的。 mgcv 正在做（经验）贝叶斯估计，所以你可以获得预测分布。对于您的示例，方法如下。

# prediction on the link (with standard error)
l <- predict(b, newdata = data.frame(x0 = 0.1, x1 = 0.2, x2 = 0.3, x3 = 0.4), se.fit = TRUE)

# Under central limit theory in GLM theory, link value is normally distributed
# for negative binomial with `log` link, the response is log-normal
p.mu <- function (mu) dlnorm(mu, l[[1]], l[[2]])

# joint density of `y` and `mu`
p.y.mu <- function (y, mu) dnbinom(y, size = 3, mu = mu) * p.mu(mu)

# marginal probability (not density as negative binomial is discrete) of `y` (integrating out `mu`)
# I have carefully written this function so it can take vector input
p.y <- function (y) {
  scalar.p.y <- function (scalar.y) integrate(p.y.mu, lower = 0, upper = Inf, y = scalar.y)[[1]]
  sapply(y, scalar.p.y)
  }

现在，由于您想要 y = 7 的概率，以指定的新数据为条件，请使用

p.y(7)
# 0.07810065

总的来说，这种数值积分的方法并不容易。例如，如果 sqrt() 等其他 link 函数用于负二项式，响应的分布就不是那么简单（尽管也不难推导）。

现在我提供一种基于抽样的方法，或 Monte Carlo 方法。这与贝叶斯过程最相似。

N <- 1000  # samples size

set.seed(0)

## draw N samples from posterior of `mu`
sample.mu <- b$family$linkinv(rnorm(N, l[[1]], l[[2]]))

## draw N samples from likelihood `Pr(y|mu)`
sample.y <- rnbinom(1000, size = 3, mu = sample.mu)

## Monte Carlo estimation for `Pr(y = 7)`
mean(sample.y == 7)
# 0.076

备注1

请注意，作为经验贝叶斯，上述所有方法都以估计的平滑参数为条件。如果你想要 "full Bayes" 之类的东西，请在 predict().

中设置 unconditional = TRUE

备注2

也许有些人认为解决方案就这么简单：

mu <- predict(b, newdata = data.frame(x0 = 0.1, x1 = 0.2, x2 = 0.3, x3 = 0.4), type = "response")
dnbinom(7, size = 3, mu = mu)

这样的结果以回归系数为条件（假设固定且没有不确定性），因此 mu 变得固定而不是随机的。这是而非预测分布。预测分布会整合模型估计的不确定性。

mgcv：获得给定新数据的响应的预测分布（负二项式示例）

mgcv: obtain predictive distribution of response given new data (negative binomial example)

statistics

regression

r

gam

mgcv