mgcv_1.8-24: "fREML" or "REML" method of bam() gives wrong explained deviance
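Fitting the same model with bam using methods "fREML" and "REML" gives me close results, but the deviance explained, as returned by summary.gam, is quite different.
With "fREML" it is about 3.5% (not good), while with "REML" it is about 50% (not so bad). How is this possible? Which one is correct?
Unfortunately, I cannot provide a simple reproducible example.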
#######################################
## method = "fREML", discrete = TRUE ##
#######################################
Family: binomial
Link function: logit
Formula:
ObsOrRand ~ s(Var1, k = 3) + s(RandomVar, bs = "re")
Parametric coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.0026 0.2199 -22.75 <2e-16
Approximate significance of smooth terms:
edf Ref.df Chi.sq p-value
s(Var1) 1.00 1.001 17.54 2.82e-05
s(RandomVar) 16.39 19.000 145.03 < 2e-16
R-sq.(adj) = 0.00349 Deviance explained = 3.57%
fREML = 2.8927e+05 Scale est. = 1 n = 312515
########################################
## method = "fREML", discrete = FALSE ##
########################################
Family: binomial
Link function: logit
Formula:
ObsOrRand ~ s(Var1, k = 3) + s(RandomVar, bs = "re")
Parametric coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.8941 0.2207 -22.18 <2e-16
Approximate significance of smooth terms:
edf Ref.df Chi.sq p-value
s(Var1) 1.008 1.016 17.44 3.09e-05
s(RandomVar) 16.390 19.000 144.86 < 2e-16
R-sq.(adj) = 0.00349 Deviance explained = 3.57%
fREML = 3.1556e+05 Scale est. = 1 n = 312515
#####################################################
## method = "REML", discrete method not applicable ##
#####################################################
Family: binomial
Link function: logit
Formula:
ObsOrRand ~ s(Var1, k = 3) + s(RandomVar, bs = "re")
Parametric coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.8928 0.2205 -22.19 <2e-16
Approximate significance of smooth terms:
edf Ref.df Chi.sq p-value
s(Var1) 1.156 1.278 16.57 8.53e-05
s(RandomVar) 16.379 19.000 142.60 < 2e-16
R-sq.(adj) = 0.0035 Deviance explained = 50.8%
-REML = 3.1555e+05 Scale est. = 1 n = 312515
This issue can be traced back to mgcv_1.8-23. Its changelog reads:
* bam extended family extension had introduced a bug in null deviance
computation for Gaussian additive case when using methods other than fREML
or GCV.Cp. Fixed.
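It now turns out that this patch works for the Gaussian case but not for non-Gaussian cases.
Let me first provide a reproducible example, since your question does not have one.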
set.seed(0)
x <- runif(1000)
## the linear predictor is a 3rd degree polynomial
p <- binomial()$linkinv(0.5 + poly(x, 3) %*% rnorm(3) * 20)
## p is well spread out on (0, 1); check `hist(p)`
y <- rbinom(1000, 1, p)
library(mgcv)
#Loading required package: nlme
#This is mgcv 1.8-24. For overview type 'help("mgcv-package")'.
fREML <- bam(y ~ s(x, bs = 'cr', k = 8), family = binomial(), method = "fREML")
REML <- bam(y ~ s(x, bs = 'cr', k = 8), family = binomial(), method = "REML")
GCV <- bam(y ~ s(x, bs = 'cr', k = 8), family = binomial(), method = "GCV.Cp")
## explained.deviance = (null.deviance - deviance) / null.deviance
## so in this example we get negative explained deviance for "REML" method
unlist(REML[c("null.deviance", "deviance")])
#null.deviance deviance
# 181.7107 1107.5241
unlist(fREML[c("null.deviance", "deviance")])
#null.deviance deviance
# 1357.936 1107.524
unlist(GCV[c("null.deviance", "deviance")])
#null.deviance deviance
# 1357.936 1108.108
The null deviance cannot be smaller than the deviance (just as TSS cannot be smaller than RSS), so the "REML" method of bam fails to return the correct null deviance here.
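To make the arithmetic concrete, here is a small sketch that plugs the values printed above into the explained-deviance formula. The helper name `explained_deviance` is only illustrative; it is not an mgcv function.

```r
## Illustrative helper (not part of mgcv):
## explained deviance = (null.deviance - deviance) / null.deviance
explained_deviance <- function(null_dev, dev) (null_dev - dev) / null_dev

## values copied from the output printed above
reml  <- explained_deviance(181.7107, 1107.5241)
freml <- explained_deviance(1357.936, 1107.524)
gcv   <- explained_deviance(1357.936, 1108.108)

round(c(REML = reml, fREML = freml, GCV.Cp = gcv), 3)
## REML comes out negative (about -5.09), which is impossible for a
## sensible null deviance; fREML and GCV.Cp both give about 0.184
```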
I have located the problem at line 1350 of mgcv_1.8-24/R/bam.r:
object$family <- object$fitted.values <- NULL
It should actually be
object$null.deviance <- object$fitted.values <- NULL
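For methods other than "GCV.Cp" and "fREML", bam relies on gam for estimation, after reducing the large n x p model matrix to a p x p matrix (n: number of data; p: number of coefficients). Since this new model matrix has no natural interpretation, many quantities returned by gam should be invalidated (except the estimated smoothing parameters). Simon mistyped null.deviance as family.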
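The following sketch illustrates why quantities computed on the reduced problem are not meaningful. It assumes the reduction is a QR decomposition and covers only an unpenalized Gaussian least-squares fit in base R; it is not mgcv's actual code, and all variable names are made up for the illustration. The coefficients from the p x p problem match the full fit, but a "total sum of squares" computed on the reduced response is unrelated to the one for the real data.

```r
set.seed(1)
n <- 200
X <- cbind(1, runif(n), runif(n))             # n x p model matrix, p = 3
y <- drop(X %*% c(1, 2, -1)) + rnorm(n)       # Gaussian response

qrx <- qr(X)
R <- qr.R(qrx)                                # p x p "reduced model matrix"
f <- qr.qty(qrx, y)[seq_len(ncol(X))]         # first p elements of Q'y

## the same coefficients are obtained from the full and the reduced problem
beta_full <- unname(lm.fit(X, y)$coefficients)
beta_red  <- unname(lm.fit(R, f)$coefficients)
all.equal(beta_full, beta_red)

## but a null-model quantity computed from the reduced data is meaningless
tss_full <- sum((y - mean(y))^2)
tss_red  <- sum((f - mean(f))^2)
c(full = tss_full, reduced = tss_red)
```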
I built a patched version, and it does fix the bug. I will tell Simon to fix it in the next release.
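Until a fixed release is available, the null deviance can be recomputed by hand. The sketch below is one possible workaround, assuming an intercept-only null model with no offset; `null_deviance` is a hypothetical helper, not an mgcv function. It uses only base R and is checked against glm.

```r
## Hypothetical helper (not part of mgcv): null deviance of an
## intercept-only model with no offset. The fitted mean of such a
## model is the weighted mean of the response.
null_deviance <- function(y, family, weights = rep(1, length(y))) {
  y <- as.numeric(y)
  mu0 <- sum(weights * y) / sum(weights)
  sum(family$dev.resids(y, rep(mu0, length(y)), weights))
}

## check against glm on simulated binary data
set.seed(0)
y <- rbinom(500, 1, 0.3)
nd     <- null_deviance(y, binomial())
nd_glm <- glm(y ~ 1, family = binomial())$null.deviance
all.equal(nd, nd_glm)
```

For a bam fit such as REML from the example above, something like `null_deviance(REML$y, REML$family, REML$prior.weights)` could then be compared with the stored `REML$null.deviance`. This assumes the fitted object carries the `y`, `family` and `prior.weights` components described in mgcv's gamObject help page.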