"gp" 更平滑的 GAM:在新位置进行预测
GAM with "gp" smoother: predict at new locations
I am using the following geoadditive model:
library(gamair)
library(mgcv)
data(mack)
mack$log.net.area <- log(mack$net.area)
gm2 <- gam(egg.count ~ s(lon,lat,bs="gp",k=100,m=c(2,10,1)) +
s(I(b.depth^.5)) +
s(c.dist) +
s(temp.20m) +
offset(log.net.area),
data = mack, family = tw, method = "REML")
How can I use it to predict egg.count at new locations (lon/lat) where I have no covariate data, as in kriging?
For example, I want to predict egg.count at these new locations:
lon lat
1 -3.00 44
4 -2.75 44
7 -2.50 44
10 -2.25 44
13 -2.00 44
16 -1.75 44
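But here I do not know the values of the covariates (b.depth, c.dist, temp.20m, log.net.area).
predict still requires every variable used in the model to be present in newdata, but you can pass arbitrary values (e.g. 0s) to the covariates you don't have, then use type = "terms" and terms = name_of_the_wanted_smooth_term. Use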
sapply(gm2$smooth, "[[", "label")
#[1] "s(lon,lat)" "s(I(b.depth^0.5))" "s(c.dist)"
#[4] "s(temp.20m)"
to check which smooth terms are in the model.
## new spatial locations to predict
newdat <- read.table(text = "lon lat
1 -3.00 44
4 -2.75 44
7 -2.50 44
10 -2.25 44
13 -2.00 44
16 -1.75 44")
## "garbage" values, just to pass the variable names checking in `predict.gam`
newdat[c("b.depth", "c.dist", "temp.20m", "log.net.area")] <- 0
## prediction on the link scale
pred_link <- predict(gm2, newdata = newdat, type = "terms", terms = "s(lon,lat)")
# s(lon,lat)
#1 -1.9881967
#4 -1.9137971
#7 -1.6365945
#10 -1.1247837
#13 -0.7910023
#16 -0.7234683
#attr(,"constant")
#(Intercept)
# 2.553535
## simplify to vector
pred_link <- attr(pred_link, "constant") + rowSums(pred_link)
#[1] 0.5653381 0.6397377 0.9169403 1.4287511 1.7625325 1.8300665
## prediction on the response scale
pred_response <- gm2$family$linkinv(pred_link)
#[1] 1.760043 1.895983 2.501625 4.173484 5.827176 6.234301
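The steps above can be wrapped in a small helper. This is only a sketch on simulated stand-in data (not the mack data set); the function name predict_one_term and its fill argument are hypothetical, and it relies on the pred.formula component described in ?gamObject to find the variables that predict.gam will look for.

```r
library(mgcv)

## Simulated stand-in data (hypothetical; NOT the mack data set)
set.seed(1)
n   <- 300
dat <- data.frame(x = runif(n), z = runif(n), w = runif(n))
dat$y <- rpois(n, exp(1 + sin(2 * pi * dat$x) * dat$z + dat$w^2))
fit <- gam(y ~ s(x, z) + s(w), data = dat, family = poisson, method = "REML")

## Hypothetical helper: intercept + one smooth term, on the link scale.
## Variables absent from `newdata` are filled with `fill`, purely to pass
## the variable-name check in predict.gam; they do not affect the result.
predict_one_term <- function(model, newdata, term, fill = 0) {
  needed <- all.vars(model$pred.formula)       # variables needed for prediction
  absent <- setdiff(needed, names(newdata))
  for (v in absent) newdata[[v]] <- fill       # "garbage" placeholder values
  p <- predict(model, newdata = newdata, type = "terms", terms = term)
  unname(drop(attr(p, "constant")) + rowSums(p))
}

nd <- data.frame(x = c(0.2, 0.5), z = c(0.3, 0.7))   # no `w` supplied
eta <- predict_one_term(fit, nd, "s(x,z)")           # link scale
mu  <- fit$family$linkinv(eta)                       # response scale
```

Calling the helper with a different fill value should give the same numbers, since the filled-in covariates only feed terms that are dropped afterwards.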
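If I want to predict a specific smooth term, I usually don't use predict.gam. The logic of predict.gam is to first make predictions for all terms, which is what you get with type = "terms". Then
- if type = "link", it does a rowSums over all the term-wise predictions and adds the intercept (and any offset);
- if type = "terms" and neither "terms" nor "exclude" is specified, it returns the result as is;
- if type = "terms" and you specified "terms" and/or "exclude", some post-processing drops the terms you don't want and gives you only those you do.
So predict.gam always does the computation for all terms, even if you only want one.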
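The relationship between the prediction types can be checked numerically. The following is a minimal sketch on simulated stand-in data (not the mack data set); the model deliberately has no offset, so the link-scale prediction should be just the intercept plus the row sums of the term-wise predictions.

```r
library(mgcv)

## Simulated stand-in data (hypothetical; NOT the mack data set)
set.seed(2)
n   <- 300
dat <- data.frame(x = runif(n), z = runif(n), w = runif(n))
dat$y <- rpois(n, exp(1 + sin(2 * pi * dat$x) * dat$z + dat$w^2))
fit <- gam(y ~ s(x, z) + s(w), data = dat, family = poisson, method = "REML")

nd <- data.frame(x = c(0.2, 0.5, 0.8), z = c(0.3, 0.7, 0.1), w = c(0.4, 0.4, 0.9))

## All term-wise predictions; the intercept is stored in attribute "constant"
by_term <- predict(fit, nd, type = "terms")

## type = "link" rebuilt by hand: intercept + row sums of the terms
link_manual <- as.numeric(attr(by_term, "constant") + rowSums(by_term))
link_direct <- as.numeric(predict(fit, nd, type = "link"))
all.equal(link_manual, link_direct)          # expected to be TRUE

## Asking for one term returns the same column as in the full matrix
one_term     <- predict(fit, nd, type = "terms", terms = "s(x,z)")
only_spatial <- as.numeric(one_term[, "s(x,z)"])
all.equal(only_spatial, as.numeric(by_term[, "s(x,z)"]))   # expected to be TRUE
```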
Knowing this inefficiency, here is what I would do:
sm <- gm2$smooth[[1]] ## extract smooth construction info for `s(lon,lat)`
Xp <- PredictMat(sm, newdat) ## predictor matrix
b <- gm2$coefficients[with(sm, first.para:last.para)] ## coefficients for this term
pred_link <- c(Xp %*% b) + gm2$coef[[1]] ## this term + intercept
#[1] 0.5653381 0.6397377 0.9169403 1.4287511 1.7625325 1.8300665
pred_response <- gm2$family$linkinv(pred_link)
#[1] 1.760043 1.895983 2.501625 4.173484 5.827176 6.234301
As you can see, we get the same result.
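For readers who want to try the PredictMat route without the mack data, here is a self-contained sketch on simulated stand-in data. Looking the smooth up by its label rather than by position is my own choice for illustration; it avoids relying on the term being first in the formula.

```r
library(mgcv)

## Simulated stand-in data (hypothetical; NOT the mack data set)
set.seed(3)
n   <- 300
dat <- data.frame(x = runif(n), z = runif(n), w = runif(n))
dat$y <- rpois(n, exp(1 + sin(2 * pi * dat$x) * dat$z + dat$w^2))
fit <- gam(y ~ s(x, z) + s(w), data = dat, family = poisson, method = "REML")

## New locations: only the variables of the wanted smooth are needed here
nd <- data.frame(x = c(0.2, 0.5, 0.8), z = c(0.3, 0.7, 0.1))

## Locate the smooth by its label instead of assuming it is smooth number 1
labs <- sapply(fit$smooth, "[[", "label")
sm   <- fit$smooth[[which(labs == "s(x,z)")]]

Xp <- PredictMat(sm, nd)                         # prediction matrix for this term
b  <- coef(fit)[sm$first.para:sm$last.para]      # coefficients of this term
term_manual <- as.numeric(Xp %*% b)

## Same term via predict.gam, which needs a placeholder value for `w`
nd_full   <- nd
nd_full$w <- 0
term_predict <- as.numeric(
  predict(fit, nd_full, type = "terms", terms = "s(x,z)")[, "s(x,z)"]
)
all.equal(term_manual, term_predict)             # expected to be TRUE

## Intercept + spatial term, then back to the response scale
pred_response <- fit$family$linkinv(term_manual + coef(fit)[["(Intercept)"]])
```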
Won't the result depend on the values assigned to the covariates (here 0)?
Some garbage predictions will be computed from those garbage values, but predict.gam discards them in the end.
Thanks, you are right. I am not totally sure I understand why, then, there is the option to supply the covariate values at new locations.
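I think code maintenance is very hard for a package as large as mgcv. Making it suit every user's needs would require substantial changes to the code. Clearly, the predict.gam logic I describe here is inefficient when someone like you only wants to predict one particular smooth. In theory, in that case the variable-name check on newdata could ignore the terms the user does not want. But that would require major changes to predict.gam and could introduce many bugs. Besides, you have to submit a changelog to CRAN, and CRAN may not be happy to see such drastic changes.
Simon once shared his feelings: many people tell me I should write mgcv this way or that way, but I simply cannot. So yes, some sympathy for package authors/maintainers like him.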
Thanks for the updated answer. However, I don't understand why the predictions don't depend on the values of the covariates at the new locations.
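It depends on whether you supply covariate values for b.depth, c.dist, temp.20m and log.net.area. But since you don't have them at the new locations, the prediction simply assumes these effects are 0.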
OK thanks I see now! So would it be correct to say that in the absence of covariate values at new locations I am only predicting the response from the spatial autocorrelation of the residuals?
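You are only predicting the spatial field / smooth. In the GAM approach the spatial field is modelled as part of the mean, not of the variance-covariance (as in kriging), so I don't think "residuals" is the right word here.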
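Written out, and assuming the default log link of the tw family (the symbols below are my own notation, not from the thread), the fitted model and the quantity predicted above are roughly:

```latex
% Full model for observation i (linear predictor on the log scale)
\log \mu_i = \beta_0
  + f_1(\mathtt{lon}_i, \mathtt{lat}_i)
  + f_2\!\left(\sqrt{\mathtt{b.depth}_i}\right)
  + f_3(\mathtt{c.dist}_i)
  + f_4(\mathtt{temp.20m}_i)
  + \log(\mathtt{net.area}_i)

% What is predicted at a new location s = (lon, lat):
% only the intercept and the spatial smooth, all other contributions set to 0
\hat{\eta}(s) = \hat{\beta}_0 + \hat{f}_1(s),
\qquad
\hat{\mu}(s) = \exp\{\hat{\eta}(s)\}
```

Setting the offset contribution to 0 is the same as taking net.area = 1, so the response-scale values are best read as a relative spatial pattern rather than as egg counts for any particular net area.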
Yes, you are right. Just to understand what this code does: would it be correct to say that I am predicting how the response changes over space but not its actual values at the new locations (since for that I would need the values of the covariates at these locations)?
Correct. Try predict.gam with and without terms = "s(lon,lat)" to help you digest the output, and see how it changes when you vary the garbage values passed to the other covariates.
## a possible set of garbage values for covariates
newdat[c("b.depth", "c.dist", "temp.20m", "log.net.area")] <- 0
predict(gm2, newdat, type = "terms")
# s(lon,lat) s(I(b.depth^0.5)) s(c.dist) s(temp.20m)
#1 -1.9881967 -1.05514 0.4739174 -1.466549
#4 -1.9137971 -1.05514 0.4739174 -1.466549
#7 -1.6365945 -1.05514 0.4739174 -1.466549
#10 -1.1247837 -1.05514 0.4739174 -1.466549
#13 -0.7910023 -1.05514 0.4739174 -1.466549
#16 -0.7234683 -1.05514 0.4739174 -1.466549
#attr(,"constant")
#(Intercept)
# 2.553535
predict(gm2, newdat, type = "terms", terms = "s(lon,lat)")
# s(lon,lat)
#1 -1.9881967
#4 -1.9137971
#7 -1.6365945
#10 -1.1247837
#13 -0.7910023
#16 -0.7234683
#attr(,"constant")
#(Intercept)
# 2.553535
## another possible set of garbage values for covariates
newdat[c("b.depth", "c.dist", "temp.20m", "log.net.area")] <- 1
predict(gm2, newdat, type = "terms")
# s(lon,lat) s(I(b.depth^0.5)) s(c.dist) s(temp.20m)
#1 -1.9881967 -0.9858522 -0.3749018 -1.269878
#4 -1.9137971 -0.9858522 -0.3749018 -1.269878
#7 -1.6365945 -0.9858522 -0.3749018 -1.269878
#10 -1.1247837 -0.9858522 -0.3749018 -1.269878
#13 -0.7910023 -0.9858522 -0.3749018 -1.269878
#16 -0.7234683 -0.9858522 -0.3749018 -1.269878
#attr(,"constant")
#(Intercept)
# 2.553535
predict(gm2, newdat, type = "terms", terms = "s(lon,lat)")
# s(lon,lat)
#1 -1.9881967
#4 -1.9137971
#7 -1.6365945
#10 -1.1247837
#13 -0.7910023
#16 -0.7234683
#attr(,"constant")
#(Intercept)
# 2.553535