R GAM producing different results via group argument

Question

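I have hourly data, 24 hours a day for a full year, split into 7 groups. When I fit a GAM (mgcv::gam), I use the by= argument to produce 7 different fitted lines, and some of the resulting fits look very odd. However, when I subset the data to just one of those groups and run the GAM again, without the by=Group argument, the fit looks much better and makes sense.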

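Here is a toy example in which the difference between the two approaches is not that dramatic; with my real data the difference when using the by= argument is much more pronounced. Why does this happen?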

require(data.table)
require(mgcv)
require(ggplot2)

## create two groups of data, A & B
dtA <- data.table(t = rep(1:12,each=100) , N = c(runif(200, 0.0, 1.0),runif(200, 2.0, 3.0),runif(200, 5.0, 7.0),runif(200, 4.0, 5.0),runif(200, 1.0, 2.0),runif(200, 0.0, 1.0)), Group="A")

dtB <- data.table(t = rep(1:12,each=100) , N = c(runif(200, 20.0, 22.0),runif(200, 14.0, 16.0),runif(200, 6.0, 7.0),runif(200, 5.0, 6.0),runif(200, 12.0, 15.0),runif(200, 17.0, 20.0)), Group="B")

## put the data together, set the group as a factor
dt_gp <- rbindlist(list(dtA,dtB), use.names = TRUE)
dt_gp[, Group := factor(Group, levels=c("A","B"))]

## create the gam , using the by grouping, and then fit to a blank table
gam1 <- gam(N ~ s(t,k=8, bs="cc", by=Group), data = dt_gp)

dt_fit1 <- data.table(t=rep(c(1:12),2), Group=rep(c("A","B"), each=12))
dt_fit1[, Group := factor(Group, levels=c("A","B"))]

fits1 = predict(gam1, newdata=dt_fit1, type='response', se.fit=TRUE)
predicts1 = as.data.table(data.frame(dt_fit1, fits1))

## now subset GpA data and run and recreate GAM and fitted line. 
dt <- dt_gp[Group=="A"]
dt[,Group:=NULL]

gam2   <- gam(N ~ s(t,k=8, bs="cc"), data = dt)

dt_fit2 <- data.table(t=1:12)

fits2 = predict(gam2, newdata=dt_fit2, type='response', se.fit=TRUE)
predicts2 = as.data.table(data.frame(dt_fit2, fits2))

## plot to see difference (add Group to 2nd prediction for facet in plot)
predicts2[,Group:="A"]
ggplot()+
  geom_line(data=predicts1, aes(x=t, y=fit), colour="blue")+
  geom_line(data=predicts2, aes(x=t, y=fit), colour="red")+
  geom_point(data=dt_gp, aes(x=t,y=N), colour="grey50")+
  facet_wrap(~Group, nrow=2, scales="free_y")+
  ggtitle("GAM on numbers grouped by A & B (numbers in A identical in both cases)")+
  theme_bw()+
  theme(axis.text.x = element_text(size=12),
        axis.text.y = element_text(size=12),
        axis.title = element_text(size=16),
        legend.title=element_blank())

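The red line is the fit to the subset data, and the blue line is the fit using by=Group. Doesn't the by grouping in mgcv::gam() separate the data? The more 'different' I make A and B, the worse the blue line matches the raw data points.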

Answer 1

From the documentation for the s function in mgcv:

In the factor by variable case a replicate of the smooth is produced for each factor level (these smooths will be centered, so the factor usually needs to be added as a main effect as well). See gam.models for further details.

So it looks like you also want to include Group in the formula outside the call to s, e.g.,

gam1 <- gam(N ~ Group + s(t,k=8, bs="cc", by=Group), data = dt_gp)
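
To see why the main effect matters, here is a minimal sketch on simulated data. The data, the seed, and the names m_by, m_main, m_A and newA are all made up for illustration and are not from the question. Because each by-smooth is centered, a model without Group as a main effect has only one shared intercept, so it cannot place the two groups at their own levels. With Group added, the predictions for group A should land much closer to those from a model fitted to group A alone.

```r
library(mgcv)

set.seed(1)  # assumption: fixed seed so the sketch is reproducible

# Simulated data (hypothetical): two groups with very different mean levels
t <- rep(1:12, each = 50)
dat <- rbind(
  data.frame(t = t, N = 3 + 2 * sin(2 * pi * t / 12) + rnorm(length(t), sd = 0.3), Group = "A"),
  data.frame(t = t, N = 15 - 5 * sin(2 * pi * t / 12) + rnorm(length(t), sd = 0.3), Group = "B")
)
dat$Group <- factor(dat$Group, levels = c("A", "B"))

# by-smooths only: one shared intercept plus a centered smooth per group
m_by <- gam(N ~ s(t, k = 8, bs = "cc", by = Group), data = dat)

# by-smooths plus Group as a main effect: each group gets its own level
m_main <- gam(N ~ Group + s(t, k = 8, bs = "cc", by = Group), data = dat)

# reference: a separate model fitted to group A only
m_A <- gam(N ~ s(t, k = 8, bs = "cc"), data = subset(dat, Group == "A"))

# predict for group A on a common grid
newA <- data.frame(t = 1:12, Group = factor("A", levels = c("A", "B")))
p_by   <- predict(m_by,   newdata = newA)
p_main <- predict(m_main, newdata = newA)
p_A    <- predict(m_A,    newdata = newA)

# largest absolute gap between each grouped model and the subset model
max_diff_by   <- max(abs(p_by - p_A))
max_diff_main <- max(abs(p_main - p_A))
print(c(without_main_effect = max_diff_by, with_main_effect = max_diff_main))
```

The two fits need not be exactly identical even with the main effect, since the grouped model estimates a single residual variance and selects its smoothing parameters jointly across both groups.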
