mgcv GAM:“by”参数中有多个变量(平滑变化超过 1 个因子)

mgcv GAM: more than one variable in `by` argument (smooth varying by more than 1 factor)

我需要对不止一个因素的平滑项建模。 by 参数允许我为每个因子水平建模一个平滑模型,但我找不到如何在多个因子上做到这一点。

我尝试了类似于以下的解决方案,但没有成功:

data <- iris
data$factor2 <- rep(c("A", "B"), 75)

mgcv::gam(Sepal.Length ~ s(Petal.Length, by = c(Species, factor2)), data = data)
#> Error in model.frame.default(formula = Sepal.Length ~ 1 + Petal.Length + : variable lengths differ (found for 'c(Species, factor2)')

reprex package (v2.0.0)

于 2021-08-05 创建

欢迎任何帮助!

您可以使用 interaction 函数创建一个具有 6 个级别的交互变量:

data$Sp.by.fac2 <- with(data, interaction(Species,factor2,
                                   drop=TRUE)) # 'drop' addresses Gavin's concern
library(mgcv)
gam(Sepal.Length ~ s(Petal.Length, by = Sp.by.fac2), data = data)

Family: gaussian 
Link function: identity 

Formula:
Sepal.Length ~ s(Petal.Length, by = Sp.by.fac2)

Estimated degrees of freedom:
1.42 3.04 1.00 1.82 2.28 1.00  total = 11.55 

GCV score: 0.1209707 

我在使用 rms+Hmisc 包组合构建有意义的二维样条曲面方面也取得了成功。那些用格子图绘制得很好。

You can make an interaction variable

手动创建交互变量的问题是它使后处理(解释和绘图)过程复杂化。

我尝试在公式中指定两个单独的平滑项,returns 不同的平滑项针对不同的级别组合。但它似乎(毫不奇怪)它没有考虑相互作用,即它计算了 by=Speciesby=factor2.

的“主效应”
set.seed(1)

data <- iris
data$factor2 <- as.factor(sample(c("A", "A", "B"), 150, replace = T))
data$Petal.Length <- sqrt(data$Petal.Length)

m <- mgcv::gam(Sepal.Length ~ s(Petal.Length, by = Species) + s(Petal.Length, by = factor2), data = data)

plot(modelbased::estimate_relation(m, length = 100, preserve_range = FALSE))

reprex package (v2.0.0)

于 2021-08-06 创建

我确实尝试过使用张量积,但它似乎不适合以下因素:

m <- mgcv::gam(Sepal.Length ~ t2(Petal.Length, factor2, by = Species), data = data)
#> Error in smooth.construct.cr.smooth.spec(object$margin[[i]], dat, knt): factor2 has insufficient unique values to support 5 knots: reduce k.

reprex package (v2.0.0)

于 2021-08-06 创建

interaction() 造成的问题之一是它改变了模型的矩阵,这意味着模型数据中包含的一些变量发生了变化:

m <- mgcv::gam(body_mass_g ~ s(flipper_length_mm, by = interaction(species, sex)), data = palmerpenguins::penguins)
head(insight::get_data(m))
#>   body_mass_g flipper_length_mm       species    sex
#> 1        3750               181   Adelie.male   male
#> 2        3800               186 Adelie.female female
#> 3        3250               195 Adelie.female female
#> 5        3450               193 Adelie.female female
#> 6        3650               190   Adelie.male   male
#> 7        3625               181 Adelie.female female

reprex package (v2.0.1)

于 2021-08-06 创建

这可能会导致使用后处理函数时出现一些问题,例如用于可视化。

但是,根据 Gavin 和 IRTFM 的回答,可以通过将变量作为固定效应添加到模型中轻松解决这个问题。

这是一个演示,还说明了两个单独的平滑和交互之间的差异:

library(ggplot2)
#> Warning: package 'ggplot2' was built under R version 4.0.5

set.seed(1)

# Create data
data <- data.frame(x = rep(seq(-10, 10, length.out = 500), 2),
                   fac1 = as.factor(rep(c("A", "B", "C"), length.out = 1000)),
                   fac2 = as.factor(rep(c("X", "Y"), each = 500)))
data$y <- data$x^2 + rnorm(nrow(data), sd = 5)
data$y[data$fac1 == "A"] <- sign(data$x[data$fac1 == "A"]) * data$y[data$fac1 == "A"] + 50
data$y[data$fac1 == "B"] <- datawizard::change_scale(data$y[data$fac1 == "B"]^3, c(-50, 100))
data$y[data$fac2 == "X" & data$fac1 == "C"] <- data$y[data$fac2 == "X" & data$fac1 == "C"] - 100
data$y[data$fac2 == "X" & data$fac1 == "B"] <- datawizard::change_scale(data$y[data$fac2 == "X" & data$fac1 == "B"] ^ 2, c(-50, 100))
data$y[data$fac2 == "X" & data$fac1 == "A"] <- datawizard::change_scale(data$y[data$fac2 == "X" & data$fac1 == "A"] * -3, c(0, 100))

# Real trends
ggplot(data, aes(x = x, y = y, color = fac1, shape = fac2)) + 
  geom_point()

# Two smooths
m <- mgcv::gam(y ~ fac1 * fac2 + s(x, by = fac1) + s(x, by = fac2), data = data)
plot(modelbased::estimate_relation(m, length = 100, preserve_range = F))

# Interaction
m <- mgcv::gam(y ~ fac1 * fac2 + s(x, by = interaction(fac1, fac2)), data = data)
plot(modelbased::estimate_relation(m, length = 100, preserve_range = F))

reprex package (v2.0.1)

于 2021-08-06 创建

最后一个模型设法恢复了每个因素组合的趋势。