H2o GLM interact only certain predictors

Question

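I'm interested in creating interaction terms in h2o.glm(), but I don't want to generate all pairwise interactions. For example, in the mtcars dataset, I would like to interact 'mpg' with all the other predictors, such as 'cyl', 'hp', and 'disp', but I don't want those other predictors to interact with each other (so I don't want disp_hp or disp_cyl).

How should I best approach this using the (interactions = interactions_list) parameter in h2o.glm()?

Thanks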

Answer 1

According to ?h2o.glm, the interactions= parameter takes:

A list of predictor column indices to interact. All pairwise combinations will be computed for the list.

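You don't want all pairwise combinations, only specific ones.

Unfortunately, the R H2O API does not provide a formula interface. If it did, an arbitrary set of interactions could be specified programmatically, just as in vanilla R glm.¹

Option 1: Use `beta_constraints`

One solution is to include all pairwise combinations in the model and then suppress the ones you don't want by constraining their betas to 0.

According to the glm docs, beta_constraints= is used to: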

Specify a dataset to use beta constraints. The selected frame is used to constraint the coefficient vector to provide upper and lower bounds. The dataset must contain a names column with valid coefficient names.

According to the H2O Glossary, the format of beta_constraints is:

A data.frame or H2OParsedData object with the columns ["names", "lower_bounds", "upper_bounds", "beta_given"], where each row corresponds to a predictor in the GLM. "names" contains the predictor names, "lower_bounds" and "upper_bounds" are the lower and upper bounds of beta, and "beta_given" is some supplied starting values for beta.

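Now we know how to fill in our beta_constraints data frame, except for how to format the interaction term names. The doc on interactions doesn't tell us what to expect, so let's run an example with interactions through H2O and see what the interaction terms are named.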

library('h2o')
remoteH2O <- h2o.init(ip='xxx.xx.xx.xxx', startH2O=FALSE)

data(mtcars)

df1 <- as.h2o(mtcars, destination_frame = 'demo_mtcars')

target <- 'wt'
predictors <- c('mpg','cyl','hp','disp')

glm1 <- h2o.glm(x = predictors,
                y = target,
                training_frame = 'demo_mtcars',
                model_id = 'demo_glm',
                lambda = 0, # disable regularization, but your use case may vary
                standardize = FALSE, # we want to see the raw parameters, but your use case may vary
                interactions = predictors # create all interactions
                )
print(glm1) # output includes:
# Coefficients: glm coefficients
#        names coefficients
# 1  Intercept     4.336269
# 2    mpg_cyl     0.019558
# 3     mpg_hp     0.000156
# ..

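So it looks like the interaction terms are named in the form v1_v2.

Now let's list all the interaction terms we want to suppress, using setdiff() to exclude the terms we want to keep.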

library(tidyr)
intx_terms_keep <- # see footnote 1 for explanation
  expand.grid(c('mpg'),c('cyl','hp','disp')) %>%
    unite(intx, Var1, Var2, sep='_') %>% unlist()

intx_terms_suppress <- setdiff( # suppress all interactions minus those we wish to keep
                             combn(predictors,2,FUN=paste,collapse='_'), 
                             intx_terms_keep
                            )
constraints <- data.frame(names=intx_terms_suppress, 
                          lower_bounds=0, 
                          upper_bounds=0, 
                          beta_given=0)

glm2 <- h2o.glm(x = predictors,
                y = target,
                training_frame = 'demo_mtcars',
                model_id = 'demo_glm2', # distinct id so the first model is not overwritten
                lambda = 0,
                standardize = FALSE, 
                interactions = predictors, # create all interactions
                beta_constraints = constraints
)
print(glm2) # output includes:
# Coefficients: glm coefficients
#        names coefficients
# 1  Intercept     3.405154
# 2    mpg_cyl    -0.012740
# 3     mpg_hp    -0.000250
# 4   mpg_disp     0.000066
# 5     cyl_hp     0.000000
# 6   cyl_disp     0.000000
# 7    hp_disp     0.000000
# 8        mpg    -0.018981
# 9        cyl     0.168820
# 10      disp     0.004070
# 11        hp     0.000501

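As you can see, only the desired interaction terms have non-zero coefficients. The rest are effectively ignored. However, because they are still terms in the model, they may count toward the degrees of freedom and may affect some metrics (such as adjusted R-squared).

Option 2: Pre-create the interaction terms

As @Darren Cook mentioned, another solution is to pre-create the interactions as variables in the training dataset.

This approach ensures that the unwanted interactions do not count toward the degrees of freedom or affect adjusted R-squared.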
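A minimal sketch of Option 2 in base R, assuming all the predictors involved are numeric (as they are in mtcars), so that an interaction is simply the product of the two columns. The names `partners` and `train` are illustrative and not part of the original answer; the v1_v2 column naming just mirrors what H2O printed above.

```r
data(mtcars)

# Columns to interact with mpg (illustrative choice taken from the question)
partners <- c("cyl", "hp", "disp")

# Keep the target and the main-effect predictors
train <- mtcars[, c("wt", "mpg", partners)]

# Add one product column per wanted interaction: mpg_cyl, mpg_hp, mpg_disp
for (p in partners) {
  train[[paste("mpg", p, sep = "_")]] <- train$mpg * train[[p]]
}

# Every column except the target becomes a predictor
predictors <- setdiff(names(train), "wt")
print(predictors)
```

The resulting data frame could then be uploaded with as.h2o() and passed to h2o.glm() with x = predictors and without the interactions argument, so only the three wanted interaction columns enter the model. For categorical predictors a plain product would not apply, and the columns would have to be built differently.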

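¹ Alternative non-H2O solution for the vanilla `glm` formula interface

In vanilla R glm(), which allows a formula interface, I would use expand.grid to create a string of interaction terms and include it in the formula.

Pass expand.grid two vectors: you want to interact every term in v1 with every term in v2.

To use your example, you want mpg to interact with cyl, hp, and disp: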

library(tidyr)
intx_term_string <- 
  expand.grid(c('mpg'),c('cyl','hp','disp')) %>%
    unite(intx, Var1, Var2, sep=':') %>% apply(2, paste, collapse='+')

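This gives you a string of interaction terms like "mpg:cyl+mpg:hp+mpg:disp", which you can paste onto a string of the other predictors (perhaps using paste with collapse) and convert with as.formula().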
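As a rough end-to-end sketch of that last step, the following builds the formula and fits a plain glm() on mtcars. It uses base paste() instead of expand.grid/unite to keep the example dependency-free. The Gaussian family and the names `main_terms`, `intx_terms`, `fml`, and `fit` are assumptions made for illustration.

```r
data(mtcars)

# Main effects, plus only the interactions that involve mpg
main_terms <- c("mpg", "cyl", "hp", "disp")
intx_terms <- paste("mpg", c("cyl", "hp", "disp"), sep = ":")

# Collapse everything into a right-hand side and convert to a formula
rhs <- paste(c(main_terms, intx_terms), collapse = " + ")
fml <- as.formula(paste("wt ~", rhs))

# Fit an ordinary (non-H2O) GLM; gaussian() is assumed for the continuous target
fit <- glm(fml, data = mtcars, family = gaussian())
print(names(coef(fit)))
```

Because the formula lists only the mpg interactions, terms such as cyl:hp never enter the model, so they cannot affect the degrees of freedom.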
