Ignoring h2o factor in GLM

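When you one-hot encode a categorical variable, you usually drop one of the resulting columns before modeling. That way you don't end up with a redundant feature that is linearly dependent on the others.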
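To make the redundancy concrete, here is a minimal sketch in plain Python. The `colors` data and the `one_hot` helper are made up for illustration and have nothing to do with H2O's internals. With every level encoded, the indicator columns always sum to 1, which is exactly the intercept column; dropping one level breaks that identity.

```python
# Toy categorical column with three levels (hypothetical data).
colors = ["red", "green", "blue", "green", "red"]
levels = sorted(set(colors))  # ['blue', 'green', 'red']


def one_hot(values, levels):
    """Return one row per value, with a 0/1 indicator column per level."""
    return [[1 if v == lvl else 0 for lvl in levels] for v in values]


# Full encoding: one indicator column for every level.
full = one_hot(colors, levels)

# Each row sums to 1, so the indicator columns add up to the intercept
# column of ones. That is the linear dependence.
row_sums = [sum(row) for row in full]

# Dropping one level (here the first, used as the reference) removes the
# dependence: rows belonging to the dropped level become all zeros.
reduced = one_hot(colors, levels[1:])
reduced_sums = [sum(row) for row in reduced]

print(row_sums)      # [1, 1, 1, 1, 1]
print(reduced_sums)  # [1, 1, 0, 1, 1]
```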

Is there a way to specify which level of a categorical variable should be left out of the fit?

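From the documentation: "We strongly recommend avoiding one-hot encoding categorical columns with any levels into many binary columns, as this is very inefficient. This is especially true for Python users who are used to expanding their categorical variables manually for other frameworks."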

The short answer is "no": you leave that decision to H2O so that it can do it efficiently. The section just after the one you linked to explains why:

When GLM performs regression (with factor columns), one category can be left out to avoid multicollinearity. If regularization is disabled (lambda = 0), then one category is left out. However, when using the default lambda parameter, all categories are included.

The reason for the different behavior with regularization is that collinearity is not a problem with regularization. And it’s better to leave regularization to find out which level to ignore (or how to distribute the coefficients between the levels).
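A rough way to see why regularization removes the collinearity problem, sketched in plain Python. This assumes a simple ridge penalty applied to every coefficient, which is a simplification and not H2O's actual solver; the toy matrix `X` and the penalty `lam` are hypothetical. With an intercept plus all three indicator columns, the Gram matrix X^T X is singular, so the unpenalized normal equations have no unique solution. Adding lam to the diagonal makes the matrix invertible again.

```python
from fractions import Fraction


def gram(X):
    """Return X^T X for a matrix given as a list of rows."""
    cols = list(zip(*X))
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]


def det(M):
    """Determinant by Gaussian elimination with exact rational arithmetic."""
    A = [[Fraction(v) for v in row] for row in M]
    n, d = len(A), Fraction(1)
    for i in range(n):
        # Find a row at or below i with a nonzero entry in column i.
        pivot = next((r for r in range(i, n) if A[r][i] != 0), None)
        if pivot is None:
            return Fraction(0)  # no pivot: the matrix is singular
        if pivot != i:
            A[i], A[pivot] = A[pivot], A[i]
            d = -d  # a row swap flips the sign of the determinant
        d *= A[i][i]
        for r in range(i + 1, n):
            f = A[r][i] / A[i][i]
            A[r] = [a - f * b for a, b in zip(A[r], A[i])]
    return d


# Intercept column plus indicators for all three levels of a toy factor.
X = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
]
G = gram(X)

lam = Fraction(1, 10)  # hypothetical ridge penalty
G_ridge = [[G[i][j] + (lam if i == j else 0) for j in range(4)] for i in range(4)]

det_plain = det(G)        # zero: the columns are linearly dependent
det_ridge = det(G_ridge)  # positive: the penalized system is solvable

print(det_plain == 0, det_ridge > 0)  # True True
```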

By the way, it seems that all the other algorithms allow you to control the categorical encoding: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/categorical_encoding.html