Ignoring h2o factor in GLM

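When you one-hot encode a categorical variable, you usually drop one of the resulting columns before modeling. That way you don't end up with a redundant feature that is linearly dependent on the others.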
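To make the redundancy concrete, here is a minimal sketch in plain Python. The data and the `one_hot` helper are made up for illustration and are not part of H2O. It shows that the indicator columns for all levels always sum to 1, which duplicates the intercept column, and that dropping one level removes that dependence.

```python
# Toy data: one categorical column with three levels (made-up values).
colors = ["red", "green", "blue", "green", "red", "blue"]
levels = sorted(set(colors))  # ['blue', 'green', 'red']


def one_hot(values, keep_levels):
    """Return one row per value, with a 0/1 indicator for each kept level."""
    return [[1 if v == lvl else 0 for lvl in keep_levels] for v in values]


# Indicators for every level: each row contains exactly one 1,
# so the columns add up to the constant intercept column.
full = one_hot(colors, levels)
row_sums_full = [sum(row) for row in full]

# Drop the first level ("blue") as the reference category.
reduced = one_hot(colors, levels[1:])
row_sums_reduced = [sum(row) for row in reduced]

print(row_sums_full)     # constant across rows: redundant with an intercept
print(row_sums_reduced)  # not constant: the dependence is gone
```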

Is there a way to specify which level of a categorical variable should be left out of the fit?

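From the documentation: "We strongly recommend avoiding one-hot encoding categorical columns with any levels into many binary columns, as this is very inefficient. This is especially true for Python users who are used to expanding their categorical variables manually for other frameworks."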

The short answer is "no": you leave that decision to H2O so that it can handle it efficiently. The section just after the one you linked to explains why:

When GLM performs regression (with factor columns), one category can be left out to avoid multicollinearity. If regularization is disabled (lambda = 0), then one category is left out. However, when using the default lambda parameter, all categories are included.

The reason for the different behavior with regularization is that collinearity is not a problem with regularization. And it’s better to leave regularization to find out which level to ignore (or how to distribute the coefficients between the levels).
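The claim that collinearity stops being a problem once regularization is on can be checked with a small sketch. This is not H2O's implementation: it uses a pure ridge (L2) penalty as a stand-in, toy data, and hypothetical helper names (`rank`, `gram`). With an intercept plus indicators for all three levels, the normal-equations matrix X^T X is rank-deficient. Adding a penalty on the indicator coefficients (leaving the intercept unpenalized) makes it full rank, so the fit has a unique solution.

```python
from fractions import Fraction


def rank(matrix):
    """Rank of a matrix via Gaussian elimination with exact fractions."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    r = 0
    for col in range(len(rows[0])):
        # Find a row at or below position r with a nonzero entry in this column.
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            factor = rows[i][col] / rows[r][col]
            rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r


def gram(X):
    """Compute X^T X for a matrix given as a list of rows."""
    cols = list(zip(*X))
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]


# Columns: intercept, then indicators for all three levels (toy data).
X = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
]
XtX = gram(X)

lam = Fraction(1, 10)  # arbitrary positive penalty strength (assumption)
# Add the penalty to the diagonal for the indicator coefficients only;
# index 0 is the intercept and stays unpenalized.
XtX_ridge = [
    [XtX[i][j] + (lam if i == j and i > 0 else 0) for j in range(4)]
    for i in range(4)
]

rank_plain = rank(XtX)        # less than 4: no unique least-squares solution
rank_ridge = rank(XtX_ridge)  # 4: the penalized problem is well posed
print(rank_plain, rank_ridge)
```

In H2O's Python API the corresponding setting is the `lambda_` argument of `H2OGeneralizedLinearEstimator`. Per the passage quoted above, `lambda_=0` is the case where one category is left out.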

By the way, it seems that all the other algorithms let you control the categorical encoding: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/categorical_encoding.html