h2o DRF unseen categorical values handling

The documentation for DRF states:

What happens when you try to predict on a categorical level not seen during training? DRF converts a new categorical level to a NA value in the test set, and then splits left on the NA value during scoring. The algorithm splits left on NA values because, during training, NA values are grouped with the outliers in the left-most bin.

Questions:

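  1. So h2o converts unseen levels to NA and then treats them the same way as NAs in the training data. But what happens if the training data contains no NAs either?
  2. Suppose my categorical predictor is of type enum and is treated as non-ordinal. What does "grouped with the outliers in the left-most bin" mean in that case? If the predictor is unordered, there is no "left-most" and there are no "outliers".
  3. Setting questions 1 and 2 aside, consider the part "The algorithm splits left on NA values because, during training, NA values are grouped with the outliers in the left-most bin." This contradicts what a single DRF tree derived from a MOJO shows: NAs can clearly be seen going both left and right. It also contradicts the answer to another question in the documentation, which says missing values are treated "as a separate category [...] can go either left or right"; see: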

How does the algorithm handle missing values during training? Missing values are interpreted as containing information (i.e., missing for a reason), rather than missing at random. During tree building, split decisions for every node are found by minimizing the loss function and treating missing values as a separate category that can go either left or right.
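To make the quoted behavior concrete, here is a minimal sketch of choosing the NA direction for one candidate split by comparing the loss on both sides. It is an illustration only, not H2O's implementation: the function names `sse` and `best_na_direction` are hypothetical, and squared error is assumed as the loss.

```python
def sse(values):
    """Sum of squared errors around the mean; 0.0 for an empty list."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_na_direction(left, right, missing):
    """Return the side ('left' or 'right') that gives the lower total
    squared error when the rows with a missing predictor join it.

    left, right, missing: lists of response values for the rows that go
    left, go right, or have a missing value for the split column.
    """
    loss_if_left = sse(left + missing) + sse(right)
    loss_if_right = sse(left) + sse(right + missing)
    # Assumption: ties are broken towards the left.
    return "left" if loss_if_left <= loss_if_right else "right"


left_rows = [1.0, 1.2]
right_rows = [5.0, 5.2]
print(best_na_direction(left_rows, right_rows, [4.9, 5.1]))
print(best_na_direction(left_rows, right_rows, [1.1]))
```

Under this reading, the NA direction depends on the data at each node, which would explain NAs going left at some splits and right at others.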

The last point is more of a suggestion than a question. The documentation on missing values for GBM says:

What happens when you try to predict on a categorical level not seen during training? Unseen categorical levels are turned into NAs, and thus follow the same behavior as an NA. If there are no NAs in the training data, then unseen categorical levels in the test data follow the majority direction (the direction with the most observations). If there are NAs in the training data, then unseen categorical levels in the test data follow the direction that is optimal for the NAs of the training data.

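Compared with the description of how DRF handles missing values, this appears to be entirely consistent. Also, following the majority path seems more natural than always going left at the split point.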

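The sentence you pointed out does appear to contradict other parts of the documentation and is in fact outdated. I created a Jira ticket to update the FAQ with the correct answer (which is what you see in the GBM missing-values section; that is, missing-value handling is the same for GBM and DRF).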
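The rule from the GBM FAQ quoted in the question can be written down as a small routing function. This is only a sketch of the documented behavior, not H2O code: the `Split` class, its fields, and the tie-breaking towards the left are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Split:
    """Hypothetical stand-in for one categorical split in a trained tree."""
    left_levels: FrozenSet[str]
    right_levels: FrozenSet[str]
    n_left: int    # training rows sent left
    n_right: int   # training rows sent right
    na_direction: Optional[str] = None  # set only if NAs were seen in training


def route(split, level):
    """Return 'left' or 'right' for a value at scoring time.

    A level of None (missing) or any level not seen in training is
    handled as an NA.
    """
    if level in split.left_levels:
        return "left"
    if level in split.right_levels:
        return "right"
    if split.na_direction is not None:
        # NAs in training: follow the direction learned for them.
        return split.na_direction
    # No NAs in training: follow the majority direction.
    # Assumption: ties go left.
    return "left" if split.n_left >= split.n_right else "right"


no_na = Split(frozenset({"a"}), frozenset({"b", "c"}), n_left=30, n_right=70)
with_na = Split(frozenset({"a"}), frozenset({"b", "c"}), 30, 70, na_direction="left")
print(route(no_na, "zzz"))    # unseen level, no NAs in training
print(route(with_na, "zzz"))  # unseen level, NAs were seen in training
```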

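As a side note, the enum data type is encoded internally as numeric values; you can learn more about the types of mappings H2O can use here: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/categorical_encoding.html. For example, after Enum maps the strings to integers, {0, 1, 2, 3, 4, 5} can be split into {0, 4, 5} and {1, 2, 3}.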

Or see here for how h2o-3 bins categoricals: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/gbm-faq/histograms_and_binning.html
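As a rough illustration of the enum example above, the sketch below maps string levels to integer codes and then splits on set membership rather than on a threshold, so no ordering of the levels is implied. The helper names are hypothetical, and assigning codes in sorted order of the level names is an assumption made for the example.

```python
def encode_levels(values):
    """Map each distinct string level to an integer code.

    Assumption: codes follow the sorted order of the level names.
    """
    return {level: code for code, level in enumerate(sorted(set(values)))}


def goes_left(code, left_codes):
    """A categorical split is a set-membership test, not a threshold."""
    return code in left_codes


codes = encode_levels(["a", "b", "c", "d", "e", "f"])
left_codes = {0, 4, 5}  # the partition from the text; {1, 2, 3} goes right
for level, code in codes.items():
    print(level, code, "left" if goes_left(code, left_codes) else "right")
```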