H2O model wrongly treating a field as numerical when it was trained with enum type?

I am having an issue with an H2O DRF model treating a field as an int even though the field type was set to enum when the model was trained.

When using the H2O tree API to examine some of the individual trees in a trained DRF model, I can see that some fields that were explicitly set as enum when the model was trained (i.e. when the pandas dataframe was converted to an H2OFrame, certain fields were assigned specific types via the column_types mapping parameter) appear to be treated as ints when doing something like:
root_node.features
> observe that the feature being examined for this node is one of the features set to be categorical enum by the H2OFrame that the model was trained on
tree.root_node.features
> some_categorical
tree.root_node.levels
> []
root_node.threshold
> some number

More compactly:

print(tree.root_node)

Node ID 0 
Left child node ID = 1 Right child node ID = 2 
Splits on column some_categorical 
Split threshold < 2562.5 to the left node, >= 2562.5 to the right node 
NA values go to the LEFT

But for other nodes (in the same model) we (correctly) see:

tree.root_node.features
> some_other_categorical
tree.root_node.levels
> ['cat1', ..., 'catn']
root_node.threshold
> na

Initially I thought the field merely looked like an int because of the way categorical values are represented internally in H2O:

enum or Enum: Leave the dataset as is, internally map the strings to integers, and use these integers to make splits - either via ordinal nature when nbins_cats is too small to resolve all levels or via bitsets that do a perfect group split. Each category is a separate category; its name (or number) is irrelevant. For example, after the strings are mapped to integers for Enum, you can split {0, 1, 2, 3, 4, 5} as {0, 4, 5} and {1, 2, 3}.
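To make the quoted documentation concrete, here is a minimal pure-Python sketch of the two split styles it describes. It is not H2O's implementation: the function names are made up for illustration, and assigning integer codes in sorted order is an assumption.

```python
def encode_levels(values):
    """Map each distinct string to an integer code.

    Assumption: codes follow sorted order of the distinct levels.
    """
    domain = sorted(set(values))
    codes = {level: i for i, level in enumerate(domain)}
    return domain, codes


def bitset_split(code, left_codes):
    """Group split: any subset of codes may be sent to the left child."""
    return "left" if code in left_codes else "right"


def ordinal_split(code, threshold):
    """Ordinal split: the integer codes are compared against a threshold."""
    return "left" if code < threshold else "right"


domain, codes = encode_levels(["a", "b", "c", "d", "e", "f"])

# The documentation's example: {0, 4, 5} on one side, {1, 2, 3} on the other.
left_group = {0, 4, 5}
bitset_result = {lvl: bitset_split(codes[lvl], left_group) for lvl in domain}

# The same levels split ordinally at a (hypothetical) threshold of 2.5.
ordinal_result = {lvl: ordinal_split(codes[lvl], 2.5) for lvl in domain}

print(bitset_result)
print(ordinal_result)
```

In the group split the level names and their order do not matter, whereas the ordinal split can only separate a prefix of the codes from the rest, which is what a "threshold" on a categorical column would look like.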

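But since the printed output shows a less-than/greater-than threshold and no levels deciding the left/right direction, you can see that something else is going on here.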

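Checking the column_types mapping used in the pandas-to-H2OFrame conversion and printing the types before training the model, we can see that the appropriate columns are set to enum, so the output above is confusing. Does anyone know of any other debugging steps I could try here, or what could be happening?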

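This is not a bug in the algorithm (the splits are still correct), but rather in the way H2O-3 represents splits in the MOJO Tree Visualizer and the tree API. I created a JIRA ticket, which you can track here (or add to), that will make sure the MOJO Tree Visualizer and tree API splits are less confusing (i.e., they either use the numeric split or show the list of categorical levels, not both). The numeric split you are seeing corresponds to our internal method for performing a categorical split.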
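As a rough illustration of how such a numeric split could be read back as categorical levels, the sketch below assumes the threshold is compared against each level's index in the column's domain. That interpretation, the helper name `levels_for_threshold`, and the generated 5000-level domain are all assumptions for illustration; in a real session the domain would come from the training frame.

```python
def levels_for_threshold(domain, threshold):
    """Partition a categorical domain by comparing level indices to a threshold.

    Assumption: index < threshold goes left, index >= threshold goes right,
    mirroring the printed node description in the question.
    """
    left = [lvl for i, lvl in enumerate(domain) if i < threshold]
    right = [lvl for i, lvl in enumerate(domain) if i >= threshold]
    return left, right


# Hypothetical domain with many levels: cat0000 ... cat4999.
domain = [f"cat{i:04d}" for i in range(5000)]

left, right = levels_for_threshold(domain, 2562.5)

print(len(left), len(right))   # 2563 2437
print(left[-1], right[0])      # cat2562 cat2563
```

Under this reading, a threshold as large as 2562.5 would suggest a column with thousands of levels, which may be the "ordinal nature when nbins_cats is too small to resolve all levels" case from the documentation quoted in the question.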