What is the h2o dataframe GroupBy sum function doing when summing enum/categorical type columns?

Question

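I would like to know what happens when a column in an h2o dataframe GroupBy object is summed and that column is categorical (specifically, the h2o enum type).

I convert a pandas dataframe to an h2o dataframe, then group the rows by one column and sum the other columns. For example, starting from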

location_id  price store
------------------
1            10    JCP
1            15    SBUX
3            20    HOL

then, after grouping and summing with df.group_by('location_id').sum(['price', 'store']), I get

location_id  price store
------------------
1            25    <some number>
3            20    <some number>

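I would like to know what happens under the surface when the categorical column values are added together, but I cannot find the source code for the GroupBy object's sum() in the h2o docs.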

Answer 1

Looking at the h2o documentation on categorical encoding (http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/categorical_encoding.html), for the enum type (the categorical type used in my h2o dataframe) we see:

enum or Enum: Leave the dataset as is, internally map the strings to integers, and use these integers to make splits - either via ordinal nature when nbins_cats is too small to resolve all levels or via bitsets that do a perfect group split. Each category is a separate category; its name (or number) is irrelevant. For example, after the strings are mapped to integers for Enum, you can split {0, 1, 2, 3, 4, 5} as {0, 4, 5} and {1, 2, 3}.

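So if my interpretation is correct (someone please tell me if it is not), when the pandas dataframe is converted to an h2o frame, h2o goes through the distinct values of each column typed as enum and assigns every label a unique internal integer ID (used for training, prediction, and so on, but normally hidden from us). So when df.group_by(.).sum(.) is run on those enum columns, we are simply adding up the internal integer codes that h2o assigned to those columns when the h2o dataframe was created.

Again, if this is not the most complete explanation of what is happening here, please let me know.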
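
To make the interpretation above concrete, here is a minimal pure-Python sketch of "map the strings to integers, then sum the integers per group". It does not call h2o at all. The helper names (encode_levels, group_sum) are hypothetical, and assigning codes in sorted order of the labels is an assumption made for illustration; the actual codes h2o assigns may differ.

```python
from collections import defaultdict

# The example rows from the question: (location_id, price, store).
rows = [
    (1, 10, "JCP"),
    (1, 15, "SBUX"),
    (3, 20, "HOL"),
]


def encode_levels(values):
    """Map each distinct label to an integer code.

    Assumption: codes follow the sorted order of the labels. This only
    stands in for whatever internal mapping h2o really uses.
    """
    return {level: code for code, level in enumerate(sorted(set(values)))}


def group_sum(rows, codes):
    """Sum price and the integer store code for each location_id."""
    totals = defaultdict(lambda: [0, 0])
    for location_id, price, store in rows:
        totals[location_id][0] += price
        totals[location_id][1] += codes[store]  # integer code, not the label
    return {key: tuple(value) for key, value in totals.items()}


store_codes = encode_levels(store for _, _, store in rows)
summed = group_sum(rows, store_codes)

print(store_codes)
print(summed)
```

Under that assumed mapping, the number in the store column is just the sum of the codes of the labels in each group, so it carries no meaning as a quantity. In h2o itself, calling levels() on the enum column lists its labels, which is a starting point for checking whether the summed values match this reading.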


h2o