Matching Apache ORC file column names to the column statistics

How do you link an ORC file's ColumnStatistics with the column names defined in the schema (TypeDescription) using Java?

    Reader reader = OrcFile.createReader(ignored);
    TypeDescription schema = reader.getSchema();
    ColumnStatistics[] stats = reader.getStatistics();

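The column statistics contain the stats for all column types in a single flat array. The schema, however, is a tree of schemas. Are the column statistics a tree traversal (depth-first?) of the schema?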

I tried using orc-statistics, but it only outputs column IDs.

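It turns out that the file statistics match a DFS (pre-order) traversal of the schema. The traversal includes intermediate schemas that hold no data themselves, like Struct and List. The traversal also includes the overall schema as the first node. The ORC Specification v1 documentation explains this: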

The type tree is flattened in to a list via a pre-order traversal where each type is assigned the next id. Clearly the root of the type tree is always type id 0. Compound types have a field named subtypes that contains the list of their children’s type ids.
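
To make the numbering concrete, here is a minimal, library-free sketch of that pre-order flattening. The `Node` class, the `PreOrderIds` class, and the example tree are hypothetical stand-ins for illustration, not part of the ORC API; the `::` and `<CATEGORY>` labels simply mirror the naming convention used in the code further down. The index of each label in the returned list plays the role of the type id, and under the behavior described above it would also be the index into the statistics array.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of pre-order id assignment; not ORC code. */
final class PreOrderIds {

  /** A minimal tree node: a label plus its ordered children. */
  static final class Node {
    final String label;
    final List<Node> children = new ArrayList<>();

    Node(String label, Node... kids) {
      this.label = label;
      for (Node kid : kids) {
        children.add(kid);
      }
    }
  }

  /** Returns the labels in pre-order; each label's list index is its id. */
  static List<String> flatten(Node root) {
    List<String> out = new ArrayList<>();
    visit(root, out);
    return out;
  }

  private static void visit(Node node, List<String> out) {
    out.add(node.label); // the parent takes the next id first
    for (Node child : node.children) {
      visit(child, out); // then each subtree, in declaration order
    }
  }

  /** Models struct<a:int, b:struct<c:string, d:array<double>>, e:bigint>. */
  static Node exampleTree() {
    return new Node(
        "<ROOT>",
        new Node("a"),
        new Node("b", new Node("b::c"), new Node("b::d", new Node("b::d::<DOUBLE>"))),
        new Node("e"));
  }

  public static void main(String[] args) {
    List<String> names = flatten(exampleTree());
    for (int id = 0; id < names.size(); id++) {
      System.out.println(id + " " + names.get(id));
    }
  }
}
```

Note that the struct `b` and the list `b::d` each take an id of their own even though they hold no values directly, which is why the flat array has more entries than the schema has leaf columns.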

Full code to get a flattened list of schema names from an ORC TypeDescription:

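import static com.google.common.collect.ImmutableList.toImmutableList;

import com.google.common.collect.ImmutableList;
import com.google.common.collect.Lists;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import org.apache.orc.TypeDescription;

final class OrcSchemas {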
  private OrcSchemas() {}

  /**
   * Returns all schema names in a depth-first traversal of schema.
   *
   * <p>The given schema is represented as '<ROOT>'. Intermediate, unnamed schemas like
   * StructColumnVector and ListColumnVector are represented using their category, like:
   * 'parent::<STRUCT>::field'.
   *
   * <p>This method is useful because some Orc file methods like statistics return all column stats
   * in a single flat array. The single flat array is a depth-first traversal of all columns in a
   * schema, including intermediate columns like structs and lists.
   */
  static ImmutableList<String> flattenNames(TypeDescription schema) {
    // getChildren() may return null for primitive types, so use the null-safe helper.
    if (getChildren(schema).isEmpty()) {
      return ImmutableList.of();
    }
    ArrayList<String> names = Lists.newArrayListWithExpectedSize(schema.getChildren().size());
    names.add("<ROOT>");
    mutateAddNamesDfs("", schema, names);
    return ImmutableList.copyOf(names);
  }

  private static void mutateAddNamesDfs(
      String parentName, TypeDescription schema, List<String> dfsNames) {
    String separator = "::";
    ImmutableList<String> schemaNames = getFieldNames(parentName, schema);
    ImmutableList<TypeDescription> children = getChildren(schema);
    for (int i = 0; i < children.size(); i++) {
      String name = schemaNames.get(i);
      dfsNames.add(name);
      TypeDescription childSchema = children.get(i);
      mutateAddNamesDfs(name + separator, childSchema, dfsNames);
    }
  }

  private static ImmutableList<TypeDescription> getChildren(TypeDescription schema) {
    return Optional.ofNullable(schema.getChildren())
        .map(ImmutableList::copyOf)
        .orElse(ImmutableList.of());
  }

  private static ImmutableList<String> getFieldNames(String parentName, TypeDescription schema) {
    // Only structs carry field names; getFieldNames() can throw a NullPointerException
    // for other categories, so branch on the category instead of catching the exception.
    if (schema.getCategory() == TypeDescription.Category.STRUCT) {
      return schema.getFieldNames().stream()
          .map(n -> parentName + n)
          .collect(toImmutableList());
    }
    // Lists, maps, and unions have unnamed children, so label each child by its category.
    // Primitive types have no children, which yields an empty list.
    return getChildren(schema).stream()
        .map(child -> parentName + "<" + child.getCategory() + ">")
        .collect(toImmutableList());
  }
}