Matching Apache Orc file column names to the column statistics

Question

How do you link an Orc file's ColumnStatistics with the column names defined in the schema (TypeDescription) using Java?

    Reader reader = OrcFile.createReader(ignored);
    TypeDescription schema = reader.getSchema();
    ColumnStatistics[] stats = reader.getStatistics();

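The column statistics hold the statistics for every column type in a single flat array. The schema, however, is a tree of schemas. Are the column statistics a tree traversal (depth-first?) of the schema?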

I tried using orc-statistics, but it only outputs column IDs.

Answer 1

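It turns out the file statistics match a DFS traversal of the schema. The traversal includes intermediate schemas that hold no data, like Struct and List. The traversal also includes the overall schema as its first node. The Orc Specification v1 documentation explains this: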

The type tree is flattened in to a list via a pre-order traversal where each type is assigned the next id. Clearly the root of the type tree is always type id 0. Compound types have a field named subtypes that contains the list of their children’s type ids.
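
To make the pre-order numbering concrete, here is a minimal sketch that flattens a toy type tree. The `Node` class and the `example()` schema are hypothetical stand-ins made up for illustration, not part of the Orc API; only the traversal order is the point.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of a type tree; NOT the Orc TypeDescription class. */
final class PreOrderSketch {
  private PreOrderSketch() {}

  /** Hypothetical node: a label plus its ordered children. */
  static final class Node {
    final String label;
    final List<Node> children;

    Node(String label, Node... children) {
      this.label = label;
      this.children = Arrays.asList(children);
    }
  }

  /** Returns labels in pre-order; each label's list index plays the role of its type id. */
  static List<String> flatten(Node root) {
    List<String> out = new ArrayList<>();
    visit(root, out);
    return out;
  }

  private static void visit(Node node, List<String> out) {
    // The parent takes the next id before any of its children.
    out.add(node.label);
    // Then each subtree is numbered in full, left to right.
    for (Node child : node.children) {
      visit(child, out);
    }
  }

  /** Models struct<a:int,b:struct<c:string,d:double>,e:array<int>>. */
  static Node example() {
    return new Node("root struct",
        new Node("a int"),
        new Node("b struct", new Node("b.c string"), new Node("b.d double")),
        new Node("e list", new Node("e element int")));
  }

  public static void main(String[] args) {
    List<String> flat = flatten(example());
    for (int id = 0; id < flat.size(); id++) {
      System.out.println(id + " -> " + flat.get(id));
    }
  }
}
```

In this toy schema the root struct gets id 0, the nested struct `b` gets id 2 even though it stores no values itself, and the list element gets its own id (6) after the list (5).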

Full code to get a flattened list of schema names from an Orc TypeDescription:

    import static com.google.common.collect.ImmutableList.toImmutableList;

    import com.google.common.collect.ImmutableList;
    import com.google.common.collect.Lists;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Optional;
    import org.apache.orc.TypeDescription;

    final class OrcSchemas {
      private OrcSchemas() {}

      /**
       * Returns all schema names in a depth-first (pre-order) traversal of schema.
       *
       * <p>The given schema is represented as {@code <ROOT>}. Intermediate, unnamed schemas such as
       * the elements of a list are represented using their category, like:
       * {@code parent::<STRUCT>::field}.
       *
       * <p>This method is useful because some Orc file methods like statistics return all column stats
       * in a single flat array. The single flat array is a depth-first traversal of all columns in a
       * schema, including intermediate columns like structs and lists.
       */
      static ImmutableList<String> flattenNames(TypeDescription schema) {
        // getChildren() returns null (not an empty list) for types without children.
        ImmutableList<TypeDescription> children = getChildren(schema);
        if (children.isEmpty()) {
          return ImmutableList.of();
        }
        ArrayList<String> names = Lists.newArrayListWithExpectedSize(children.size() + 1);
        names.add("<ROOT>");
        mutateAddNamesDfs("", schema, names);
        return ImmutableList.copyOf(names);
      }

      private static void mutateAddNamesDfs(
          String parentName, TypeDescription schema, List<String> dfsNames) {
        String separator = "::";
        ImmutableList<String> schemaNames = getFieldNames(parentName, schema);
        ImmutableList<TypeDescription> children = getChildren(schema);
        for (int i = 0; i < children.size(); i++) {
          String name = schemaNames.get(i);
          // Pre-order: record the child before descending into its own children.
          dfsNames.add(name);
          mutateAddNamesDfs(name + separator, children.get(i), dfsNames);
        }
      }

      /** Null-safe wrapper around getChildren(); returns an empty list when there are none. */
      private static ImmutableList<TypeDescription> getChildren(TypeDescription schema) {
        return Optional.ofNullable(schema.getChildren())
            .map(ImmutableList::copyOf)
            .orElse(ImmutableList.of());
      }

      private static ImmutableList<String> getFieldNames(String parentName, TypeDescription schema) {
        List<TypeDescription> children = schema.getChildren();
        // If there are no children, there are definitely no field names.
        if (children == null) {
          return ImmutableList.of();
        }
        // Only structs carry field names. Checking the category up front replaces the original
        // approach of calling getFieldNames() and catching the NullPointerException it threw.
        if (schema.getCategory() == TypeDescription.Category.STRUCT) {
          return schema.getFieldNames().stream()
              .map(n -> parentName + n)
              .collect(toImmutableList());
        }
        // There are children but no names (for example, list elements), so use each child's
        // category as its label.
        return children.stream()
            .map(child -> parentName + "<" + child.getCategory() + ">")
            .collect(toImmutableList());
      }
    }
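
As a usage sketch, the flattened names can be paired with the statistics by index. The helper below is hypothetical (the `StatsByName` class and its `pair` method are not part of Orc) and is kept generic so it runs without the Orc library; with a real file you would presumably pass `OrcSchemas.flattenNames(reader.getSchema())` and `reader.getStatistics()`.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper that pairs flattened schema names with a flat statistics array. */
final class StatsByName {
  private StatsByName() {}

  /**
   * Pairs names.get(i) with stats[i]. A list of entries is returned instead of a map because
   * category-based labels can repeat, e.g. both children of a map<string,string> column.
   */
  static <T> List<Map.Entry<String, T>> pair(List<String> names, T[] stats) {
    // A length mismatch means the names were not produced by the same traversal as the stats.
    if (names.size() != stats.length) {
      throw new IllegalArgumentException(
          "expected " + stats.length + " names but got " + names.size());
    }
    List<Map.Entry<String, T>> paired = new ArrayList<>(stats.length);
    for (int id = 0; id < stats.length; id++) {
      paired.add(new SimpleImmutableEntry<>(names.get(id), stats[id]));
    }
    return paired;
  }
}
```

The explicit length check is a deliberate choice: if the name list and the statistics array ever disagree in size, failing loudly is safer than silently attaching statistics to the wrong column.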

java

columnstore

orc