Why can't Impala read parquet files after Spark SQL's write?

There is a problem with how Spark interprets parquet columns.

I have an Oracle source with a confirmed schema (df.schema() method):

root
  |-- LM_PERSON_ID: decimal(15,0) (nullable = true)
  |-- LM_BIRTHDATE: timestamp (nullable = true)
  |-- LM_COMM_METHOD: string (nullable = true)
  |-- LM_SOURCE_IND: string (nullable = true)
  |-- DATASET_ID: decimal(38,0) (nullable = true)
  |-- RECORD_ID: decimal(38,0) (nullable = true)

It is then saved as Parquet, via the df.write().parquet() method, with the corresponding message type (determined by Spark):

  message spark_schema {
    optional int64 LM_PERSON_ID (DECIMAL(15,0));
    optional int96 LM_BIRTHDATE;
    optional binary LM_COMM_METHOD (UTF8);
    optional binary LM_SOURCE_IND (UTF8);
    optional fixed_len_byte_array(16) DATASET_ID (DECIMAL(38,0));
    optional fixed_len_byte_array(16) RECORD_ID (DECIMAL(38,0));
  }
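
For context, a minimal sketch of the flow described above, assuming a Scala job and hypothetical JDBC connection details (the URL, table name and credentials below are placeholders, not taken from the actual setup):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("oracle-to-parquet").getOrCreate()

// Read the Oracle source over JDBC (all connection options here are placeholders).
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/service")
  .option("dbtable", "LM_PERSON")
  .option("user", "user")
  .option("password", "password")
  .load()

df.printSchema()   // prints the "root |-- ..." schema shown above

// Write to Parquet; Spark decides the physical Parquet types (the message type shown above).
df.write.parquet("hdfs:///dev/ELM/ELM_PS_LM_PERSON")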

My application then generates the table DDL, using a HashMap for the type conversion, for example:

CREATE EXTERNAL TABLE IF NOT EXISTS 
ELM_PS_LM_PERSON (
LM_PERSON_ID DECIMAL(15,0)
,LM_BIRTHDATE TIMESTAMP
,LM_COMM_METHOD STRING
,LM_SOURCE_IND STRING
,DATASET_ID DECIMAL(38,0)
,RECORD_ID DECIMAL(38,0)
) PARTITIONED BY (edi_business_day STRING) STORED AS PARQUET LOCATION '<PATH>'
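
A rough sketch of what such a schema-driven DDL generator could look like; the pattern match below stands in for the HashMap lookup, and the mapping itself is an assumption rather than the application's actual code (it reuses the df from the read step above):

import org.apache.spark.sql.types._

// Assumed Spark-to-Impala type mapping; decimals keep their precision and scale.
def impalaType(dt: DataType): String = dt match {
  case d: DecimalType => s"DECIMAL(${d.precision},${d.scale})"
  case TimestampType  => "TIMESTAMP"
  case StringType     => "STRING"
  case LongType       => "BIGINT"
  case other          => other.simpleString.toUpperCase
}

// Build the column list in the same leading-comma style as the DDL above.
val columns = df.schema.fields
  .map(f => s"${f.name} ${impalaType(f.dataType)}")
  .mkString("\n,")

val ddl =
  s"""CREATE EXTERNAL TABLE IF NOT EXISTS ELM_PS_LM_PERSON (
     |$columns
     |) PARTITIONED BY (edi_business_day STRING) STORED AS PARQUET LOCATION '<PATH>'""".stripMargin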

My problem is that the table cannot be read by Impala, because it does not accept LM_PERSON_ID as a decimal field. The table only reads the parquet file if this column is set to BIGINT.

Query 8d437faf6323f0bb:b7ba295d028c8fbe: 0% Complete (0 out of 1)
File 'hdfs:dev/ELM/ELM_PS_LM_PERSON/part-00000-fcdbd3a5-9c93-490e-a124-c2a327a17a17.snappy.parquet' has an incompatible Parquet schema for column 'rbdshid1.elm_ps_lm_person_2.lm_person_id'. 
Column type: DOUBLE, Parquet schema:
optional int64 LM_PERSON_ID [i:0 d:1 r:0]

How do I know when to substitute BIGINT for a decimal field?

The Parquet message type is logged but not accessible?

Two of the decimal fields are converted to fixed_len_byte_array(16), while LM_PERSON_ID is converted to int64.
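
For reference, the message type can also be read back programmatically from the Parquet footer, which avoids the trial-and-error approach described below; a minimal sketch using the parquet-hadoop API (the file path is a placeholder):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader

// Read the footer of one part file and print its Parquet message type,
// e.g. "optional int64 LM_PERSON_ID (DECIMAL(15,0))" vs. fixed_len_byte_array(16).
val footer = ParquetFileReader.readFooter(
  new Configuration(),
  new Path("hdfs:///dev/ELM/ELM_PS_LM_PERSON/part-00000.snappy.parquet"))

println(footer.getFileMetaData.getSchema)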

The only solution I can think of is to create the table, test whether it returns results, and if not, replace the decimal fields with BIGINT one by one, testing each time.

What am I missing here? Can I enforce a schema for the parquet file's decimal columns?

From the Configuration section of Parquet Files in the official Apache Spark documentation:

spark.sql.parquet.writeLegacyFormat (default: false)

If true, data will be written in a way of Spark 1.4 and earlier. For example, decimal values will be written in Apache Parquet's fixed-length byte array format, which other systems such as Apache Hive and Apache Impala use. If false, the newer format in Parquet will be used. For example, decimals will be written in int-based format. If Parquet output is intended for use with systems that do not support this newer format, set to true.
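
In practice that means setting the property before the write, for example (a minimal sketch; the same flag can also be passed via spark-submit --conf spark.sql.parquet.writeLegacyFormat=true):

// Write decimals in the legacy fixed_len_byte_array layout that Hive and Impala understand.
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")

df.write.parquet("hdfs:///dev/ELM/ELM_PS_LM_PERSON")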

Answer given before the official documentation was updated:

The very similar SPARK-20297 Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala was recently (20/Apr/17 01:59) resolved as Not A Problem.

The gist is to use the spark.sql.parquet.writeLegacyFormat property and write the parquet metadata in the legacy format (which I don't see described in the official documentation under Configuration, and which was reported as an improvement in SPARK-20937).

Data written by Spark is readable by Hive and Impala when spark.sql.parquet.writeLegacyFormat is enabled.

It does follow the newer standard - https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal and I missed the documentation. Wouldn't it be then bugs in Impala or Hive?

The int32/int64 options were present in the original version of the decimal spec, they just weren't widely implemented: https://github.com/Parquet/parquet-format/commit/b2836e591da8216cfca47075baee2c9a7b0b9289 . So its not a new/old version thing, it was just an alternative representation that many systems didn't implement.

SPARK-10400 may also be quite a helpful read (on the history of the spark.sql.parquet.writeLegacyFormat property):

We introduced SQL option "spark.sql.parquet.followParquetFormatSpec" while working on implementing Parquet backwards-compatibility rules in SPARK-6777. It indicates whether we should use legacy Parquet format adopted by Spark 1.4 and prior versions or the standard format defined in parquet-format spec. However, the name of this option is somewhat confusing, because it's not super intuitive why we shouldn't follow the spec. Would be nice to rename it to "spark.sql.parquet.writeLegacyFormat" and invert its default value (they have opposite meanings). Note that this option is not "public" (isPublic is false).