Spark SQL ignores the parquet.compression property specified in TBLPROPERTIES

Question

I need to create a Hive table from Spark SQL that is stored in PARQUET format with SNAPPY compression. The following code creates the table in PARQUET format, but with GZIP compression:

hiveContext.sql("create table NEW_TABLE stored as parquet tblproperties ('parquet.compression'='SNAPPY') as select * from OLD_TABLE")

Yet in Hue, "Metastore Tables" -> TABLE -> "Properties" still shows:

|  Parameter            |  Value   |
| ================================ |
|  parquet.compression  |  SNAPPY  |

If I change SNAPPY to any other string, e.g. ABCDE, the code still runs fine, except that the compression is still GZIP:

hiveContext.sql("create table NEW_TABLE stored as parquet tblproperties ('parquet.compression'='ABCDE') as select * from OLD_TABLE")

And Hue, "Metastore Tables" -> TABLE -> "Properties" shows:

|  Parameter            |  Value   |
| ================================ |
|  parquet.compression  |  ABCDE   |

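This makes me think that TBLPROPERTIES is simply ignored by Spark SQL.

Note: I also tried running the same query directly from Hive. When the property was set to SNAPPY, the table was created successfully with the proper compression (i.e. SNAPPY, not GZIP).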

create table NEW_TABLE stored as parquet tblproperties ('parquet.compression'='ABCDE') as select * from OLD_TABLE

When the property was ABCDE, the query did not fail, but the table was not created.

So the question is: what is the problem?

Answer 1

Straight from the Spark documentation:

When reading from and writing to Hive metastore Parquet tables, Spark SQL will try to use its own Parquet support instead of Hive SerDe for better performance.

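Just below that passage you will find the properties that control whether Spark enforces all Hive options (at a cost in performance), i.e. spark.sql.hive.convertMetastoreParquet, and how it handles raw reads and writes on Parquet files, such as spark.sql.parquet.compression.codec (gzip by default, so you should not be surprised) or spark.sql.parquet.int96AsTimestamp.

In any case, the "default compression" property is only indicative. Within the same table and directory, each Parquet file may have its own compression settings, as well as its own page size, HDFS block size, and so on.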
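As a rough illustration of the point above, the sketch below sets spark.sql.parquet.compression.codec on the context before running the CTAS statement, instead of relying on TBLPROPERTIES. It assumes the write goes through Spark's own Parquet writer, as the quoted documentation describes. The helper name `create_parquet_table` is hypothetical, and the only context methods it uses are `setConf` and `sql`.

```python
# Session-level property named in the Spark documentation quoted above.
PARQUET_CODEC_KEY = "spark.sql.parquet.compression.codec"


def create_parquet_table(sql_context, new_table, old_table, codec="snappy"):
    """Hypothetical helper: set the Parquet codec, then run a CTAS.

    `sql_context` is assumed to be a HiveContext/SQLContext-like object
    exposing setConf(key, value) and sql(query).
    """
    # Configure the codec used by Spark's own Parquet writer.
    sql_context.setConf(PARQUET_CODEC_KEY, codec)
    # Same statement as in the question, minus the ignored TBLPROPERTIES.
    query = "create table {} stored as parquet as select * from {}".format(
        new_table, old_table
    )
    sql_context.sql(query)
    return query
```

With the context from the question this would be called as `create_parquet_table(hiveContext, "NEW_TABLE", "OLD_TABLE")`. Whether the setting takes effect may depend on the Spark version and on spark.sql.hive.convertMetastoreParquet, so it is worth checking the written files afterwards.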

Answer 2

This is the combination that worked for me (Spark 2.1.0):

spark.sql("SET spark.sql.parquet.compression.codec=GZIP")
spark.sql("CREATE TABLE test_table USING PARQUET PARTITIONED BY (date) AS SELECT * FROM test_temp_table")

Verified in HDFS:

/user/hive/warehouse/test_table/date=2017-05-14/part-00000-uid.gz.parquet
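The file name check above can be scripted. The sketch below guesses the codec from the suffix that precedes `.parquet` in a part file name, as in the `.gz.parquet` path shown. The function name and the suffix table are assumptions made for illustration. The name is only a hint, because the authoritative codec is recorded in each Parquet file's footer metadata.

```python
import posixpath

# Assumed mapping from file-name suffix to codec; extend as needed.
_SUFFIX_TO_CODEC = {"gz": "gzip", "snappy": "snappy", "lzo": "lzo"}


def codec_from_filename(path):
    """Return the codec hinted at by a Parquet part file name, or None.

    None means the name carries no recognized codec suffix, which may
    simply indicate an uncompressed file.
    """
    name = posixpath.basename(path)
    if not name.endswith(".parquet"):
        return None
    stem = name[: -len(".parquet")]
    # The codec suffix, if present, is the last dot-separated component.
    _, dot, suffix = stem.rpartition(".")
    if not dot:
        return None
    return _SUFFIX_TO_CODEC.get(suffix)
```

For the path listed in this answer, the function returns "gzip", which matches the GZIP codec set in the SET statement.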

hiveql

parquet

apache-spark-sql