Spark reading missing columns in Parquet

Question

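I have Parquet files that I need to read from Spark. Some of the files are missing a few columns that are present in the newer files.

Since I don't know which files are missing columns, I need to read all of the files in Spark. I have a list of the columns I need to read. It is also possible that some columns are missing from every file. I need to put null values in those missing columns.

When I try to run sqlContext.sql('query'), I get an error saying that columns are missing.

If I define the schema and run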

sqlContext.read.parquet('s3://....').schema(parquet_schema)

it gives me the same error.

Please help me out here.

Answer 1

You need to use Parquet schema evolution (schema merging) to handle this situation.

As defined in the Spark documentation:

Users can start with a simple schema, and gradually add more columns to the schema as needed. In this way, users may end up with multiple Parquet files with different but mutually compatible schemas. The Parquet data source is now able to automatically detect this case and merge schemas of all these files.

All you need to do is:

val mergedDF = spark.read.option("mergeSchema", "true").parquet("s3://....")

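This will give you the Parquet data with the complete merged schema. Columns that are absent from a given file are returned as null for that file's rows.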
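Schema merging only covers columns that exist in at least one file. The question also mentions columns that may be missing from every file, so here is a minimal PySpark-style sketch of one way to handle that case: read with mergeSchema, then select the required columns and substitute a typed NULL for any that are still absent. The helper names (null_filled_exprs, read_required_columns) and the shape of the `required` mapping (column name to Spark SQL type) are hypothetical, introduced only for this illustration.

```python
def null_filled_exprs(required, available):
    """Build selectExpr strings for the required columns.

    required:  dict mapping column name -> Spark SQL type name (hypothetical input shape)
    available: iterable of column names actually present in the DataFrame
    """
    present = set(available)
    exprs = []
    for name, sql_type in required.items():
        if name in present:
            # Column exists after schema merging: pass it through unchanged.
            exprs.append(f"`{name}`")
        else:
            # Column is missing from every file: emit a typed NULL instead.
            exprs.append(f"CAST(NULL AS {sql_type}) AS `{name}`")
    return exprs


def read_required_columns(spark, path, required):
    """Read Parquet with schema merging, then keep only the required columns."""
    df = spark.read.option("mergeSchema", "true").parquet(path)
    return df.selectExpr(*null_filled_exprs(required, df.columns))
```

Assuming an existing SparkSession named spark, a call might look like read_required_columns(spark, path, {"id": "bigint", "name": "string"}). The name comparison here is an exact string match, so column names that differ only by case would be treated as missing.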

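Caveat

If your schemas are incompatible, for example one Parquet file has col1 with data type String and another Parquet file has col1 with data type Long, then the schema merge will fail.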
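One possible workaround for such a type conflict, which is not part of the answer above, is to read each path separately, cast the conflicting columns to a common type, and then combine the results by column name. The sketch below assumes Spark 3.1 or later (for the allowMissingColumns argument of unionByName); the function name read_and_align and the target_types mapping are hypothetical.

```python
from functools import reduce


def read_and_align(spark, paths, target_types):
    """Read each Parquet path on its own, cast conflicting columns, then union.

    paths:        list of Parquet paths whose schemas may disagree on types
    target_types: dict mapping column name -> Spark SQL type to cast to (hypothetical)
    """
    frames = []
    for path in paths:
        df = spark.read.parquet(path)
        for name, sql_type in target_types.items():
            if name in df.columns:
                # Normalise the column to the agreed type before combining.
                df = df.withColumn(name, df[name].cast(sql_type))
        frames.append(df)
    # Columns absent from one side are filled with nulls (Spark 3.1+).
    return reduce(
        lambda left, right: left.unionByName(right, allowMissingColumns=True),
        frames,
    )
```

Reading path by path costs one read per path, so this is better suited to a modest number of directories than to many thousands of individual files.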

apache-spark

parquet