Hive External table on AVRO file producing only NULL data for all columns

Question

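I am trying to create a Hive external table on top of some avro files that were generated using spark-scala. I am using CDH 5.16, which has hive 1.1 and spark 1.6.

I created the hive external table, and the statement ran successfully. But when I query the data, I get NULL for all the columns. My problem is similar to this

After some research, I found that it might be a problem with the schema. But I couldn't find a schema file for these avro files in that location.

I am pretty new to the avro file type. Can someone please help me out here?

Below is the spark code snippet where I save the file as avro: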

df.write.mode(SaveMode.Overwrite).format("com.databricks.spark.avro").save("hdfs:path/user/hive/warehouse/transform.db/prod_order_avro")

Below is my hive external table create statement:

create external table prod_order_avro
(ProductID string,
ProductName string,
categoryname string,
OrderDate string,
Freight string,
OrderID string,
ShipperID string,
Quantity string,
Sales string,
Discount string,
COS string,
GP string,
CategoryID string,
oh_Updated_time string,
od_Updated_time string
)
STORED AS AVRO
LOCATION '/user/hive/warehouse/transform.db/prod_order_avro';

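Below is the result I get when I query the data: select * from prod_order_avro

At the same time, when I read these avro files as a dataframe using spark-scala and print them, I get the proper result. Below is the spark code I used to read the data: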

val df=hiveContext.read.format("com.databricks.spark.avro").option("header","true").load("hdfs:path/user/hive/warehouse/transform.db/prod_order_avro")

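My questions are:

While creating these avro files, do I need to change my spark code to create the schema file separately, or will the schema be embedded in the files? If it needs to be separate, how do I achieve that?
If not, how do I create the hive table so that the schema is picked up from the files automatically? I read that in the latest versions hive takes care of this by itself if the schema is present in the files.

Please help me out.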

Answer 1

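Resolved this. It was a schema issue. The schema was not embedded in the avro files, so I had to extract the schema using avro-tools and pass it while creating the table. It is working now.

I followed the steps below:

Extract a small amount of data from the avro files stored in hdfs into a file on the local system. Below is the command used for this: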

sudo hdfs dfs -cat /path/file.avro | head --bytes 10K > /path/temp.txt
Use the avro-tools getschema command to extract the schema from this data:

avro-tools getschema /path/temp.txt
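Copy the resulting schema (it will be in the form of json data) into a new file with the .avsc extension and upload it to HDFS.
While creating the Hive external table, add the following property to it: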

TBLPROPERTIES('avro.schema.url'='hdfs://path/schema.avsc')
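The steps above can be sketched as one small shell script. This is only an illustration, not the exact commands I ran: the function names, the HDFS_CMD and AVRO_TOOLS_CMD override variables, and the temporary file names are hypothetical, and it assumes, as in the steps above, that the first 10K of the file is enough for avro-tools getschema. The generated DDL leaves out the column list on the assumption that Hive derives the columns from the schema file; keeping the original column list and only adding the property, as I did, is the other option.

```shell
#!/usr/bin/env bash
# Sketch of the answer's steps. HDFS_CMD and AVRO_TOOLS_CMD are hypothetical
# override hooks; they default to the commands used in the answer.
HDFS_CMD="${HDFS_CMD:-hdfs}"
AVRO_TOOLS_CMD="${AVRO_TOOLS_CMD:-avro-tools}"

# extract_avro_schema <avro file on HDFS> <HDFS target path for the .avsc>
extract_avro_schema() {
  local avro_file="$1"
  local avsc_target="$2"
  local tmp_dir status
  tmp_dir="$(mktemp -d)" || return 1

  # Step 1: copy only the first 10K of the data file to the local system.
  "$HDFS_CMD" dfs -cat "$avro_file" | head --bytes 10K > "$tmp_dir/sample.avro"

  # Step 2: print the schema as JSON and save it with an .avsc extension.
  if ! "$AVRO_TOOLS_CMD" getschema "$tmp_dir/sample.avro" > "$tmp_dir/schema.avsc"; then
    rm -rf "$tmp_dir"
    return 1
  fi

  # Step 3: upload the schema file to HDFS (-f overwrites an existing copy).
  "$HDFS_CMD" dfs -put -f "$tmp_dir/schema.avsc" "$avsc_target"
  status=$?
  rm -rf "$tmp_dir"
  return "$status"
}

# create_table_ddl <schema url>
# Step 4: print a CREATE statement that points Hive at the schema file.
create_table_ddl() {
  local schema_url="$1"
  cat <<EOF
CREATE EXTERNAL TABLE prod_order_avro
STORED AS AVRO
LOCATION '/user/hive/warehouse/transform.db/prod_order_avro'
TBLPROPERTIES ('avro.schema.url'='$schema_url');
EOF
}
```

A possible way to use it, with the same placeholder paths as above: run extract_avro_schema /path/file.avro /path/schema.avsc, then hive -e "$(create_table_ddl hdfs://path/schema.avsc)". If the schema is larger than 10K, the truncated sample may not be enough and the byte limit would need to be raised.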

