Appending Spark dataframe to Hive table with different column order

Question

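I'm using pyspark with HiveWarehouseConnector on an HDP3 cluster. The schema changed, so I updated my target table with an "alter table" command, which by default adds the new columns at the last position. Now I'm trying to save a Spark dataframe to it with the following code, but the columns in the dataframe are in alphabetical order and I get the error message below.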

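# Session setup was not shown in the original post; this is the usual HWC way to obtain `hive`
from pyspark_llap import HiveWarehouseSession
hive = HiveWarehouseSession.session(spark).build()
df = spark.read.json(df_sub_path)
hive.setDatabase('myDB')
df.write.format("com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector").mode('append').option('table','target_table').save()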

The error message is:

Caused by: java.lang.IllegalArgumentException: Hive column: column_x cannot be found at same index: 77 in dataframe. Found column_y. Aborting as this may lead to loading of incorrect data.

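Is there any dynamic way to append the dataframe to the correct positions in the Hive table? This matters because I expect more columns to be added to the target table.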

Answer 1

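You can read the target table with no rows to get its columns. Then, using select, you can put the dataframe's columns in the same order and append it: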

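# Query the target table with a filter that matches no rows, just to get its column order
target = hive.executeQuery('select * from target_Table where 1=0')
# `source` is the dataframe to append (`df` in the question)
test = spark.createDataFrame(source.collect())
# Reorder the dataframe's columns to match the Hive table
test = test.select(target.columns)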
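
The reordering step above can also be checked before writing. Below is a minimal sketch, not part of HiveWarehouseConnector: `align_columns` is a hypothetical helper name, and the case-insensitive matching assumes Hive reports lower-cased column names while the JSON source may not. It only works on lists of column names, so it can be tried without a cluster.

```python
def align_columns(df_columns, target_columns):
    """Return the dataframe's column names reordered to match the target table.

    Hypothetical helper (not a HiveWarehouseConnector API). Names are compared
    case-insensitively, assuming Hive reports lower-cased column names.
    """
    by_lower = {name.lower(): name for name in df_columns}
    wanted = [name.lower() for name in target_columns]

    missing = [name for name in wanted if name not in by_lower]
    extra = sorted(set(by_lower) - set(wanted))
    if missing or extra:
        raise ValueError(
            "column mismatch: missing from dataframe %s, not in table %s"
            % (missing, extra)
        )
    # Keep the dataframe's own spelling, but in the table's order.
    return [by_lower[name] for name in wanted]


# Intended use with the objects from the question (needs a live cluster):
#   target = hive.executeQuery('select * from target_table where 1=0')
#   ordered = df.select(align_columns(df.columns, target.columns))
#   ordered.write.format("com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector") \
#       .mode('append').option('table', 'target_table').save()
```

Raising on a mismatch is a deliberate choice here: if the dataframe lacks a newly added column, or carries one the table does not have yet, the write would fail or load wrong data anyway, so it seems safer to stop with a clear message first.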

hive

pyspark

hdp