cloudant-spark connector creates duplicate column names with nested JSON schema

Question

I am using the following JSON schema in my Cloudant database:

{...
 departureWeather:{
    temp:30,
    otherfields:xyz
 },
 arrivalWeather:{
    temp:45,
    otherfields: abc
 }
 ...
}

I then load the data into a DataFrame with the cloudant-spark connector. If I try to select fields like this:

df.select("departureWeather.temp", "arrivalWeather.temp")

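I end up with a DataFrame that has two columns with the same name, temp. It looks like the Spark data source framework flattens each nested name down to its last part only.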

Is there an easy way to de-duplicate the column names?

Answer 1

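You can use aliases (col comes from pyspark.sql.functions in Python, or org.apache.spark.sql.functions in Scala):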

df.select(
    col("departureWeather.temp").alias("departure_temp"),
    col("arrivalWeather.temp").alias("arrival_temp")
)
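
If there are many nested fields, writing each alias by hand gets tedious. The following is a minimal sketch, assuming PySpark, that derives each alias from the full dotted path instead. The helper names flat_alias and select_nested are hypothetical (they are not part of Spark or the cloudant-spark connector), and the underscore separator is just one possible choice.

```python
def flat_alias(path, sep="_"):
    """Turn a dotted field path into a flat column name.

    Hypothetical helper: keeps every segment of the path, so two
    fields that share a leaf name (e.g. "temp") stay distinct.
    """
    return path.replace(".", sep)


def select_nested(df, paths, sep="_"):
    """Select nested fields from a Spark DataFrame, aliasing each one
    by its full path rather than by its last segment.

    Hypothetical helper built on DataFrame.select and Column.alias.
    """
    # Imported lazily so flat_alias can be used without Spark installed.
    from pyspark.sql.functions import col

    return df.select(*[col(p).alias(flat_alias(p, sep)) for p in paths])
```

With the schema from the question, select_nested(df, ["departureWeather.temp", "arrivalWeather.temp"]) should give columns named departureWeather_temp and arrivalWeather_temp. Note that this simple replacement could still collide if a field name already contains the separator.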
