In Hive, how to generate array type data from a table

Question

I have a Hive table with the following columns:

root
 |-- id: string (nullable = true)
 |-- address: string (nullable = true)
 |-- address_id: string (nullable = true)
 |-- bay: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- bay_id: string (nullable = true)
 |    |    |-- section_id: long (nullable = true)

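For a single id there are many related addresses and related bay arrays (this is a Hive Parquet table with an array-type column). I want to generate a new table like this: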

id, array(related_address, related_array, ...)
root
 |-- id: string (nullable = true)
 |-- Address: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- address: string (nullable = true)
 |    |    |-- address_id: string (nullable = true)
 |    |    |-- bay: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- bay_id: string (nullable = true)
 |    |    |    |    |-- section_id: long (nullable = true)

Currently I convert the DataFrame to an RDD to collect the related data:

// Row fields are 0-indexed: field 0 is id, the rest are address, address_id, bay
dt.rdd.map { r =>
  val id = r.getAs[String](0)
  val rest = (1 until r.length).map(r.get)
  (id, Row(rest: _*))
}.groupByKey().map { case (id, rows) => Row(id, rows.toSeq) }

This gives me something like Row[id, Array[related_address, related_array, ...]]. I then build a schema like StructType(structTypeOfId +: ArrayType(relatedAddrType, ...)). Finally, I create the desired DataFrame from that schema and the RDD.
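
To make the target shape concrete, the schema printed earlier in the question could be written out with Spark's type classes roughly as follows. This is only a sketch: the object and val names (TargetSchema, bayType, addressType, targetSchema) are illustrative and do not come from the original code.

```scala
import org.apache.spark.sql.types._

object TargetSchema {
  // bay: array of struct(bay_id: string, section_id: long)
  val bayType: ArrayType = ArrayType(StructType(Seq(
    StructField("bay_id", StringType),
    StructField("section_id", LongType))))

  // One element of the Address array: the non-id columns of the source table
  val addressType: StructType = StructType(Seq(
    StructField("address", StringType),
    StructField("address_id", StringType),
    StructField("bay", bayType)))

  // Final shape: id plus an array of address structs
  val targetSchema: StructType = StructType(Seq(
    StructField("id", StringType),
    StructField("Address", ArrayType(addressType))))
}
```

StructField defaults to nullable = true and ArrayType(elementType) defaults to containsNull = true, which matches the printed schema.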

But how can I get the desired schema using Hive alone? The RDD approach is very, very slow!

Answer 1

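I finally found a way to solve this problem: use Hive's built-in UDFs struct and collect_set. struct packs all the columns passed to it into a new struct, and you can then use collect_set (or collect_list, depending on your requirements) to build an array of those structs. For example: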

select id, collect_set(struct(address, address_id, bay)) as Address from oriTable group by id;

Older versions of collect_set cannot take a struct and only accept primitive columns; newer versions support structs.
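
Since the question starts from a Spark DataFrame, the same aggregation might also be expressed through the DataFrame API without dropping to RDDs. The following is a minimal sketch under stated assumptions: the sample rows, the Bay case class, and the names NestAddresses and nestById are hypothetical, and collect_list is used here, so duplicates are kept (swap in collect_set if they should be removed).

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, collect_list, struct}

// Hypothetical element type of the bay array, mirroring the question's schema
case class Bay(bay_id: String, section_id: Long)

object NestAddresses {
  // Group by id and pack (address, address_id, bay) into an array of structs
  def nestById(df: DataFrame): DataFrame =
    df.groupBy("id")
      .agg(collect_list(struct(col("address"), col("address_id"), col("bay"))).as("Address"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("nest-addresses")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample rows standing in for the original table
    val df = Seq(
      ("id1", "addr A", "a1", Seq(Bay("b1", 1L))),
      ("id1", "addr B", "a2", Seq(Bay("b2", 2L), Bay("b3", 3L))),
      ("id2", "addr C", "a3", Seq(Bay("b4", 4L)))
    ).toDF("id", "address", "address_id", "bay")

    val nested = nestById(df)
    nested.printSchema()
    nested.show(truncate = false)
    spark.stop()
  }
}
```

Because the aggregation stays inside Spark SQL, the result already carries the nested schema and no hand-built StructType is needed.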

arrays

hive

hiveql