将具有整数数组的结构数组压缩到结构列数组中

Zip array of structs with array of ints into array of structs column

我有一个如下所示的数据框:

val sourceData = Seq(
  Row(List(Row("a"), Row("b"), Row("c")), List(1, 2, 3)),
  Row(List(Row("d"), Row("e")), List(4, 5))
)

val sourceSchema = StructType(List(
  StructField("structs", ArrayType(StructType(List(StructField("structField", StringType))))),
  StructField("ints", ArrayType(IntegerType))
))

val sourceDF = sparkSession.createDataFrame(sourceData, sourceSchema)

我想将其转换成如下所示的数据框:

val targetData = Seq(
  Row(List(Row("a", 1), Row("b", 2), Row("c", 3))),
  Row(List(Row("d", 4), Row("e", 5)))
)

val targetSchema = StructType(List(
  StructField("structs", ArrayType(StructType(List(
    StructField("structField", StringType),
    StructField("value", IntegerType)))))
))

val targetDF = sparkSession.createDataFrame(targetData, targetSchema)

到目前为止,我最好的想法是压缩两列,然后 运行 一个将 int 值放入结构中的 UDF。

是否有一种优雅的方式来做到这一点,即不使用 UDF?

您可以在压缩列上使用 array_zip function to zip structs and ints column then you can use transform 函数以获得所需的输出。

sourceDF.withColumn("structs", arrays_zip('structs, 'ints))
    .withColumn("structs",
      expr("transform(structs, s-> struct(s.structs.structField as structField, s.ints as value))"))
    .select("structs")
    .show(false)

+------------------------+
|structs                 |
+------------------------+
|[{a, 1}, {b, 2}, {c, 3}]|
|[{d, 4}, {e, 5}]        |
+------------------------+

使用zip_with函数:

sourceDF.selectExpr(
  "zip_with(structs, ints, (x, y) -> (x.structField as structField, y as value)) as structs"
).show(false)

//+------------------------+
//|structs                 |
//+------------------------+
//|[[a, 1], [b, 2], [c, 3]]|
//|[[d, 4], [e, 5]]        |
//+------------------------+