Zip array of structs with array of ints into array of structs column
I have a DataFrame that looks like this:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val sourceData = Seq(
  Row(List(Row("a"), Row("b"), Row("c")), List(1, 2, 3)),
  Row(List(Row("d"), Row("e")), List(4, 5))
)
val sourceSchema = StructType(List(
  StructField("structs", ArrayType(StructType(List(StructField("structField", StringType))))),
  StructField("ints", ArrayType(IntegerType))
))
// createDataFrame takes an RDD[Row] (or java.util.List[Row]) with a schema, not a Seq[Row]
val sourceDF = sparkSession.createDataFrame(sparkSession.sparkContext.parallelize(sourceData), sourceSchema)
I want to transform it into a DataFrame that looks like this:
val targetData = Seq(
  Row(List(Row("a", 1), Row("b", 2), Row("c", 3))),
  Row(List(Row("d", 4), Row("e", 5)))
)
val targetSchema = StructType(List(
  StructField("structs", ArrayType(StructType(List(
    StructField("structField", StringType),
    StructField("value", IntegerType)))))
))
val targetDF = sparkSession.createDataFrame(sparkSession.sparkContext.parallelize(targetData), targetSchema)
My best idea so far is to zip the two columns and then run a UDF that moves each int value into its struct. Is there an elegant way to do this, i.e. without a UDF?
You can use the arrays_zip function (Spark 2.4+) to zip the structs and ints columns, then use the transform higher-order function on the zipped column to get the desired output.
import org.apache.spark.sql.functions._
import sparkSession.implicits._

sourceDF.withColumn("structs", arrays_zip('structs, 'ints))
  .withColumn("structs",
    expr("transform(structs, s -> struct(s.structs.structField as structField, s.ints as value))"))
  .select("structs")
  .show(false)
+------------------------+
|structs |
+------------------------+
|[{a, 1}, {b, 2}, {c, 3}]|
|[{d, 4}, {e, 5}] |
+------------------------+
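To see why this works, here is the same per-row logic sketched in plain Scala, outside Spark. The case classes are illustrative stand-ins for the struct schemas, not part of the Spark API:

```scala
// Plain-Scala sketch of the per-row logic that arrays_zip + transform perform.
// SourceStruct and TargetStruct are hypothetical, mirroring the schemas above.
case class SourceStruct(structField: String)
case class TargetStruct(structField: String, value: Int)

object ZipSketch {
  // arrays_zip pairs elements positionally; transform reshapes each pair
  // into a new struct carrying both the string field and the int value.
  def zipRow(structs: Seq[SourceStruct], ints: Seq[Int]): Seq[TargetStruct] =
    structs.zip(ints).map { case (s, i) => TargetStruct(s.structField, i) }
}
```

So `zipRow(Seq(SourceStruct("a"), SourceStruct("b")), Seq(1, 2))` yields `Seq(TargetStruct("a", 1), TargetStruct("b", 2))`, which is exactly the reshaping Spark applies to each array element.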
Alternatively, use the zip_with function (Spark 3.0+):
sourceDF.selectExpr(
"zip_with(structs, ints, (x, y) -> (x.structField as structField, y as value)) as structs"
).show(false)
//+------------------------+
//|structs |
//+------------------------+
//|[[a, 1], [b, 2], [c, 3]]|
//|[[d, 4], [e, 5]] |
//+------------------------+