How to update a value in an array of structs in a dataframe in pyspark?

Question

I have the following schema:

>>> df.printSchema()
root
... SNIP ...
 |-- foo: array (nullable = true)
 |    |-- element: struct (containsNull = true)
... SNIP ...
 |    |    |-- value: double (nullable = true)
 |    |    |-- value2: double (nullable = true)

In this case, the dataframe has only one row and the foo array has only one element:

>>> df.count()
1
>>> df.select(explode('foo').alias("fooColumn")).count()
1

value is null:

>>> df.select(explode('foo').alias("fooColumn")).select('fooColumn.value','fooColumn.value2').show()
+-----+------+
|value|value2|
+-----+------+
| null|  null|
+-----+------+

I want to edit value and make a new dataframe. I can explode foo and set value:

>>> fooUpdated = df.select(explode("foo").alias("fooColumn")).select("fooColumn.*").withColumn('value', lit(10))
>>> fooUpdated.select('value').show()
+-----+
|value|
+-----+
|   10|
+-----+

How do I collapse this dataframe to put fooUpdated back as an array of structs, or is there a way to do this without exploding foo?

In the end, this is what I want:

>>> dfUpdated.select(explode('foo').alias("fooColumn")).select('fooColumn.value', 'fooColumn.value2').show()
+-----+------+
|value|value2|
+-----+------+
|   10|  null|
+-----+------+

Answer 1

You can use the transform function to update each struct in the foo array.

Here is an example:

import pyspark.sql.functions as F

df.printSchema()

#root
# |-- foo: array (nullable = true)
# |    |-- element: struct (containsNull = true)
# |    |    |-- value: string (nullable = true)
# |    |    |-- value2: long (nullable = true)

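# transform is a Spark SQL higher-order function (available since Spark 2.4).
# It rebuilds every struct in foo; coalesce replaces value only where it is null.
df1 = df.withColumn(
    "foo",
    F.expr("transform(foo, x -> struct(coalesce(x.value, 10) as value, x.value2 as value2))")
)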

Now you can display the values in df1 to verify that they were updated:

df1.select(F.expr("inline(foo)")).show()
#+-----+------+
#|value|value2|
#+-----+------+
#|   10|    30|
#+-----+------+

apache-spark

pyspark

apache-spark-sql