PySpark: get an element from an array column of structs based on a condition

Question

I have a Spark DataFrame with the following schema:

 |-- col1 : string
 |-- col2 : string
 |-- customer: struct
 |    |-- smt: string
 |    |-- attributes: array (nullable = true)
 |    |    |-- element: struct
 |    |    |     |-- key: string
 |    |    |     |-- value: string

df:

#+-------+-------+---------------------------------------------------------------------------+
#|col1   |col2   |customer                                                                   |
#+-------+-------+---------------------------------------------------------------------------+
#|col1_XX|col2_XX|"attributes":[[{"key": "A", "value": "123"},{"key": "B", "value": "456"}]  |
#+-------+-------+---------------------------------------------------------------------------+

The JSON input for the array looks like this:

...
          "attributes": [
            {
              "key": "A",
              "value": "123"
            },
            {
              "key": "B",
              "value": "456"
            }
          ],

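I want to loop over the attributes array, get the element with key="B", and then select the corresponding value. I don't want to use explode because I want to avoid joining DataFrames. Is it possible to perform this kind of operation directly on a Spark 'Column'?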

The expected output is:

#+-------+-------+-----+
#|col1   |col2   |B    |
#+-------+-------+-----+
#|col1_XX|col2_XX|456  |
#+-------+-------+-----+

Any help would be appreciated.

Answer 1

You can use the filter function to filter the array of structs, then get the value of the first match:

from pyspark.sql import functions as F

df2 = df.withColumn(
    "B", 
    F.expr("filter(customer.attributes, x -> x.key = 'B')")[0]["value"]
)

python

dataframe

apache-spark

apache-spark-sql

pyspark