Explode struct column which isn't array in Pyspark

I have a struct column (parsed from JSON) with the following schema:

struct_col_name
 |    |-- 2053771759: struct (nullable = true)
 |    |    |-- col1: long (nullable = true)
 |    |    |-- col2: string (nullable = true)
 |    |    |-- col3: long (nullable = true)
 |    |    |-- col4: string (nullable = true)
 |    |-- 2053771760: struct (nullable = true)
 |    |    |-- col1: long (nullable = true)
 |    |    |-- col2: string (nullable = true)
 |    |    |-- col3: long (nullable = true)
 |    |    |-- col4: string (nullable = true)
 |    |-- 2053771761: struct (nullable = true)
 |    |    |-- col1: long (nullable = true)
 |    |    |-- col2: string (nullable = true)
 |    |    |-- col3: long (nullable = true)
 |    |    |-- col4: string (nullable = true)

Since all the inner structs have the same fields, I want to transform the schema into something like the following, adding the id (e.g. 2053771759) as a field of each element.

That way, I can explode the column into rows.

struct_col_name
 |    |-- element: struct (nullable = true)
 |    |    |-- col1: long (nullable = true)
 |    |    |-- col2: string (nullable = true)
 |    |    |-- col3: long (nullable = true)
 |    |    |-- col4: string (nullable = true)
 |    |    |-- id: long (nullable = true)

Any idea how I can do that? Or is there another way to flatten the column?

First, you can get the list of ids from the columns of df.select("struct_col_name.*"). Then, iterate over that list to build one struct per id, adding the id as a field alongside the existing ones, and collect those structs into an array column (explode only operates on array and map columns, hence the intermediate array). Finally, explode the array to get one struct per row.

Like this:

from pyspark.sql import functions as F

inner_fields = ["col1", "col2", "col3", "col4"]

# The ids are the field names of the outer struct.
ids = df.select("struct_col_name.*").columns

df = df.select(
    F.explode(F.array(*[
        F.struct(
            # copy the inner fields of each id's struct...
            *[F.col(f"struct_col_name.{i}.{c}") for c in inner_fields],
            # ...and add the id itself, cast to long to match the desired schema
            F.lit(i).cast("long").alias("id"),
        )
        for i in ids
    ])).alias("struct_col_name")
)
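
For a quick end-to-end check, here is a minimal, self-contained sketch. The sample JSON document, the local SparkSession, and the result variable name are illustrative assumptions, not part of the question:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Hypothetical sample document with the same nested shape as the question.
sample = """{"struct_col_name": {
    "2053771759": {"col1": 1, "col2": "a", "col3": 10, "col4": "x"},
    "2053771760": {"col1": 2, "col2": "b", "col3": 20, "col4": "y"}}}"""
df = spark.read.json(spark.sparkContext.parallelize([sample]))

inner_fields = ["col1", "col2", "col3", "col4"]
ids = df.select("struct_col_name.*").columns  # ['2053771759', '2053771760']

result = df.select(
    F.explode(F.array(*[
        F.struct(
            *[F.col(f"struct_col_name.{i}.{c}") for c in inner_fields],
            F.lit(i).cast("long").alias("id"),
        )
        for i in ids
    ])).alias("struct_col_name")
)

result.printSchema()
# Expected shape (nullability flags omitted):
# root
#  |-- struct_col_name: struct
#  |    |-- col1: long
#  |    |-- col2: string
#  |    |-- col3: long
#  |    |-- col4: string
#  |    |-- id: long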