Explode struct column which isn't array in Pyspark

I have a struct column (parsed from JSON) with the following schema:

struct_col_name
 |    |-- 2053771759: struct (nullable = true)
 |    |    |-- col1: long (nullable = true)
 |    |    |-- col2: string (nullable = true)
 |    |    |-- col3: long (nullable = true)
 |    |    |-- col4: string (nullable = true)
 |    |-- 2053771760: struct (nullable = true)
 |    |    |-- col1: long (nullable = true)
 |    |    |-- col2: string (nullable = true)
 |    |    |-- col3: long (nullable = true)
 |    |    |-- col4: string (nullable = true)
 |    |-- 2053771761: struct (nullable = true)
 |    |    |-- col1: long (nullable = true)
 |    |    |-- col2: string (nullable = true)
 |    |    |-- col3: long (nullable = true)
 |    |    |-- col4: string (nullable = true)

Since all the inner structs have the same fields, I want to transform the schema into something like the following, adding the id (e.g. 2053771759) as a field of each element.

That way, I can explode the column into rows.

struct_col_name
 |    |-- element: struct (nullable = true)
 |    |    |-- col1: long (nullable = true)
 |    |    |-- col2: string (nullable = true)
 |    |    |-- col3: long (nullable = true)
 |    |    |-- col4: string (nullable = true)
 |    |    |-- id: long (nullable = true)

Any idea how I can do that? Or is there another way to flatten the column?

First, you can get the list of ids from the columns of df.select("struct_col_name.*"). Then, iterate over that list to build one struct per id, adding the id as a field alongside the existing ones, and collect those structs into an array column (explode only operates on array and map columns, hence the intermediate array). Finally, explode the array to get one struct per row.

Like this:

from pyspark.sql import functions as F

inner_fields = ["col1", "col2", "col3", "col4"]

# The ids are the field names of the outer struct.
ids = df.select("struct_col_name.*").columns

df = df.select(
    F.explode(F.array(*[
        F.struct(
            # copy the inner fields of each id's struct...
            *[F.col(f"struct_col_name.{i}.{c}") for c in inner_fields],
            # ...and add the id itself, cast to long to match the desired schema
            F.lit(i).cast("long").alias("id"),
        )
        for i in ids
    ])).alias("struct_col_name")
)
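
For a quick end-to-end check, here is a minimal, self-contained sketch. The sample JSON document, the local SparkSession, and the result variable name are illustrative assumptions, not part of the question:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Hypothetical sample document with the same nested shape as the question.
sample = """{"struct_col_name": {
    "2053771759": {"col1": 1, "col2": "a", "col3": 10, "col4": "x"},
    "2053771760": {"col1": 2, "col2": "b", "col3": 20, "col4": "y"}}}"""
df = spark.read.json(spark.sparkContext.parallelize([sample]))

inner_fields = ["col1", "col2", "col3", "col4"]
ids = df.select("struct_col_name.*").columns  # ['2053771759', '2053771760']

result = df.select(
    F.explode(F.array(*[
        F.struct(
            *[F.col(f"struct_col_name.{i}.{c}") for c in inner_fields],
            F.lit(i).cast("long").alias("id"),
        )
        for i in ids
    ])).alias("struct_col_name")
)

result.printSchema()
# Expected shape (nullability flags omitted):
# root
#  |-- struct_col_name: struct
#  |    |-- col1: long
#  |    |-- col2: string
#  |    |-- col3: long
#  |    |-- col4: string
#  |    |-- id: long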