Explode a struct column that isn't an array in PySpark
I have a struct column (from JSON) with the following schema:
struct_col_name
| |-- 2053771759: struct (nullable = true)
| | |-- col1: long (nullable = true)
| | |-- col2: string (nullable = true)
| | |-- col3: long (nullable = true)
| | |-- col4: string (nullable = true)
| |-- 2053771760: struct (nullable = true)
| | |-- col1: long (nullable = true)
| | |-- col2: string (nullable = true)
| | |-- col3: long (nullable = true)
| | |-- col4: string (nullable = true)
| |-- 2053771761: struct (nullable = true)
| | |-- col1: long (nullable = true)
| | |-- col2: string (nullable = true)
| | |-- col3: long (nullable = true)
| | |-- col4: string (nullable = true)
Since all the inner structs have the same fields, I'd like to transform the schema into something like the following, adding the id (e.g. 2053771759) as a field of each element. That way I can explode the column into rows.
struct_col_name
| |-- element: struct (nullable = true)
| | | |-- col1: long (nullable = true)
| | | |-- col2: string (nullable = true)
| | | |-- col3: long (nullable = true)
| | | |-- col4: string (nullable = true)
| | | |-- id: long (nullable = true)
Any idea how I can do this? Or is there another way to explode such a column?
First, get the list of IDs from the struct column's schema using df.select("struct_col_name.*").columns. Then loop over that list to build one struct per ID, adding the ID as a field alongside the existing ones, and collect all of them into an array column. Finally, explode the array to get one struct per row.
Like this:
from pyspark.sql import functions as F

inner_fields = ["col1", "col2", "col3", "col4"]

# The IDs are the top-level field names of the struct column
ids = df.select("struct_col_name.*").columns

df = df.select(
    F.explode(F.array(*[
        F.struct(
            *[F.col(f"struct_col_name.{i}.{c}") for c in inner_fields],
            # Name the field "id" (not the id value itself) so every
            # array element has the same schema; cast to long to match
            # the desired schema
            F.lit(i).cast("long").alias("id"),
        )
        for i in ids
    ])).alias("struct_col_name")
)
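If it helps to see the restructuring without Spark, the same idea can be sketched in plain Python on the JSON that backs the struct column: a dict keyed by ID becomes a list of records, with each key copied into an `id` field (the sample data below is made up for illustration).

```python
# Plain-Python sketch of the restructuring done by the PySpark answer:
# turn a dict keyed by id into a list of records carrying an "id" field.
# Sample data is hypothetical, mirroring the schema in the question.
data = {
    "2053771759": {"col1": 1, "col2": "a", "col3": 10, "col4": "x"},
    "2053771760": {"col1": 2, "col2": "b", "col3": 20, "col4": "y"},
    "2053771761": {"col1": 3, "col2": "c", "col3": 30, "col4": "z"},
}

# For each (id, fields) pair, copy the inner fields and add the id as a
# long -- this is what F.struct(..., F.lit(i).cast("long").alias("id"))
# does per element before the explode.
rows = [{**fields, "id": int(i)} for i, fields in data.items()]
```

Each entry in `rows` corresponds to one exploded row in the resulting dataframe.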