How to explode a PySpark column having multiple dictionaries in one row

Question

I have a Spark DataFrame in which one column holds multiple dictionaries per row:

id	result
1	{'key1':'a', 'key2':'b'}, {'key1':'d', 'key2':'e'}, {'key1':'m', 'key2':'n'}
2	{'key1':'r', 'key2':'s'}, {'key1':'t', 'key2':'u'}

I need the final output to be:

id	key1	key2
1	a	b
1	d	e
1	m	n
2	r	s
2	t	u

I planned to explode the column twice to get this result.

However, the result column is of type StringType(), so I cannot use the explode function on it:

df.withColumn("output", explode(col("result")))

Error:

AnalysisException: cannot resolve 'explode(result)' due to data type mismatch: input to function explode should be array or map type, not string; 'Project [result#9651, explode(result#9651) AS output#9660] +- Relation[result#9651] json

Please help me solve this.

Answer 1

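First use the from_json function to convert the result column into an array of structs, then use the inline function to expand that array into rows and columns. The string is wrapped in square brackets before parsing so that the comma-separated dictionaries form a single JSON array: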

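from pyspark.sql import functions as F

# DDL schema: each dictionary becomes a struct with two string fields
json_schema = """
    array<struct<key1:string,key2:string>>
"""
# Wrap the string in [ ] so it parses as a JSON array, then expand the structs with inline
df = df.withColumn('result', F.from_json(F.concat(F.lit('['), 'result', F.lit(']')), json_schema)) \
    .selectExpr('id', 'inline(result)')
df.show(truncate=False)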

python

apache-spark

pyspark