How to convert complex JSON to a DataFrame using PySpark?

Question

I need Python code to convert JSON into a DataFrame.

My JSON has the following format:

{"feed":{"catalog":{"schema":["somekey":"somevalue"], "add":{"items":[{["somekey":"somevalue"]}]}}....

I want to load the JSON into multiple DataFrames built from the data under items.

For example:

Input JSON:

{"feed":{"catalog":{"schema":["somekey":"somevalue"], "add":{"items":[{[{"sku":"10002","status":"Enabled"},{"sku":"10003","status":"Enabled"}]}]}}

Expected DataFrame:

sku      status
10002    Enabled
10003    Enabled

Thanks in advance for any help.

Answer 1

You need to explode the items array to get the sku and status columns.

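from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode

# Reuse the active session (or create one) so the snippet also runs outside the pyspark shell
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Sample valid JSON (the JSON in the question is not well-formed)
jsn = '{"feed":{"catalog":{"schema":["somekey","somevalue"], "add":{"items":[{"sku":"10002","status":"Enabled"},{"sku":"10003","status":"Enabled"}]}}}}'

# Read the JSON string into a DataFrame using spark.read.json
df = spark.read.json(sc.parallelize([jsn]))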

#print schema
df.printSchema()
#root
# |-- feed: struct (nullable = true)
# |    |-- catalog: struct (nullable = true)
# |    |    |-- add: struct (nullable = true)
# |    |    |    |-- items: array (nullable = true)
# |    |    |    |    |-- element: struct (containsNull = true)
# |    |    |    |    |    |-- sku: string (nullable = true)
# |    |    |    |    |    |-- status: string (nullable = true)
# |    |    |-- schema: array (nullable = true)
# |    |    |    |-- element: string (containsNull = true)

# Explode the items array into one row per element, then expand the struct fields into columns
(df.withColumn("items", explode(col("feed.catalog.add.items")))
   .select("items.*")
   .show())
#+-----+-------+
#|  sku| status|
#+-----+-------+
#|10002|Enabled|
#|10003|Enabled|
#+-----+-------+
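
To see what explode followed by select("items.*") does to the nested structure, the same flattening can be sketched in plain Python with the standard json module. This only illustrates the semantics on a small in-memory sample and is not a replacement for the Spark code; the helper name explode_items is hypothetical and not part of PySpark.

```python
import json

# Same sample document as in the answer above.
jsn = '{"feed":{"catalog":{"schema":["somekey","somevalue"], "add":{"items":[{"sku":"10002","status":"Enabled"},{"sku":"10003","status":"Enabled"}]}}}}'


def explode_items(document):
    """Return one flat dict per element of feed.catalog.add.items.

    Hypothetical helper, used here only to mirror the Spark logic.
    """
    items = document["feed"]["catalog"]["add"]["items"]
    # explode() yields one row per array element; selecting "items.*"
    # then promotes the struct fields (sku, status) to top-level columns.
    return [{"sku": item["sku"], "status": item["status"]} for item in items]


rows = explode_items(json.loads(jsn))
for row in rows:
    print(row["sku"], row["status"])
```

Each element of the items array becomes one row, which matches the two-row table that show() prints above.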


python

json

pyspark

pyspark-dataframes