How to use to_json and from_json to eliminate nested structfields in pyspark dataframe?

Question

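In theory, the solution I link to does exactly what I need: it creates a new copy of a dataframe while excluding certain nested struct fields. Here is a minimal reproducible example of my issue: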

>>> df.printSchema()
root
 |-- big: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- keep: string (nullable = true)
 |    |    |-- delete: string (nullable = true)

You can instantiate it like this:

from pyspark.sql.types import ArrayType, StringType, StructField, StructType

schema = StructType([StructField("big", ArrayType(StructType([
    StructField("keep", StringType()),
    StructField("delete", StringType())
])))])
df = spark.createDataFrame(spark.sparkContext.emptyRDD(), schema)

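My goal is to convert the dataframe (along with the values in the columns I want to keep) into one that excludes certain nested struct fields, such as delete: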

root
 |-- big: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- keep: string (nullable = true)

According to the solution I linked, which leverages pyspark.sql's to_json and from_json functions, it should be possible with something like this:

from pyspark.sql.functions import col, from_json, to_json

new_schema = StructType([StructField("big", ArrayType(StructType([
    StructField("keep", StringType())
])))])

test_df = df.withColumn("big", to_json(col("big"))).withColumn("big", from_json(col("big"), new_schema))

>>> test_df.printSchema()
root
 |-- big: struct (nullable = true)
 |    |-- big: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- keep: string (nullable = true)

>>> test_df.show()
+----+
| big|
+----+
|null|
+----+

So either I'm not following his directions correctly, or it just doesn't work. Is there a way to do this without a udf?

PySpark to_json documentation; PySpark from_json documentation

Answer 1

It should work; you just need to adjust new_schema so that it describes only the column 'big', not the whole dataframe:

new_schema = ArrayType(StructType([StructField("keep", StringType())]))

test_df = df.withColumn("big", from_json(to_json("big"), new_schema))

Tags: python, dataframe, apache-spark-sql, pyspark, pyspark-sql