Flatten dataframe with nested struct ArrayType using pyspark

I have a dataframe with this schema:

root
  |-- AUTHOR_ID: integer (nullable = false)
  |-- NAME: string (nullable = true)
  |-- Books: array (nullable = false)
  |    |-- element: struct (containsNull = false)
  |    |    |-- BOOK_ID: integer (nullable = false)
  |    |    |-- Chapters: array (nullable = true) 
  |    |    |    |-- element: struct (containsNull = true)
  |    |    |    |    |-- NAME: string (nullable = true)
  |    |    |    |    |-- NUMBER_PAGES: integer (nullable = true)

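For reference, a small DataFrame with this schema can be built like so (a sketch; the SparkSession setup and the sample values are made up, only the schema comes from above):

from pyspark.sql import SparkSession
from pyspark.sql import types as T

spark = SparkSession.builder.getOrCreate()

schema = T.StructType([
    T.StructField("AUTHOR_ID", T.IntegerType(), False),
    T.StructField("NAME", T.StringType(), True),
    T.StructField("Books", T.ArrayType(T.StructType([
        T.StructField("BOOK_ID", T.IntegerType(), False),
        T.StructField("Chapters", T.ArrayType(T.StructType([
            T.StructField("NAME", T.StringType(), True),
            T.StructField("NUMBER_PAGES", T.IntegerType(), True),
        ]), True), True),
    ]), False), False),
])

# one author with one book of two chapters, purely illustrative
data = [(1, "Author A", [(10, [("Intro", 12), ("Basics", 30)])])]
df = spark.createDataFrame(data, schema)
df.printSchema()
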
How can I flatten all the columns into a single level using PySpark?

Using the inline function:

df2 = (df.selectExpr("AUTHOR_ID", "NAME", "inline(Books)")   # explodes Books into BOOK_ID and Chapters columns
       .selectExpr("*", "inline(Chapters)")                  # explodes Chapters into NAME and NUMBER_PAGES columns
       .drop("Chapters")
       )
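
Note that the chapter struct also has a NAME field, so the flattened result contains two columns called NAME (the author's and the chapter's). If that matters, the columns can be renamed along the way; a sketch (the AUTHOR_NAME / CHAPTER_NAME names are just examples):

df2 = (df.selectExpr("AUTHOR_ID", "NAME AS AUTHOR_NAME", "inline(Books)")
       .selectExpr("*", "inline(Chapters)")
       .drop("Chapters")
       .withColumnRenamed("NAME", "CHAPTER_NAME")
       )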

Or with explode:

from pyspark.sql import functions as F

df2 = (df.withColumn("Books", F.explode("Books"))      # one row per book
       .select("*", "Books.*")                         # lift BOOK_ID and Chapters to the top level
       .drop("Books")
       .withColumn("Chapters", F.explode("Chapters"))  # one row per chapter
       .select("*", "Chapters.*")                      # lift NAME and NUMBER_PAGES to the top level
       .drop("Chapters")
       )
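
One caveat, not covered above: explode drops rows whose array is null or empty, so an author with no books, or a book with no chapters, disappears from df2. To keep such rows with null fields instead, explode_outer can be used the same way; a sketch:

from pyspark.sql import functions as F

df2 = (df.withColumn("Books", F.explode_outer("Books"))
       .select("*", "Books.*")
       .drop("Books")
       .withColumn("Chapters", F.explode_outer("Chapters"))
       .select("*", "Chapters.*")
       .drop("Chapters")
       )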