Selecting fields of structs inside an array inside a dataframe

I have a PySpark DataFrame loaded from a 3 GB json.gz file, with the following schema:

root
 |-- _id: long (nullable = false)
 |-- quote: string (nullable = true)
 |-- occurrences: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- articleID: string (nullable = true)
 |    |    |-- title: string (nullable = true)
 |    |    |-- date: string (nullable = true)
 |    |    |-- author: string (nullable = true)
 |    |    |-- source: string (nullable = true)

I need to drop the title, author, and date fields, or create a new DataFrame that doesn't contain them.

So far I've managed to get the following schema:

root
 |-- _id: long (nullable = false)
 |-- quote: string (nullable = true)
 |-- occurrences: array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- articleID: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- source: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)

using

from pyspark.sql.functions import array, col, struct

df.select(df._id, df.quote,
      array(
          struct(
              col("occurrences.articleID"),
              col("occurrences.source")
          )
      ).alias("occurrences"))

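But I need a way to get articleID and source into the same struct for each array element, rather than as two parallel arrays inside a single struct. How can I do that?

OK, I found something that works: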

from pyspark.sql.functions import col, collect_set, explode, struct

clean_df = (
    df.withColumn("exploded", explode("occurrences"))  # one row per occurrence
    .select(
        "_id",
        "quote",
        col("exploded.articleID").alias("articleID"),
        col("exploded.source").alias("source"),
    )
    .withColumn("occs", struct(col("articleID"), col("source")))
    # regroup the trimmed structs into one array per quote
    .groupBy("_id", "quote")
    .agg(collect_set("occs").alias("occurrences"))
)

But if anyone has a better solution I'd love to hear it, since this seems quite roundabout. (As a side note, collect_set seems to work only with Java 8.)