Selecting fields of structs inside an array inside a dataframe
I have a PySpark dataframe, loaded from a 3 GB json.gz file, with the following schema:
root
|-- _id: long (nullable = false)
|-- quote: string (nullable = true)
|-- occurrences: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- articleID: string (nullable = true)
| | |-- title: string (nullable = true)
| | |-- date: string (nullable = true)
| | |-- author: string (nullable = true)
| | |-- source: string (nullable = true)
I need to drop the title, author, and date fields, or create a new DataFrame that does not contain them.
So far, I have managed to get the following schema:
root
|-- _id: long (nullable = false)
|-- quote: string (nullable = true)
|-- occurrences: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- articleID: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- source: array (nullable = true)
| | | |-- element: string (containsNull = true)
using
from pyspark.sql.functions import array, col, struct

df.select(df._id, df.quote,
    array(
        struct(
            col("occurrences.articleID"),
            col("occurrences.source")
        )
    ).alias("occurrences"))
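But I need articleID and source to sit together as string fields of each struct in the array, not as two separate arrays. How can I do that?
OK, I found something that works: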
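from pyspark.sql.functions import col, collect_set, explode, struct

# Explode to one row per occurrence, keep only the wanted fields,
# then regroup them into an array of structs per quote.
clean_df = (
    df.withColumn("exploded", explode("occurrences")).drop("occurrences")
    .select(
        "_id",
        "quote",
        col("exploded.articleID").alias("articleID"),
        col("exploded.source").alias("source"),
    )
    .withColumn("occs", struct(col("articleID"), col("source")))
    .groupBy("_id", "quote").agg(collect_set("occs").alias("occurrences"))
)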
But if anyone has a better solution I would love to hear it, since this seems quite roundabout. (As a side note, collect_set seems to work only with Java 8.)