Flattening two JSON with different data types and joining them
I am trying to flatten two JSON files (let's call them JSON1 and JSON2). Below are samples of what they look like.
In one file a column's data type can be a struct, while in the other it is a string. The end goal is to be able to flatten these files and combine/join/merge the data into a CSV file. How can this be done in Spark using Python?
JSON1:
{
"result": [
{
"promoted_by": "",
"parent": "",
"number": "310346",
"closed_by": {
"link": "https://abcdev.service-now.com/api/now/table/sys_user/e4b0dd",
"value": "e4b0dd"
}
}
]
}
root
|-- result: struct (nullable = true)
| |-- closed_by: struct (nullable = true)
| | |-- link: string (nullable = true)
| | |-- value: string (nullable = true)
| |-- number: string (nullable = true)
| |-- parent: string (nullable = true)
| |-- promoted_by: string (nullable = true)
JSON2:
{
"result": [
{
"promoted_by": "",
"parent": {
"link": "https://abcdev.service-now.com/api/now/table/sys_user/ab00f1",
"value": "ab00f1"
},
"number": "310348",
"closed_by": ""
}
]
}
root
|-- result: struct (nullable = true)
| |-- closed_by: string (nullable = true)
| |-- number: string (nullable = true)
| |-- parent: struct (nullable = true)
| | |-- link: string (nullable = true)
| | |-- value: string (nullable = true)
| |-- promoted_by: string (nullable = true)
Just read the 2 JSON files into the same DataFrame; Spark merges the schemas automatically, so both the closed_by and parent columns come out typed as struct:
df = spark.read.json("dbfs:/mnt/{json1.json,json2.json}", multiLine=True)
df.printSchema()
#root
# |-- result: array (nullable = true)
# | |-- element: struct (containsNull = true)
# | | |-- closed_by: struct (nullable = true)
# | | | |-- link: string (nullable = true)
# | | | |-- value: string (nullable = true)
# | | |-- number: string (nullable = true)
# | | |-- parent: struct (nullable = true)
# | | | |-- link: string (nullable = true)
# | | | |-- value: string (nullable = true)
# | | |-- promoted_by: string (nullable = true)
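If the files don't live under a common directory that a brace glob can cover, spark.read.json also accepts an explicit list of paths; the paths below are just placeholders mirroring the example above:
# Equivalent read passing a list of paths instead of a brace glob
df = spark.read.json(
    ["dbfs:/mnt/json1.json", "dbfs:/mnt/json2.json"],
    multiLine=True,
)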
To flatten it, use explode on the result array, then star expansion to unpack the struct fields:
from pyspark.sql import functions as F
df1 = df.select(F.explode("result").alias("results")).select("results.*") \
.select(
F.col("number"),
F.col("closed_by.value").alias("closed_by_value"),
F.col("closed_by.link").alias("closed_by_link"),
F.col("parent.value").alias("parent_value"),
F.col("parent.link").alias("parent_link"),
F.col("promoted_by")
)
df1.printSchema()
#root
# |-- number: string (nullable = true)
# |-- closed_by_value: string (nullable = true)
# |-- closed_by_link: string (nullable = true)
# |-- parent_value: string (nullable = true)
# |-- parent_link: string (nullable = true)
# |-- promoted_by: string (nullable = true)
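Since the stated end goal is a CSV file, the flattened DataFrame can then be written out directly. A minimal sketch, assuming dbfs:/mnt/flattened_output is a placeholder path you can write to:
# coalesce(1) yields a single part file; drop it for large datasets
df1.coalesce(1).write.csv(
    "dbfs:/mnt/flattened_output",
    header=True,
    mode="overwrite",
)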