json path not working as expected when used with get_json_object
TLDR; the following JSON path does not work for me when used with pyspark.sql.functions.get_json_object:
$.Blocks[?(@.Type=='LINE')].Confidence
Long version...
I want to group over an array within a single row. For example, given the structure below:
root
|--id: string
|--payload: string
The value of payload is a string holding a chunk of JSON that looks like the structure below:
{
"Blocks": [
{
"Type": "LINE",
"Confidence": 90
},
{
"Type": "LINE",
"Confidence": 98
},
{
"Type": "WORD",
"Confidence": 99
},
{
"Type": "PAGE",
"Confidence": 97
},
{
"Type": "PAGE",
"Confidence": 89
},
{
"Type": "WORD",
"Confidence": 99
}
]
}
I want to aggregate all the confidences by type, so that we end up with the following new column...
{
"id": 12345,
"payload": "..."
"confidence": [
{
"Type": "WORD",
"Confidence": [
99,
99
]
},
{
"Type": "PAGE",
"Confidence": [
97,
89
]
},
{
"Type": "LINE",
"Confidence": [
90,
98
]
}
]
}
To do this, I planned to use get_json_object(...) to extract the confidences for each type of block. For example...
get_json_object(col("payload"), "$.Blocks[?(@.Type=='LINE')].Confidence")
But $.Blocks[?(@.Type=='LINE')].Confidence keeps returning null. Why is that?
I verified that the JSON path is valid by testing it against the sample payload JSON above at https://jsonpath.curiousconcept.com/# and got the following result...
[
90,
98
]
If using the path above isn't an option, how would one go about aggregating this?
Below is a full code example. I expected the first .show() to print [90, 98] in the confidence column.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructField, StringType, StructType, IntegerType
from pyspark.sql.functions import get_json_object, col

def main():
    spark = SparkSession.builder.appName('test_session').getOrCreate()
    df = spark.createDataFrame(
        [
            (
                12345,  # id
                """
                {
                  "Blocks": [
                    {
                      "Type": "LINE",
                      "Confidence": 90
                    },
                    {
                      "Type": "LINE",
                      "Confidence": 98
                    },
                    {
                      "Type": "WORD",
                      "Confidence": 99
                    },
                    {
                      "Type": "PAGE",
                      "Confidence": 97
                    },
                    {
                      "Type": "PAGE",
                      "Confidence": 89
                    },
                    {
                      "Type": "WORD",
                      "Confidence": 99
                    }
                  ]
                }
                """  # payload
            )
        ],
        StructType([
            StructField("id", IntegerType(), True),
            StructField("payload", StringType(), True)
        ])
    )

    # this prints out null (why?)
    df.withColumn("confidence", get_json_object(col("payload"), "$.Blocks[?(@.Type=='LINE')].Confidence")).show()

    # this prints out the correct values, [90,98,99,97,89,99]
    df.withColumn("confidence", get_json_object(col("payload"), "$.Blocks[*].Confidence")).show()

if __name__ == "__main__":
    main()
There is no official documentation of how Spark parses JSON paths, but based on its source code, it appears that @ is not supported as the current object. In fact, the syntax it supports is quite limited:
// parse `[*]` and `[123]` subscripts
// parse `.name` or `['name']` child expressions
// child wildcards: `..`, `.*` or `['*']`
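That limitation is easy to see in practice: paths built only from child access and [*]/[N] subscripts are handled, while any filter expression comes back as null. A quick sketch, reusing the df and imports from the question (the results in the comments assume the sample payload above):

# Child access and [*]/[N] subscripts are handled...
df.select(
    get_json_object(col("payload"), "$.Blocks[0].Type").alias("first_type"),      # LINE
    get_json_object(col("payload"), "$.Blocks[*].Confidence").alias("all_conf"),  # [90,98,99,97,89,99]
).show(truncate=False)

# ...while a filter expression like [?(...)] yields null
df.select(
    get_json_object(col("payload"), "$.Blocks[?(@.Type=='LINE')].Confidence").alias("filtered")  # null
).show()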
So, if you are open to another approach, here is one with a predefined schema and functions such as from_json, explode, and collect_list:
from pyspark.sql import functions as F
from pyspark.sql import types as T

# Schema for the relevant parts of the payload JSON
schema = T.StructType([
    T.StructField('Blocks', T.ArrayType(T.StructType([
        T.StructField('Type', T.StringType()),
        T.StructField('Confidence', T.IntegerType())
    ])))
])

(df
    .withColumn('json', F.from_json('payload', schema))  # parse the payload string
    .withColumn('block', F.explode('json.Blocks'))       # one row per block
    .select('id', 'block.*')
    .groupBy('id', 'Type')
    .agg(F.collect_list('Confidence').alias('confidence'))
    .show(10, False)
)
# +-----+----+----------+
# |id |Type|confidence|
# +-----+----+----------+
# |12345|PAGE|[97, 89] |
# |12345|WORD|[99, 99] |
# |12345|LINE|[90, 98] |
# +-----+----+----------+
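And if you need the exact nested shape from the question (a single confidence column holding an array of {Type, Confidence} structs per id), one possible follow-up is to regroup with collect_list over a struct. A sketch building on the df and schema above (note that collect_list gives no ordering guarantee, so the array order may differ from the question's example):

per_type = (df
    .withColumn('json', F.from_json('payload', schema))
    .withColumn('block', F.explode('json.Blocks'))
    .select('id', 'block.*')
    .groupBy('id', 'Type')
    .agg(F.collect_list('Confidence').alias('Confidence'))
)

# Collapse the per-type rows into one array-of-structs column per id
(per_type
    .groupBy('id')
    .agg(F.collect_list(F.struct('Type', 'Confidence')).alias('confidence'))
    .show(10, False)
)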