json path not working as expected when used with get_json_object

TL;DR: the following JSON path does not work for me when used with pyspark.sql.functions.get_json_object:
$.Blocks[?(@.Type=='LINE')].Confidence

Long version...

I want to group the contents of an array within a single row.

For example, given the following schema:

root
 |-- id: string (nullable = true)
 |-- payload: string (nullable = true)

The value of payload is a string containing a chunk of JSON that looks like the following:

{
        "Blocks": [
            {
                "Type": "LINE",
                "Confidence": 90
            },
            {
                "Type": "LINE",
                "Confidence": 98
            },
            {
                "Type": "WORD",
                "Confidence": 99
            },
            {
                "Type": "PAGE",
                "Confidence": 97
            },
            {
                "Type": "PAGE",
                "Confidence": 89
            },
            {
                "Type": "WORD",
                "Confidence": 99
            }
        ]
    }

I want to aggregate all the confidences by type, so that we end up with the following new column...

{
    "id": 12345,
    "payload": "..."
    "confidence": [
        {
            "Type": "WORD",
            "Confidence": [
                99,
                99
            ]
        },
        {
            "Type": "PAGE",
            "Confidence": [
                97,
                89
            ]
        },
        {
            "Type": "LINE",
            "Confidence": [
                90,
                98
            ]
        }
    ]
}

To do this, I planned to use get_json_object(...) to extract the confidences for each type of block.

For example...

get_json_object(col("payload"), "$.Blocks[?(@.Type=='LINE')].Confidence")

But $.Blocks[?(@.Type=='LINE')].Confidence keeps returning null. Why is that?

I verified that the JSON path is valid by testing it against the sample payload JSON above at https://jsonpath.curiousconcept.com/#, and got the following result...

[
   90,
   98
]
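As a plain-Python sanity check (outside both Spark and the online tester), the filter that path expresses is easy to reproduce on the parsed payload; this sketch uses the sample payload above, reformatted to one block per line:

```python
import json

# Sample payload from the question, trimmed to one line per block.
payload = """
{
    "Blocks": [
        {"Type": "LINE", "Confidence": 90},
        {"Type": "LINE", "Confidence": 98},
        {"Type": "WORD", "Confidence": 99},
        {"Type": "PAGE", "Confidence": 97},
        {"Type": "PAGE", "Confidence": 89},
        {"Type": "WORD", "Confidence": 99}
    ]
}
"""

# Equivalent of $.Blocks[?(@.Type=='LINE')].Confidence:
# keep the Confidence of every block whose Type is LINE.
line_conf = [b["Confidence"] for b in json.loads(payload)["Blocks"]
             if b["Type"] == "LINE"]
print(line_conf)  # [90, 98]
```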

If using the path above is not an option, how would one aggregate this?

Below is a complete code example. I would expect the first .show() to print out [90, 98] in the confidence column.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructField, StringType, StructType, IntegerType
from pyspark.sql.functions import get_json_object, col


def main():
    spark = SparkSession.builder.appName('test_session').getOrCreate()
    df = spark.createDataFrame([
        (
            12345,  # id
            """
{
        "Blocks": [
            {
                "Type": "LINE",
                "Confidence": 90
            },
            {
                "Type": "LINE",
                "Confidence": 98
            },
            {
                "Type": "WORD",
                "Confidence": 99
            },
            {
                "Type": "PAGE",
                "Confidence": 97
            },
            {
                "Type": "PAGE",
                "Confidence": 89
            },
            {
                "Type": "WORD",
                "Confidence": 99
            }
        ]
    }

            """  # payload
        )
    ],
        StructType(
            [
                StructField("id", IntegerType(), True),
                StructField("payload", StringType(), True)
            ])
    )
    
    # this prints out null (why?)
    df.withColumn("confidence", get_json_object(col("payload"), "$.Blocks[?(@.Type=='LINE')].Confidence")).show()
    
    # this prints out the correct values, [90,98,99,97,89,99]
    df.withColumn("confidence", get_json_object(col("payload"), "$.Blocks[*].Confidence")).show()


if __name__ == "__main__":
    main()

There is no official documentation of how Spark parses JSON paths, but based on its source code, it seems that @ (the current object) is not supported. In fact, the syntax it supports is quite limited:

// parse `[*]` and `[123]` subscripts
// parse `.name` or `['name']` child expressions
// child wildcards: `..`, `.*` or `['*']`
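To make that subset concrete, here is what each supported form resolves to, written as plain-Python equivalents over a small hypothetical two-block payload (chosen for brevity; it is not the question's full sample):

```python
import json

payload = '{"Blocks": [{"Type": "LINE", "Confidence": 90}, {"Type": "WORD", "Confidence": 99}]}'
doc = json.loads(payload)

# "$.Blocks[0].Type"       -> a [123] subscript, then a .name child
first_type = doc["Blocks"][0]["Type"]

# "$.Blocks[*].Confidence" -> a [*] wildcard subscript, then a .name child
all_conf = [b["Confidence"] for b in doc["Blocks"]]

# "$.Blocks[?(@.Type=='LINE')].Confidence" -> no equivalent: the filter
# predicate ?(@...) falls outside this grammar, hence the null result.
print(first_type, all_conf)
```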

So if you are open to another approach, here is one using a predefined schema and the functions from_json, explode and collect_list:

import pyspark.sql.functions as F
import pyspark.sql.types as T

schema = T.StructType([
    T.StructField('Blocks', T.ArrayType(T.StructType([
        T.StructField('Type', T.StringType()),
        T.StructField('Confidence', T.IntegerType())
    ])))
])

(df
    .withColumn('json', F.from_json('payload', schema))
    .withColumn('block', F.explode('json.Blocks'))
    .select('id', 'block.*')
    .groupBy('id', 'Type')
    .agg(F.collect_list('Confidence').alias('confidence'))
    .show(10, False)
)

# +-----+----+----------+
# |id   |Type|confidence|
# +-----+----+----------+
# |12345|PAGE|[97, 89]  |
# |12345|WORD|[99, 99]  |
# |12345|LINE|[90, 98]  |
# +-----+----+----------+
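The output above has one row per Type; if you want it collapsed back into a single nested confidence column like the one in the question, you could presumably follow with another groupBy('id') plus F.collect_list(F.struct('Type', 'confidence')). The target shape, sketched in plain Python against the sample payload for reference:

```python
import json
from collections import defaultdict

payload = """
{"Blocks": [
    {"Type": "LINE", "Confidence": 90}, {"Type": "LINE", "Confidence": 98},
    {"Type": "WORD", "Confidence": 99}, {"Type": "PAGE", "Confidence": 97},
    {"Type": "PAGE", "Confidence": 89}, {"Type": "WORD", "Confidence": 99}
]}
"""

# Collect the confidences per block type, preserving first-seen order.
grouped = defaultdict(list)
for block in json.loads(payload)["Blocks"]:
    grouped[block["Type"]].append(block["Confidence"])

# Same shape as the desired "confidence" column in the question.
confidence = [{"Type": t, "Confidence": c} for t, c in grouped.items()]
```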