How to read a string value in a JSON array struct?
Here is my code:
df_05_body = spark.sql("""
select
gtin
, principalBody.constituents
from
v_df_04""")
df_05_body.createOrReplaceTempView("v_df_05_body")
df_05_body.printSchema()
Here is the schema:
root
|-- gtin: array (nullable = true)
| |-- element: string (containsNull = true)
|-- constituents: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: struct (containsNull = true)
| | | |-- constituentCategory: struct (nullable = true)
| | | | |-- value: string (nullable = true)
| | | | |-- valueRange: string (nullable = true)
How do I change the principalBody.constituents line in the SQL so that it reads the fields constituentCategory.value and constituentCategory.valueRange?
The column constituents is an array of arrays of structs. If your goal is a flat structure, you need to flatten the nested arrays and then explode them:
df_05_body = spark.sql("""
WITH
v_df_04_exploded AS (
SELECT
gtin,
explode(flatten(principalBody.constituents)) AS constituent
FROM
v_df_04 )
SELECT
gtin,
constituent.constituentCategory.value,
constituent.constituentCategory.valueRange
FROM
v_df_04_exploded
""")
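To see what flatten followed by explode does to this schema, here is a pure-Python sketch with invented sample data (the gtin and category values are hypothetical, not from your dataset):

```python
# Hypothetical sample rows mirroring the schema: constituents is an
# array of arrays of structs (all values invented for illustration).
rows = [
    {
        "gtin": ["00012345678905"],
        "constituents": [
            [
                {"constituentCategory": {"value": "COTTON", "valueRange": None}},
                {"constituentCategory": {"value": "WOOL", "valueRange": "10-20%"}},
            ],
            [
                {"constituentCategory": {"value": "ELASTANE", "valueRange": "1-5%"}},
            ],
        ],
    }
]

def flatten(nested):
    # Like Spark's flatten(): removes one level of array nesting.
    return [item for inner in nested for item in inner]

def explode_constituents(row):
    # Like Spark's explode(): emits one output row per array element.
    for c in flatten(row["constituents"]):
        yield {
            "gtin": row["gtin"],
            "value": c["constituentCategory"]["value"],
            "valueRange": c["constituentCategory"]["valueRange"],
        }

flat = [out for row in rows for out in explode_constituents(row)]
# One flat row per struct across both inner arrays.
```

Each struct in the nested arrays becomes its own row, with value and valueRange as plain columns.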
Or simply use inline after flatten, like this:
df_05_body = spark.sql("""
SELECT
gtin,
inline(flatten(principalBody.constituents))
FROM
    v_df_04
""")
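One caveat with inline: it turns each top-level field of the struct into a column, and here the only top-level field is constituentCategory, so you would still access constituentCategory.value afterwards. A pure-Python sketch of that behavior, with invented sample data:

```python
# Hypothetical flattened array<struct>, as inline() would see it
# after flatten() (values invented for illustration).
flattened = [
    {"constituentCategory": {"value": "COTTON", "valueRange": None}},
    {"constituentCategory": {"value": "WOOL", "valueRange": "10-20%"}},
]

def inline(structs):
    # Like Spark's inline(): one output row per struct, with each
    # top-level struct field as its own column. The only top-level
    # field here is constituentCategory, itself still a struct.
    return [{"constituentCategory": s["constituentCategory"]} for s in structs]

table = inline(flattened)
```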