How to Process Nested struct with Array element in Spark RDD

I am using Spark SQL to process data that contains nested arrays.

{
    "isActive": true,
    "sample": {
        "someitem": {
            "thesearecool": [
                {"neat": "wow"},
                {"neat": "tubular"}
            ]
        },
        "coolcolors": [
            {"color": "red", "hex": "ff0000"},
            {"color": "blue", "hex": "0000ff"}
        ]
    }
}

Schema:

root
     |-- isActive: boolean (nullable = true)
     |-- sample: struct (nullable = true)
     |    |-- coolcolors: array (nullable = true)
     |    |    |-- element: struct (containsNull = true)
     |    |    |    |-- color: string (nullable = true)
     |    |    |    |-- hex: string (nullable = true)
     |    |-- someitem: struct (nullable = true)
     |    |    |-- thesearecool: array (nullable = true)
     |    |    |    |-- element: struct (containsNull = true)
     |    |    |    |    |-- neat: string (nullable = true)
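
For anyone who wants to reproduce this, here is a minimal sketch of one way a DataFrame with this schema could be built. The question does not show how `nested` was created, so everything here is an assumption: Spark 2.2 or later, a local master, and the hypothetical names `NestedSample`, `json`, and `load`.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object NestedSample {
  // The sample document from the question, written on a single line so the
  // default (one JSON object per record) reader can parse it.
  val json: String =
    """{"isActive": true, "sample": {"someitem": {"thesearecool": [{"neat": "wow"}, {"neat": "tubular"}]}, "coolcolors": [{"color": "red", "hex": "ff0000"}, {"color": "blue", "hex": "0000ff"}]}}"""

  // Hypothetical helper: infers the schema from the in-memory JSON string.
  def load(spark: SparkSession): DataFrame = {
    import spark.implicits._ // provides .toDS() on Seq[String]
    spark.read.json(Seq(json).toDS())
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is good enough for experimenting.
    val spark = SparkSession.builder().master("local[*]").appName("nested-sample").getOrCreate()
    val nested = load(spark)
    nested.printSchema()
    spark.stop()
  }
}
```

If the JSON lives in a file and is pretty-printed across several lines as shown above, the reader would likely need `.option("multiLine", true)` as well.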

Code:

val nested1 = nested.withColumn("color_data", explode($"sample.coolcolors"))
  .select("isActive", "color_data.color", "color_data.hex", "sample.someitem.thesearecool.neat")
val nested2 = nested.withColumn("thesearecool_data", explode($"sample.someitem.thesearecool"))
  .select("thesearecool_data.neat")

Sample output:

+--------+-----+------+--------------+
|isActive|color|hex   |neat          |
+--------+-----+------+--------------+
|true    |red  |ff0000|[wow, tubular]|
|true    |blue |0000ff|[wow, tubular]|
+--------+-----+------+--------------+

+-------+
|neat   |
+-------+
|wow    |
|tubular|
+-------+

We need the processed data in a single result.

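Explode twice, then select whichever columns you want. Each explode emits one row per array element, so exploding both arrays gives every combination of a color with a "neat" value.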

df.withColumn("coolcolors", explode($"sample.coolcolors"))
  .withColumn("thesearecool", explode($"sample.someitem.thesearecool"))
  .select("isActive", "coolcolors.color", "coolcolors.hex", "thesearecool.neat").show

This gives:

+--------+-----+------+-------+
|isActive|color|   hex|   neat|
+--------+-----+------+-------+
|    true|  red|ff0000|    wow|
|    true|  red|ff0000|tubular|
|    true| blue|0000ff|    wow|
|    true| blue|0000ff|tubular|
+--------+-----+------+-------+
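
The snippet above leaves out its imports and the session setup, so here is a self-contained sketch of the same double explode. It assumes Spark 2.2 or later and a local master; the names `ExplodeTwice`, `flatten`, and `json` are made up for illustration, and `col(...)` is used in place of `$"..."` only so the helper does not depend on `spark.implicits._`.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

object ExplodeTwice {
  // Sample document from the question, on one line for the default JSON reader.
  val json: String =
    """{"isActive": true, "sample": {"someitem": {"thesearecool": [{"neat": "wow"}, {"neat": "tubular"}]}, "coolcolors": [{"color": "red", "hex": "ff0000"}, {"color": "blue", "hex": "0000ff"}]}}"""

  // Hypothetical helper: explodes both arrays, one generator per withColumn,
  // so each color row is repeated once for every "neat" value.
  def flatten(df: DataFrame): DataFrame =
    df.withColumn("coolcolors", explode(col("sample.coolcolors")))
      .withColumn("thesearecool", explode(col("sample.someitem.thesearecool")))
      .select("isActive", "coolcolors.color", "coolcolors.hex", "thesearecool.neat")

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is good enough for experimenting.
    val spark = SparkSession.builder().master("local[*]").appName("explode-twice").getOrCreate()
    import spark.implicits._ // provides .toDS() on Seq[String]

    val df = spark.read.json(Seq(json).toDS())
    flatten(df).show()
    spark.stop()
  }
}
```

One thing to watch for: `explode` drops a row entirely when its array is null or empty. If those rows should be kept (with nulls in the exploded columns), `explode_outer` from the same `functions` object is the variant to try.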