How to process a nested struct with array elements in Spark RDD
I am processing nested array data with Spark SQL.
{
  "isActive": true,
  "sample": {
    "someitem": {
      "thesearecool": [
        { "neat": "wow" },
        { "neat": "tubular" }
      ]
    },
    "coolcolors": [
      { "color": "red", "hex": "ff0000" },
      { "color": "blue", "hex": "0000ff" }
    ]
  }
}
Schema:
root
 |-- isActive: boolean (nullable = true)
 |-- sample: struct (nullable = true)
 |    |-- coolcolors: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- color: string (nullable = true)
 |    |    |    |-- hex: string (nullable = true)
 |    |-- someitem: struct (nullable = true)
 |    |    |-- thesearecool: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- neat: string (nullable = true)
Code:
val nested1 = nested
  .withColumn("color_data", explode($"sample.coolcolors"))
  .select("isActive", "color_data.color", "color_data.hex", "sample.someitem.thesearecool.neat")
val nested2 = nested
  .withColumn("thesearecool_data", explode($"sample.someitem.thesearecool"))
  .select("thesearecool_data.neat")
Sample output:
+--------+-----+------+--------------+
|isActive|color|hex   |neat          |
+--------+-----+------+--------------+
|true    |red  |ff0000|[wow, tubular]|
|true    |blue |0000ff|[wow, tubular]|
+--------+-----+------+--------------+
+-------+
|neat   |
+-------+
|wow    |
|tubular|
+-------+
We need to get this data into a single result.
Explode twice, once per array, then select whatever fields you want:
df.withColumn("coolcolors", explode($"sample.coolcolors"))
  .withColumn("thesearecool", explode($"sample.someitem.thesearecool"))
  .select("isActive", "coolcolors.color", "coolcolors.hex", "thesearecool.neat").show
This gives:
+--------+-----+------+-------+
|isActive|color|   hex|   neat|
+--------+-----+------+-------+
|    true|  red|ff0000|    wow|
|    true|  red|ff0000|tubular|
|    true| blue|0000ff|    wow|
|    true| blue|0000ff|tubular|
+--------+-----+------+-------+
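For anyone who wants to reproduce this end to end, here is a minimal sketch written as a script for spark-shell or a Scala worksheet. It assumes Spark 2.2 or later (for `spark.read.json` on a `Dataset[String]`); the local master, the app name, and the names `json` and `flat` are my own choices for illustration, not something from the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode

// Local session for experimentation only; master and app name are assumptions.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("nested-explode-sketch")
  .getOrCreate()
import spark.implicits._

// The sample record from the question, collapsed onto one line.
val json =
  """{"isActive":true,"sample":{"someitem":{"thesearecool":[{"neat":"wow"},{"neat":"tubular"}]},"coolcolors":[{"color":"red","hex":"ff0000"},{"color":"blue","hex":"0000ff"}]}}"""

// Schema is inferred from the JSON string.
val df = spark.read.json(Seq(json).toDS())

// Each explode emits one row per array element, so two explodes
// pair every color with every "neat" value.
val flat = df
  .withColumn("coolcolors", explode($"sample.coolcolors"))
  .withColumn("thesearecool", explode($"sample.someitem.thesearecool"))
  .select("isActive", "coolcolors.color", "coolcolors.hex", "thesearecool.neat")

flat.show()
```

Note that the row count is the product of the two array lengths (2 x 2 = 4 here), so this can grow quickly on long arrays. Also, `explode` drops rows whose array is null or empty; if those rows have to be kept, `explode_outer` (Spark 2.2+) should do that instead.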