Uniformize disparate JSON with Apache Spark
Example:
Here is a sample of the JSON data, where we can see JSON objects with different attributes:
{"id": 1, "label": "tube", "length": "50m", "diameter": "5cm"}
{"id": 2, "label": "brick", "width": "10cm", "length": "25cm"}
{"id": 3, "label": "sand", "weight": "25kg"}
Question:
Is it possible to transform this JSON into a structured Dataset in Apache Spark, like this:
+--+-----+------+--------+-----+------+
|id|label|length|diameter|width|weight|
+--+-----+------+--------+-----+------+
|1 |tube |50m   |5cm     |     |      |
|2 |brick|25cm  |        |10cm |      |
|3 |sand |      |        |     |25kg  |
+--+-----+------+--------+-----+------+
No problem. Just read it and let Spark infer the schema:
import spark.implicits._  // required for .toDS; pre-imported in spark-shell

val ds = Seq(
  """{"id": 1, "label": "tube", "length": "50m", "diameter": "5cm"}""",
  """{"id": 2, "label": "brick", "width": "10cm", "length": "25cm"}""",
  """{"id": 3, "label": "sand", "weight": "25kg"}"""
).toDS

spark.read.json(ds).show
// +--------+---+-----+------+------+-----+
// |diameter| id|label|length|weight|width|
// +--------+---+-----+------+------+-----+
// |     5cm|  1| tube|   50m|  null| null|
// |    null|  2|brick|  25cm|  null| 10cm|
// |    null|  3| sand|  null|  25kg| null|
// +--------+---+-----+------+------+-----+
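The same inference applies when the JSON lines live on disk instead of in an in-memory Dataset; a minimal sketch, where materials.json is a hypothetical newline-delimited JSON file:

// "materials.json" is a hypothetical path used for illustration.
// spark.read.json scans the file and infers the schema the same way.
val fromFile = spark.read.json("materials.json")
fromFile.show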
Or provide the expected schema when reading:
import org.apache.spark.sql.types._

// Note: "diameter" is not listed here, so that column is dropped
// from the result; only the fields named in the schema are read.
val fields = Seq("label", "length", "weight", "width")
val schema = StructType(
  StructField("id", LongType) +: fields.map {
    StructField(_, StringType)
  }
)

spark.read.schema(schema).json(ds).show
// +---+-----+------+------+-----+
// | id|label|length|weight|width|
// +---+-----+------+------+-----+
// |  1| tube|   50m|  null| null|
// |  2|brick|  25cm|  null| 10cm|
// |  3| sand|  null|  25kg| null|
// +---+-----+------+------+-----+
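If the exact column order from the question matters (including diameter), one option is to infer the schema and then select the columns explicitly; a minimal sketch building on the inferred read above, not part of the original answer:

// Reorder the inferred columns to match the layout in the question.
spark.read.json(ds)
  .select("id", "label", "length", "diameter", "width", "weight")
  .show
// +---+-----+------+--------+-----+------+
// | id|label|length|diameter|width|weight|
// +---+-----+------+--------+-----+------+
// |  1| tube|   50m|     5cm| null|  null|
// |  2|brick|  25cm|    null| 10cm|  null|
// |  3| sand|  null|    null| null|  25kg|
// +---+-----+------+--------+-----+------+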