Uniformize disparate JSON with Apache Spark

Example:

Here is a sample of the JSON data, where we can see JSON objects with different attributes:

{"id": 1, "label": "tube", "length": "50m", "diameter": "5cm"}
{"id": 2, "label": "brick", "width": "10cm", "length": "25cm"}
{"id": 3, "label": "sand", "weight": "25kg"}

Question:

Is it possible to transform this JSON into a structured Dataset in Apache Spark, like this:

+--+-----+------+--------+-----+------+
|id|label|length|diameter|width|weight|
+--+-----+------+--------+-----+------+
|1 |tube |50m   |5cm     |     |      |
|2 |brick|25cm  |        |10cm |      |
|3 |sand |      |        |     |25kg  |
+--+-----+------+--------+-----+------+

No problem. Just read it and let Spark infer the schema:

// In spark-shell `spark.implicits._` is already in scope;
// a standalone app needs to import it for `.toDS`.
val ds = Seq(
  """{"id": 1, "label": "tube", "length": "50m", "diameter": "5cm"}""", 
  """{"id": 2, "label": "brick", "width": "10cm", "length": "25cm"}""",
  """{"id": 3, "label": "sand", "weight": "25kg"}"""
).toDS

spark.read.json(ds).show
// +--------+---+-----+------+------+-----+
// |diameter| id|label|length|weight|width|
// +--------+---+-----+------+------+-----+
// |     5cm|  1| tube|   50m|  null| null|
// |    null|  2|brick|  25cm|  null| 10cm|
// |    null|  3| sand|  null|  25kg| null|
// +--------+---+-----+------+------+-----+
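To confirm what inference produced, `printSchema` on the same read shows the unified schema. Every value column is typed as a string (so unit suffixes like `50m` survive), `id` is inferred as a long, and the inferred fields come back in alphabetical order:

```scala
// Inspect the schema Spark inferred from the three JSON objects above.
spark.read.json(ds).printSchema()
// root
//  |-- diameter: string (nullable = true)
//  |-- id: long (nullable = true)
//  |-- label: string (nullable = true)
//  |-- length: string (nullable = true)
//  |-- weight: string (nullable = true)
//  |-- width: string (nullable = true)
```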

Or provide the expected schema when reading:

import org.apache.spark.sql.types._

val fields = Seq("label", "length", "weight", "width")

val schema = StructType(
  StructField("id", LongType) +: fields.map {
    StructField(_, StringType)
  }
)

spark.read.schema(schema).json(ds).show
// +---+-----+------+------+-----+
// | id|label|length|weight|width|
// +---+-----+------+------+-----+
// |  1| tube|   50m|  null| null|
// |  2|brick|  25cm|  null| 10cm|
// |  3| sand|  null|  25kg| null|
// +---+-----+------+------+-----+