How to parse JSON having a nested schema?
The schema of my JSON is:
root
 |-- data: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: string (containsNull = true)
The JSON looks like this:
{
"data": [
[
10429183,
"4057F5BE-1933-415E-9AF7-D3CAAC5ED8E6",
10429183,
1454527245,
"386824",
1454527245,
"386824",
null,
"6702002",
"HM193685",
"2006-02-21T21:00:00",
"078XX S VERNON AVE",
"2092",
"NARCOTICS",
"SOLICIT NARCOTICS ON PUBLICWAY",
"STREET",
true,
false,
"0624",
"006",
"6",
"69",
"26",
null,
null,
"2006",
"2015-08-17T15:03:40",
null,
null,
[
null,
null,
null,
null,
null
]
]
]
}
val df2 =
df1
.withColumn("data", explode(array(jsonElements: _*)))
.withColumn("id", $"data" (0)).select("data.*")
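Error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Can only star expand struct data types. Attribute: ArrayBuffer(data);
Do I need to create a DataFrame for each data element?
As I understand it, you are trying to split each JSON element of the array into its own column.
One way to do it is as follows: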
import org.apache.spark.sql._

object JsonTest extends App {

  val jsonStr =
    """
      |{
      |  "data": [
      |    [
      |      10429183,
      |      "4057F5BE-1933-415E-9AF7-D3CAAC5ED8E6",
      |      10429183,
      |      1454527245,
      |      "386824",
      |      1454527245,
      |      "386824",
      |      null,
      |      "6702002",
      |      "HM193685",
      |      "2006-02-21T21:00:00",
      |      "078XX S VERNON AVE",
      |      "2092",
      |      "NARCOTICS",
      |      "SOLICIT NARCOTICS ON PUBLICWAY",
      |      "STREET",
      |      true,
      |      false,
      |      "0624",
      |      "006",
      |      "6",
      |      "69",
      |      "26",
      |      null,
      |      null,
      |      "2006",
      |      "2015-08-17T15:03:40",
      |      null,
      |      null,
      |      [
      |        null,
      |        null,
      |        null,
      |        null,
      |        null
      |      ]
      |    ]
      |  ]
      |}
    """.stripMargin

  private[this] implicit val spark = SparkSession.builder().master("local[*]").getOrCreate()
  spark.sparkContext.setLogLevel("ERROR")

  import org.apache.spark.sql.functions._
  import spark.implicits._

  // Read the JSON string as a single-row DataFrame; the mixed-type inner array is inferred as array<string>
  val df1 = spark.read.json(Seq(jsonStr).toDS)
  println("before explode")
  df1.show(false)
  println(df1.schema)

  println("after explode")
  // import org.apache.spark.sql.functions.schema_of_json
  // val schema = df1.select(schema_of_json($"data")).as[String].first
  // df1.withColumn("jsonData", from_json($"data", schema, Map[String, String]())).show

  // explode turns the outer array into one row per inner array
  val df2 = df1
    .withColumn("data", explode(col("data")))
  println(df2.schema)
  df2.show(false)

  // 35 is larger than the 30 elements in the sample row; out-of-range indices yield null
  // when spark.sql.ansi.enabled is false, which explains the trailing null columns in the output
  val nElements = 35
  // Select each array position as its own column, named data2, data3, ...
  df2.select(Range(0, nElements).map(idx => $"data" (idx) as "data" + (idx + 2)): _*).show(false)
}
Result:
before explode
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|data |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[[10429183, 4057F5BE-1933-415E-9AF7-D3CAAC5ED8E6, 10429183, 1454527245, 386824, 1454527245, 386824,, 6702002, HM193685, 2006-02-21T21:00:00, 078XX S VERNON AVE, 2092, NARCOTICS, SOLICIT NARCOTICS ON PUBLICWAY, STREET, true, false, 0624, 006, 6, 69, 26,,, 2006, 2015-08-17T15:03:40,,, [null,null,null,null,null]]]|
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
StructType(StructField(data,ArrayType(ArrayType(StringType,true),true),true))
after explode
StructType(StructField(data,ArrayType(StringType,true),true))
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|data |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[10429183, 4057F5BE-1933-415E-9AF7-D3CAAC5ED8E6, 10429183, 1454527245, 386824, 1454527245, 386824,, 6702002, HM193685, 2006-02-21T21:00:00, 078XX S VERNON AVE, 2092, NARCOTICS, SOLICIT NARCOTICS ON PUBLICWAY, STREET, true, false, 0624, 006, 6, 69, 26,,, 2006, 2015-08-17T15:03:40,,, [null,null,null,null,null]]|
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
+--------+------------------------------------+--------+----------+------+----------+------+-----+-------+--------+-------------------+------------------+------+---------+------------------------------+------+------+------+------+------+------+------+------+------+------+------+-------------------+------+------+--------------------------+------+------+------+------+------+
|data2 |data3 |data4 |data5 |data6 |data7 |data8 |data9|data10 |data11 |data12 |data13 |data14|data15 |data16 |data17|data18|data19|data20|data21|data22|data23|data24|data25|data26|data27|data28 |data29|data30|data31 |data32|data33|data34|data35|data36|
+--------+------------------------------------+--------+----------+------+----------+------+-----+-------+--------+-------------------+------------------+------+---------+------------------------------+------+------+------+------+------+------+------+------+------+------+------+-------------------+------+------+--------------------------+------+------+------+------+------+
|10429183|4057F5BE-1933-415E-9AF7-D3CAAC5ED8E6|10429183|1454527245|386824|1454527245|386824|null |6702002|HM193685|2006-02-21T21:00:00|078XX S VERNON AVE|2092 |NARCOTICS|SOLICIT NARCOTICS ON PUBLICWAY|STREET|true |false |0624 |006 |6 |69 |26 |null |null |2006 |2015-08-17T15:03:40|null |null |[null,null,null,null,null]|null |null |null |null |null |
+--------+------------------------------------+--------+----------+------+----------+------+-----+-------+--------+-------------------+------------------+------+---------+------------------------------+------+------+------+------+------+------+------+------+------+------+------+-------------------+------+------+--------------------------+------+------+------+------+------+
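The hard-coded nElements = 35 above is wider than the 30 elements in the sample row, which is why the last five columns are null. As a rough sketch (not part of the original answer), the width could instead be derived from the data with size and max. The object name DynamicWidthSketch, the helper spread, and the data0, data1, ... column names are hypothetical, and the sketch assumes at least one non-null inner array.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode, max, size}

object DynamicWidthSketch {

  // Explodes the outer array, then spreads each inner array over as many
  // columns as the longest inner array has elements.
  def spread(df: DataFrame): DataFrame = {
    val exploded = df.withColumn("data", explode(col("data")))
    // Longest inner array across all rows (assumes at least one non-null row)
    val width = exploded.agg(max(size(col("data")))).first().getInt(0)
    val columns = (0 until width).map(i => col("data").getItem(i).as(s"data$i"))
    exploded.select(columns: _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._
    // Shortened version of the sample document from the question
    val json = """{"data": [[10429183, "4057F5BE-1933-415E-9AF7-D3CAAC5ED8E6", 10429183]]}"""
    spread(spark.read.json(Seq(json).toDS)).show(false)
    spark.stop()
  }
}
```

If the inner arrays differ in length, shorter rows would need out-of-range positions to resolve to null, which depends on ANSI mode being disabled.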
You can rename the columns with withColumnRenamed, and drop the columns you do not need.
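A minimal sketch of that renaming and dropping step, assuming a DataFrame that already has the generated data2, data3, ... columns. The target names id and uuid, the choice of data4 as an unwanted column, and the object name RenameAndDropSketch are all hypothetical; the source does not say what each position means.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object RenameAndDropSketch {

  // Hypothetical mapping from generated names to meaningful ones
  val renames: Map[String, String] = Map("data2" -> "id", "data3" -> "uuid")
  // Hypothetical list of columns that are not needed
  val unwanted: Seq[String] = Seq("data4")

  def tidy(df: DataFrame): DataFrame = {
    // Apply each rename in turn, then drop the unwanted columns
    val renamed = renames.foldLeft(df) { case (acc, (from, to)) =>
      acc.withColumnRenamed(from, to)
    }
    renamed.drop(unwanted: _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._
    val df = Seq(("10429183", "4057F5BE-1933-415E-9AF7-D3CAAC5ED8E6", "10429183"))
      .toDF("data2", "data3", "data4")
    tidy(df).show(false)
    spark.stop()
  }
}
```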
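If I understand correctly, you want to explode the outer array into a new column data and then put the first value of that array into a new field id. If that is the case, the following code should help: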
df1.withColumn("data", explode($"data"))
  .withColumn("id", $"data".getItem(0))
  .show()
Output:
+--------------------+--------+
| data| id|
+--------------------+--------+
|[10429183, 4057F5...|10429183|
+--------------------+--------+
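The error in the question comes from select("data.*"): star expansion only works on struct columns, and data is an array, so its elements have to be addressed by index with getItem. Because the inner array is inferred as array of string, id also comes out as a string. The following is a hedged sketch that additionally casts it to long; the object name ExplodeWithIdSketch and the helper withTypedId are hypothetical, and the cast is an addition, not something the original answer does.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

object ExplodeWithIdSketch {

  // One row per inner array, plus the first element as a numeric id column
  def withTypedId(df: DataFrame): DataFrame =
    df.withColumn("data", explode(col("data")))
      .withColumn("id", col("data").getItem(0).cast("long"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._
    // Shortened version of the sample document from the question
    val json = """{"data": [[10429183, "4057F5BE-1933-415E-9AF7-D3CAAC5ED8E6", 10429183]]}"""
    val result = withTypedId(spark.read.json(Seq(json).toDS))
    result.printSchema()
    result.show()
    spark.stop()
  }
}
```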