火花归一化数组的数据帧
spark normalize data frame of arrays
如何规范化主要由嵌套数组组成的 spark 数据框?
case class FooBar(id:String, foo:Seq[String], bar:String, baz: Seq[String])
val f = Seq(FooBar("thinga", Seq("1 "), "1 2 3 ", Seq("2 ")),
FooBar("thinga", Seq("1 2 3 4 "), " 0 0 0 ", Seq("2 3 4 5 ")),
FooBar("thingb", Seq("1 2 "), "1 2 3 4 5 ", Seq("1 2 ")),
FooBar("thingb", Seq("0 ", "0 ", "0 "), "1 2 3 4 5 ", Seq("1 2 3 "))).toDS
f.printSchema
f.show(false)
+------+------------+----------+----------+
| id| foo| bar| baz|
+------+------------+----------+----------+
|thinga| [1 ]| 1 2 3 | [2 ]|
|thinga| [1 2 3 4 ]| 0 0 0 |[2 3 4 5 ]|
|thingb| [1 2 ]|1 2 3 4 5 | [1 2 ]|
|thingb|[0 , 0 , 0 ]|1 2 3 4 5 | [1 2 3 ]|
+------+------------+----------+----------+
scala> f.printSchema
root
|-- id: string (nullable = true)
|-- foo: array (nullable = true)
| |-- element: string (containsNull = true)
|-- bar: string (nullable = true)
|-- baz: array (nullable = true)
| |-- element: string (containsNull = true)
我想要类似 explode 的东西,它将保留 (id, foo, bar, baz) 的架构,但 return 为数组的每个值保留一个单独的记录。最终结果不应再包含数组。
Foo 和 baz 是相关的。他们的顺序不能被扭曲。它们始终具有相同的长度,并且 foo 的第一个值与 baz 的第一个值相关 - 依此类推。也许我应该先将它们组合成一个列/结构?
最终结果应该类似于:
+------+------------+----------+----------+
| id| foo| bar| baz|
+------+------------+----------+----------+
|thinga| 1 | 1 | 2 |
|thinga| 1 | 2 | 2 |
|thinga| 1 | 3 | 2 |
|thinga| 1 | 0 |2 |
|thinga| 2 | 0 |3 |
|thinga| 3 | 0 |4 |
|thinga| 4 | 0 |5 |
|thinga| 1 | 0 |2 |
|thinga| 2 | 0 |3 |
|thinga| 3 | 0 |4 |
|thinga| 4 | 0 |5 |
|thinga| 1 | 0 |2 |
|thinga| 2 | 0 |3 |
|thinga| 3 | 0 |4 |
|thinga| 4 | 0 |5 |
....
|thingb|0 |1 | 1 |
|thingb|0 |2 | 2 |
|thingb|0 |3 | 3 |
|thingb|0 |4 | 1 |
|thingb|0 |5 | 2 |
|thingb|0 |1 | 3 |
|thingb|0 |2 | 1 |
|thingb|0 |3 | 2 |
|thingb|0 |4 | 3 |
|thingb|0 |5 | 1 |
|thingb|0 |1 | 2 |
|thingb|0 |2 | 3 |
|thingb|0 |3 | 1 |
|thingb|0 |4 | 2 |
|thingb|0 |5 | 3 |
+------+------------+----------+----------+
编辑
部分相关问题
-
根据我们的讨论(请查看初始post下的评论),以下数据应该是有效的:
+------+-------+---------+-------+
| id| foo| bar| baz|
+------+-------+---------+-------+
|thinga| 1| 1 2 3| 2|
|thinga|1 2 3 4| 0 0 0|2 3 4 5|
|thingb| 1 2|1 2 3 4 5| 1 2|
|thingb| 0 0 0|1 2 3 4 5| 1 2 3|
+------+-------+---------+-------+
然后初始化代码如下:
case class FooBar(id:String, foo:String, bar:String, baz: String)
val f = Seq(FooBar("thinga", "1", "1 2 3", "2"),
FooBar("thinga", "1 2 3 4", "0 0 0", "2 3 4 5"),
FooBar("thingb", "1 2", "1 2 3 4 5", "1 2"),
FooBar("thingb", "0 0 0", "1 2 3 4 5", "1 2 3")).toDS()
如果是这种情况,那么这段代码应该会产生理想的结果:
f
.withColumn("foo", split($"foo", " "))
.withColumn("baz", split($"baz", " "))
.withColumn("bar", explode(split($"bar", " ")))
.map { case Row(id: String, foo: Seq[String], bar: String, baz: Seq[String]) =>
val c = for ((f, b) <- foo.zip(baz)) yield {
(f, b)
}
(id, bar, c)
}.toDF(cols: _*)
.withColumn("foo+baz", explode($"foo+baz"))
.withColumn("foo", $"foo+baz._1")
.withColumn("baz", $"foo+baz._2")
.drop($"foo+bar")
.select("id", "foo", "bar", "baz")
.show(100)
前两次转换将拆分 space 分隔的列 foo 和 baz。由于列 bar 是字符串,我们需要使用 split 将其转换为数组,然后将其分解。 Map 将 return 一个 (id, bar, c) 的元组,其中 c 是一个元组序列 (foo, bar)。映射后我们得到下一个输出:
+------+---+--------------------+
| id|bar| foo+baz|
+------+---+--------------------+
|thinga| 1| [[1,2]]|
|thinga| 2| [[1,2]]|
|thinga| 3| [[1,2]]|
|thinga| 0|[[1,2], [2,3], [3...|
|thinga| 0|[[1,2], [2,3], [3...|
|thinga| 0|[[1,2], [2,3], [3...|
|thingb| 1| [[1,1], [2,2]]|
|thingb| 2| [[1,1], [2,2]]|
|thingb| 3| [[1,1], [2,2]]|
|thingb| 4| [[1,1], [2,2]]|
|thingb| 5| [[1,1], [2,2]]|
|thingb| 1|[[0,1], [0,2], [0...|
|thingb| 2|[[0,1], [0,2], [0...|
|thingb| 3|[[0,1], [0,2], [0...|
|thingb| 4|[[0,1], [0,2], [0...|
|thingb| 5|[[0,1], [0,2], [0...|
+------+---+--------------------+
接下来我们再用 "foo+baz" 展开一次以提取最终的元组。现在输出如下所示:
+------+---+-------+
| id|bar|foo+baz|
+------+---+-------+
|thinga| 1| [1,2]|
|thinga| 2| [1,2]|
|thinga| 3| [1,2]|
|thinga| 0| [1,2]|
|thinga| 0| [2,3]|
|thinga| 0| [3,4]|
|thinga| 0| [4,5]|
|thinga| 0| [1,2]|
.....
|thingb| 1| [0,2]|
|thingb| 1| [0,3]|
|thingb| 2| [0,1]|
|thingb| 2| [0,2]|
|thingb| 2| [0,3]|
|thingb| 3| [0,1]|
|thingb| 3| [0,2]|
|thingb| 3| [0,3]|
|thingb| 4| [0,1]|
|thingb| 4| [0,2]|
|thingb| 4| [0,3]|
|thingb| 5| [0,1]|
|thingb| 5| [0,2]|
|thingb| 5| [0,3]|
+------+---+-------+
最后,我们填充 foo 和 baz 列 foo+baz._1 和foo+baz._2 分别。这将是最终输出:
+------+---+---+---+
| id|foo|bar|baz|
+------+---+---+---+
|thinga| 1| 1| 2|
|thinga| 1| 2| 2|
|thinga| 1| 3| 2|
|thinga| 1| 0| 2|
|thinga| 2| 0| 3|
|thinga| 3| 0| 4|
|thinga| 4| 0| 5|
|thinga| 1| 0| 2|
|thinga| 2| 0| 3|
|thinga| 3| 0| 4|
|thinga| 4| 0| 5|
|thinga| 1| 0| 2|
|thinga| 2| 0| 3|
|thinga| 3| 0| 4|
|thinga| 4| 0| 5|
|thingb| 1| 1| 1|
|thingb| 2| 1| 2|
|thingb| 1| 2| 1|
|thingb| 2| 2| 2|
|thingb| 1| 3| 1|
|thingb| 2| 3| 2|
|thingb| 1| 4| 1|
|thingb| 2| 4| 2|
|thingb| 1| 5| 1|
|thingb| 2| 5| 2|
|thingb| 0| 1| 1|
|thingb| 0| 1| 2|
|thingb| 0| 1| 3|
|thingb| 0| 2| 1|
|thingb| 0| 2| 2|
|thingb| 0| 2| 3|
|thingb| 0| 3| 1|
|thingb| 0| 3| 2|
|thingb| 0| 3| 3|
|thingb| 0| 4| 1|
|thingb| 0| 4| 2|
|thingb| 0| 4| 3|
|thingb| 0| 5| 1|
|thingb| 0| 5| 2|
|thingb| 0| 5| 3|
+------+---+---+---+
如何规范化主要由嵌套数组组成的 spark 数据框?
case class FooBar(id:String, foo:Seq[String], bar:String, baz: Seq[String])
val f = Seq(FooBar("thinga", Seq("1 "), "1 2 3 ", Seq("2 ")),
FooBar("thinga", Seq("1 2 3 4 "), " 0 0 0 ", Seq("2 3 4 5 ")),
FooBar("thingb", Seq("1 2 "), "1 2 3 4 5 ", Seq("1 2 ")),
FooBar("thingb", Seq("0 ", "0 ", "0 "), "1 2 3 4 5 ", Seq("1 2 3 "))).toDS
f.printSchema
f.show(false)
+------+------------+----------+----------+
| id| foo| bar| baz|
+------+------------+----------+----------+
|thinga| [1 ]| 1 2 3 | [2 ]|
|thinga| [1 2 3 4 ]| 0 0 0 |[2 3 4 5 ]|
|thingb| [1 2 ]|1 2 3 4 5 | [1 2 ]|
|thingb|[0 , 0 , 0 ]|1 2 3 4 5 | [1 2 3 ]|
+------+------------+----------+----------+
scala> f.printSchema
root
|-- id: string (nullable = true)
|-- foo: array (nullable = true)
| |-- element: string (containsNull = true)
|-- bar: string (nullable = true)
|-- baz: array (nullable = true)
| |-- element: string (containsNull = true)
我想要类似 explode 的东西,它将保留 (id, foo, bar, baz) 的架构,但 return 为数组的每个值保留一个单独的记录。最终结果不应再包含数组。
Foo 和 baz 是相关的。他们的顺序不能被扭曲。它们始终具有相同的长度,并且 foo 的第一个值与 baz 的第一个值相关 - 依此类推。也许我应该先将它们组合成一个列/结构?
最终结果应该类似于:
+------+------------+----------+----------+
| id| foo| bar| baz|
+------+------------+----------+----------+
|thinga| 1 | 1 | 2 |
|thinga| 1 | 2 | 2 |
|thinga| 1 | 3 | 2 |
|thinga| 1 | 0 |2 |
|thinga| 2 | 0 |3 |
|thinga| 3 | 0 |4 |
|thinga| 4 | 0 |5 |
|thinga| 1 | 0 |2 |
|thinga| 2 | 0 |3 |
|thinga| 3 | 0 |4 |
|thinga| 4 | 0 |5 |
|thinga| 1 | 0 |2 |
|thinga| 2 | 0 |3 |
|thinga| 3 | 0 |4 |
|thinga| 4 | 0 |5 |
....
|thingb|0 |1 | 1 |
|thingb|0 |2 | 2 |
|thingb|0 |3 | 3 |
|thingb|0 |4 | 1 |
|thingb|0 |5 | 2 |
|thingb|0 |1 | 3 |
|thingb|0 |2 | 1 |
|thingb|0 |3 | 2 |
|thingb|0 |4 | 3 |
|thingb|0 |5 | 1 |
|thingb|0 |1 | 2 |
|thingb|0 |2 | 3 |
|thingb|0 |3 | 1 |
|thingb|0 |4 | 2 |
|thingb|0 |5 | 3 |
+------+------------+----------+----------+
编辑
部分相关问题
-
根据我们的讨论(请查看初始post下的评论),以下数据应该是有效的:
+------+-------+---------+-------+
| id| foo| bar| baz|
+------+-------+---------+-------+
|thinga| 1| 1 2 3| 2|
|thinga|1 2 3 4| 0 0 0|2 3 4 5|
|thingb| 1 2|1 2 3 4 5| 1 2|
|thingb| 0 0 0|1 2 3 4 5| 1 2 3|
+------+-------+---------+-------+
然后初始化代码如下:
case class FooBar(id:String, foo:String, bar:String, baz: String)
val f = Seq(FooBar("thinga", "1", "1 2 3", "2"),
FooBar("thinga", "1 2 3 4", "0 0 0", "2 3 4 5"),
FooBar("thingb", "1 2", "1 2 3 4 5", "1 2"),
FooBar("thingb", "0 0 0", "1 2 3 4 5", "1 2 3")).toDS()
如果是这种情况,那么这段代码应该会产生理想的结果:
f
.withColumn("foo", split($"foo", " "))
.withColumn("baz", split($"baz", " "))
.withColumn("bar", explode(split($"bar", " ")))
.map { case Row(id: String, foo: Seq[String], bar: String, baz: Seq[String]) =>
val c = for ((f, b) <- foo.zip(baz)) yield {
(f, b)
}
(id, bar, c)
}.toDF(cols: _*)
.withColumn("foo+baz", explode($"foo+baz"))
.withColumn("foo", $"foo+baz._1")
.withColumn("baz", $"foo+baz._2")
.drop($"foo+bar")
.select("id", "foo", "bar", "baz")
.show(100)
前两次转换将拆分 space 分隔的列 foo 和 baz。由于列 bar 是字符串,我们需要使用 split 将其转换为数组,然后将其分解。 Map 将 return 一个 (id, bar, c) 的元组,其中 c 是一个元组序列 (foo, bar)。映射后我们得到下一个输出:
+------+---+--------------------+
| id|bar| foo+baz|
+------+---+--------------------+
|thinga| 1| [[1,2]]|
|thinga| 2| [[1,2]]|
|thinga| 3| [[1,2]]|
|thinga| 0|[[1,2], [2,3], [3...|
|thinga| 0|[[1,2], [2,3], [3...|
|thinga| 0|[[1,2], [2,3], [3...|
|thingb| 1| [[1,1], [2,2]]|
|thingb| 2| [[1,1], [2,2]]|
|thingb| 3| [[1,1], [2,2]]|
|thingb| 4| [[1,1], [2,2]]|
|thingb| 5| [[1,1], [2,2]]|
|thingb| 1|[[0,1], [0,2], [0...|
|thingb| 2|[[0,1], [0,2], [0...|
|thingb| 3|[[0,1], [0,2], [0...|
|thingb| 4|[[0,1], [0,2], [0...|
|thingb| 5|[[0,1], [0,2], [0...|
+------+---+--------------------+
接下来我们再用 "foo+baz" 展开一次以提取最终的元组。现在输出如下所示:
+------+---+-------+
| id|bar|foo+baz|
+------+---+-------+
|thinga| 1| [1,2]|
|thinga| 2| [1,2]|
|thinga| 3| [1,2]|
|thinga| 0| [1,2]|
|thinga| 0| [2,3]|
|thinga| 0| [3,4]|
|thinga| 0| [4,5]|
|thinga| 0| [1,2]|
.....
|thingb| 1| [0,2]|
|thingb| 1| [0,3]|
|thingb| 2| [0,1]|
|thingb| 2| [0,2]|
|thingb| 2| [0,3]|
|thingb| 3| [0,1]|
|thingb| 3| [0,2]|
|thingb| 3| [0,3]|
|thingb| 4| [0,1]|
|thingb| 4| [0,2]|
|thingb| 4| [0,3]|
|thingb| 5| [0,1]|
|thingb| 5| [0,2]|
|thingb| 5| [0,3]|
+------+---+-------+
最后,我们填充 foo 和 baz 列 foo+baz._1 和foo+baz._2 分别。这将是最终输出:
+------+---+---+---+
| id|foo|bar|baz|
+------+---+---+---+
|thinga| 1| 1| 2|
|thinga| 1| 2| 2|
|thinga| 1| 3| 2|
|thinga| 1| 0| 2|
|thinga| 2| 0| 3|
|thinga| 3| 0| 4|
|thinga| 4| 0| 5|
|thinga| 1| 0| 2|
|thinga| 2| 0| 3|
|thinga| 3| 0| 4|
|thinga| 4| 0| 5|
|thinga| 1| 0| 2|
|thinga| 2| 0| 3|
|thinga| 3| 0| 4|
|thinga| 4| 0| 5|
|thingb| 1| 1| 1|
|thingb| 2| 1| 2|
|thingb| 1| 2| 1|
|thingb| 2| 2| 2|
|thingb| 1| 3| 1|
|thingb| 2| 3| 2|
|thingb| 1| 4| 1|
|thingb| 2| 4| 2|
|thingb| 1| 5| 1|
|thingb| 2| 5| 2|
|thingb| 0| 1| 1|
|thingb| 0| 1| 2|
|thingb| 0| 1| 3|
|thingb| 0| 2| 1|
|thingb| 0| 2| 2|
|thingb| 0| 2| 3|
|thingb| 0| 3| 1|
|thingb| 0| 3| 2|
|thingb| 0| 3| 3|
|thingb| 0| 4| 1|
|thingb| 0| 4| 2|
|thingb| 0| 4| 3|
|thingb| 0| 5| 1|
|thingb| 0| 5| 2|
|thingb| 0| 5| 3|
+------+---+---+---+