Dataframe explode list columns into multiple rows
I have the dataframe below, where each column contains a list of values of the same size:
+--------------------+--------------------------+--------------------------+--------------------------+
|Country_1           |Country_2                 |Country_3                 |Country_4                 |
+--------------------+--------------------------+--------------------------+--------------------------+
|[1, 2, 3, 4, 5, 6]  |[x1, x2, x3, x4, x5, x6]  |[y1, y2, y3, y4, y5, y6]  |[v1, v2, v3, v4, v5, v6]  |
+--------------------+--------------------------+--------------------------+--------------------------+
I need to turn each list element into its own row for further processing. From the posts I have seen, I should use the explode function to somehow end up with this:
Country_1 Country_2 Country_3 Country_4
1 x1 y1 v1
2 x2 y2 v2
3 x3 y3 v3
4 x4 y4 v4
5 x5 y5 v5
6 x6 y6 v6
I have tried the code below, but without success so far.
data.withColumn("Country_1Country_2", F.arrays_zip("Country_1", "Country_2")) \
    .select("*", F.explode("Country_1Country_2").alias("tCountry_1Country_2")) \
    .select("*", "tCountry_1Country_2.Country_1", col("Country_1Country_2.Country_2")) \
    .show()
# This is not part of the solution, just creation of the data sample
# df = spark.sql("select stack(1, array(1, 2, 3, 4, 5, 6) ,array('x1', 'x2', 'x3', 'x4', 'x5', 'x6') ,array('y1', 'y2', 'y3', 'y4', 'y5', 'y6') ,array('v1', 'v2', 'v3', 'v4', 'v5', 'v6')) as (Country_1, Country_2,Country_3,Country_4)")
df.selectExpr('inline(arrays_zip(*))').show()
+---------+---------+---------+---------+
|Country_1|Country_2|Country_3|Country_4|
+---------+---------+---------+---------+
| 1| x1| y1| v1|
| 2| x2| y2| v2|
| 3| x3| y3| v3|
| 4| x4| y4| v4|
| 5| x5| y5| v5|
| 6| x6| y6| v6|
+---------+---------+---------+---------+
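Conceptually, inline(arrays_zip(...)) is a row-wise zip of the array columns: arrays_zip pairs up the i-th element of every array into a struct, and inline emits one row per struct. A minimal plain-Python sketch of the same reshaping, using hypothetical sample data mirroring the example above:

```python
# Hypothetical sample: each key plays the role of an array column of
# equal length, like the Country_* columns in the Spark dataframe.
data = {
    "Country_1": [1, 2, 3, 4, 5, 6],
    "Country_2": ["x1", "x2", "x3", "x4", "x5", "x6"],
    "Country_3": ["y1", "y2", "y3", "y4", "y5", "y6"],
    "Country_4": ["v1", "v2", "v3", "v4", "v5", "v6"],
}

# zip(*data.values()) groups the i-th element of every column together
# (the arrays_zip step); building one dict per tuple is the inline step.
rows = [dict(zip(data.keys(), values)) for values in zip(*data.values())]

for row in rows:
    print(row)
```

As a side note, in Spark 3.4+ the same transformation can also be written with the Python DataFrame API, since pyspark.sql.functions exposes both arrays_zip and inline.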