How to change the structure of an RDD/Spark DataFrame?
What is the simplest process to go from an RDD/Spark DataFrame like this:
date | Tokyo | New York |
---|---|---|
01/01 | 1 | 2 |
02/01 | 3 | 2 |
03/01 | 4 | 5 |
to the same data laid out as in the table below?
city | date | value |
---|---|---|
Tokyo | 01/01 | 1 |
New York | 01/01 | 2 |
Tokyo | 02/01 | 3 |
New York | 02/01 | 2 |
Tokyo | 03/01 | 4 |
New York | 03/01 | 5 |
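(For reference, a minimal sketch of how this input DataFrame could be built, assuming an active SparkSession bound to the name spark; the variable name df is what the answers below operate on:)
# Illustrative setup only: build the wide input DataFrame
df = spark.createDataFrame(
    [("01/01", 1, 2), ("02/01", 3, 2), ("03/01", 4, 5)],
    ["date", "Tokyo", "New York"],
)
df.show()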
I would solve this problem with pyspark sql, using functions such as create_map and explode, as follows -
from pyspark.sql import functions as func

# Build a map column: {'Tokyo': <Tokyo value>, 'New York': <New York value>}
df1 = df.withColumn('mapCol',
                    func.create_map(func.lit('Tokyo'), df.Tokyo,
                                    func.lit('New York'), df["New York"]))

# explode() turns each map entry into its own (city, value) row
res = (df1.select('*', func.explode(df1.mapCol).alias('city', 'value'))
          .drop("Tokyo", "New York", "mapCol"))
res.show()
Output:
+-----+--------+-----+
| date| city|value|
+-----+--------+-----+
|01/01| Tokyo| 1|
|01/01|New York| 2|
|02/01| Tokyo| 3|
|02/01|New York| 2|
|03/01| Tokyo| 4|
|03/01|New York| 5|
+-----+--------+-----+
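As an aside, on Spark 3.4 or newer the same reshape can be written more directly with DataFrame.unpivot (also exposed as melt); a minimal sketch, assuming the same df as above:
# Assumes Spark >= 3.4: keep 'date' as the id column and melt the two city columns
res = df.unpivot("date", ["Tokyo", "New York"], "city", "value")
res.show()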
There is a simpler solution, using stack.
apache-spark-sql
with t (date, Tokyo, `New York`) as (select stack(3, '01/01',1,2, '02/01',3,2, '03/01',4,5))
-- The solution starts here
select date, stack(2, 'Tokyo', Tokyo, 'New York', `New York`) as (city, value)
from t
date | city | value |
---|---|---|
01/01 | Tokyo | 1 |
01/01 | New York | 2 |
02/01 | Tokyo | 3 |
02/01 | New York | 2 |
03/01 | Tokyo | 4 |
03/01 | New York | 5 |
pyspark
df = spark.sql("select stack(3, '01/01',1,2, '02/01',3,2, '03/01',4,5) as (date, Tokyo, `New York`)")
# The solution starts here
df.selectExpr("date", "stack(2, 'Tokyo', Tokyo, 'New York', `New York`) as (city, value)").show()
+-----+--------+-----+
| date| city|value|
+-----+--------+-----+
|01/01| Tokyo| 1|
|01/01|New York| 2|
|02/01| Tokyo| 3|
|02/01|New York| 2|
|03/01| Tokyo| 4|
|03/01|New York| 5|
+-----+--------+-----+
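For context, stack(n, expr1, ..., exprk) splits its k arguments into n rows of k/n columns, so stack(2, 'Tokyo', Tokyo, 'New York', `New York`) emits two (city, value) rows per input row. A quick standalone illustration (the col0/col1 names are just Spark's defaults):
# Illustrative only: 4 literals become 2 rows of 2 columns
spark.sql("select stack(2, 'Tokyo', 1, 'New York', 2)").show()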