How to change the structure of an RDD/Spark DataFrame?

What is the simplest way to go from an RDD/Spark DataFrame like this:

date   Tokyo  New York
01/01  1      2
02/01  3      2
03/01  4      5

to the same data laid out as in the table below?

city      date   value
Tokyo     01/01  1
New York  01/01  2
Tokyo     02/01  3
New York  02/01  2
Tokyo     03/01  4
New York  03/01  5

I would solve this in PySpark with the SQL functions create_map and explode, as follows.
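First, a minimal sketch to reproduce the input df (the question does not show how it was built, so the SparkSession setup here is an assumption):

# Assumed setup: rebuild the question's sample data as df
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("01/01", 1, 2), ("02/01", 3, 2), ("03/01", 4, 5)],
    ["date", "Tokyo", "New York"],
)

The solution itself: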

from pyspark.sql import functions as func

# Build a map column whose keys are the city names (as literals)
# and whose values are the corresponding wide columns.
df1 = df.withColumn('mapCol',
                    func.create_map(func.lit('Tokyo'), df.Tokyo,
                                    func.lit('New York'), df["New York"]))

# Explode the map into one (city, value) row per key,
# then drop the original wide columns and the helper map column.
res = (df1.select('*', func.explode(df1.mapCol).alias('city', 'value'))
          .drop("Tokyo", "New York", "mapCol"))
res.show()

Output:

+-----+--------+-----+
| date|    city|value|
+-----+--------+-----+
|01/01|   Tokyo|    1|
|01/01|New York|    2|
|02/01|   Tokyo|    3|
|02/01|New York|    2|
|03/01|   Tokyo|    4|
|03/01|New York|    5|
+-----+--------+-----+
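As a side note, on Spark 3.4+ the same reshape is available directly as DataFrame.unpivot (also aliased as melt); a minimal sketch, assuming that Spark version:

# Requires Spark 3.4+: unpivot turns the wide city columns into (city, value) rows
res = df.unpivot("date", ["Tokyo", "New York"], "city", "value")
res.show()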

There is a simpler solution using stack: stack(n, expr1, ..., exprk) separates the k expressions into n rows.

apache-spark-sql

-- Sample data: the CTE just recreates the question's table
with t (date, Tokyo, `New York`) as
    (select stack(3, '01/01',1,2, '02/01',3,2, '03/01',4,5))

-- The solution starts here
select date, stack(2, 'Tokyo', Tokyo, 'New York', `New York`) as (city, value)
from   t
Output:

date   city      value
01/01  Tokyo     1
01/01  New York  2
02/01  Tokyo     3
02/01  New York  2
03/01  Tokyo     4
03/01  New York  5

pyspark

# Sample data: recreate the question's table with stack
df = spark.sql("select stack(3 ,'01/01',1,2 ,'02/01',3,2 ,'03/01',4,5) as (date, Tokyo, `New York`)")

# The solution starts here
df.selectExpr("date", "stack(2, 'Tokyo',Tokyo,'New York',`New York`) as (city,value)").show()

+-----+--------+-----+
| date|    city|value|
+-----+--------+-----+
|01/01|   Tokyo|    1|
|01/01|New York|    2|
|02/01|   Tokyo|    3|
|02/01|New York|    2|
|03/01|   Tokyo|    4|
|03/01|New York|    5|
+-----+--------+-----+
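Note that column names containing a space, such as New York, have to be quoted with backticks inside SQL expressions like stack.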