How to transpose data in pyspark for multiple different columns
I am trying to transpose data in pyspark. I was able to do it with a single column, but for multiple columns I am not sure how to pass the arguments to the explode function.
Input format:
Output format:
Can someone give me a hint with an example or a reference? Thanks in advance.
Transpose using stack, as below (spark >= 2.4) -
Load the test data
val data =
"""
|PersonId | Education1CollegeName | Education1Degree | Education2CollegeName | Education2Degree |Education3CollegeName | Education3Degree
| 1 | xyz | MS | abc | Phd | pqr | BS
| 2 | POR | MS | ABC | Phd | null | null
""".stripMargin
val stringDS1 = data.split(System.lineSeparator())
.map(_.split("\\|").map(_.replaceAll("""^[ \t]+|[ \t]+$""", "")).mkString("|"))
.toSeq.toDS()
val df1 = spark.read
.option("sep", "|")
.option("inferSchema", "true")
.option("header", "true")
.option("nullValue", "null")
.csv(stringDS1)
df1.show(false)
df1.printSchema()
/**
* +--------+---------------------+----------------+---------------------+----------------+---------------------+----------------+
* |PersonId|Education1CollegeName|Education1Degree|Education2CollegeName|Education2Degree|Education3CollegeName|Education3Degree|
* +--------+---------------------+----------------+---------------------+----------------+---------------------+----------------+
* |1 |xyz |MS |abc |Phd |pqr |BS |
* |2 |POR |MS |ABC |Phd |null |null |
* +--------+---------------------+----------------+---------------------+----------------+---------------------+----------------+
*
* root
* |-- PersonId: integer (nullable = true)
* |-- Education1CollegeName: string (nullable = true)
* |-- Education1Degree: string (nullable = true)
* |-- Education2CollegeName: string (nullable = true)
* |-- Education2Degree: string (nullable = true)
* |-- Education3CollegeName: string (nullable = true)
* |-- Education3Degree: string (nullable = true)
*/
Unpivot the table using stack
df1.selectExpr("PersonId",
"stack(3, Education1CollegeName, Education1Degree, Education2CollegeName, Education2Degree, " +
"Education3CollegeName, Education3Degree) as (CollegeName, EducationDegree)")
.where("CollegeName is not null and EducationDegree is not null")
.show(false)
/**
* +--------+-----------+---------------+
* |PersonId|CollegeName|EducationDegree|
* +--------+-----------+---------------+
* |1 |xyz |MS |
* |1 |abc |Phd |
* |1 |pqr |BS |
* |2 |POR |MS |
* |2 |ABC |Phd |
* +--------+-----------+---------------+
*/
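`stack(n, expr1, ..., exprk)` lays its k expressions out row by row into n rows of k / n columns each, which is why the six Education columns above collapse into three (CollegeName, Degree) rows per person. A minimal plain-Python sketch of that reshaping (no Spark needed; the helper name `stack_rows` is just for illustration):

```python
def stack_rows(n, *values):
    """Mimic Spark SQL's stack(n, expr1, ..., exprk): lay the k values
    out row by row into n rows of k // n columns each."""
    width = len(values) // n
    return [tuple(values[i * width:(i + 1) * width]) for i in range(n)]

# The six Education* values of PersonId 1 become three (CollegeName, Degree) rows
print(stack_rows(3, "xyz", "MS", "abc", "Phd", "pqr", "BS"))
# [('xyz', 'MS'), ('abc', 'Phd'), ('pqr', 'BS')]
```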
A clean PySpark version
from pyspark.sql import functions as F
df_a = spark.createDataFrame([(1,'xyz','MS','abc','Phd','pqr','BS'),(2,"POR","MS","ABC","Phd","","")],[
"id","Education1CollegeName","Education1Degree","Education2CollegeName","Education2Degree","Education3CollegeName","Education3Degree"])
+---+---------------------+----------------+---------------------+----------------+---------------------+----------------+
| id|Education1CollegeName|Education1Degree|Education2CollegeName|Education2Degree|Education3CollegeName|Education3Degree|
+---+---------------------+----------------+---------------------+----------------+---------------------+----------------+
| 1| xyz| MS| abc| Phd| pqr| BS|
| 2| POR| MS| ABC| Phd| | |
+---+---------------------+----------------+---------------------+----------------+---------------------+----------------+
Code -
df = df_a.selectExpr("id", "stack(3, Education1CollegeName, Education1Degree,Education2CollegeName, Education2Degree,Education3CollegeName, Education3Degree) as (B, C)")
+---+---+---+
| id| B| C|
+---+---+---+
| 1|xyz| MS|
| 1|abc|Phd|
| 1|pqr| BS|
| 2|POR| MS|
| 2|ABC|Phd|
| 2| | |
+---+---+---+
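Note that with this sample data the last row (`2, '', ''`) survives, because the Education3 values are empty strings rather than nulls, so an `is not null` filter alone will not drop it; in PySpark you would also compare against the empty string, e.g. `df.where("B is not null and B <> '' and C is not null and C <> ''")`. A small plain-Python sketch of that combined filter (rows hard-coded for illustration):

```python
# Unpivoted rows as in the table above; the last row has empty strings, not nulls
rows = [(1, "xyz", "MS"), (1, "abc", "Phd"), (1, "pqr", "BS"),
        (2, "POR", "MS"), (2, "ABC", "Phd"), (2, "", "")]

def has_values(row):
    # Keep only rows where both unpivoted columns are neither null nor empty
    _, b, c = row
    return b not in (None, "") and c not in (None, "")

clean = [r for r in rows if has_values(r)]
# The (2, '', '') row is dropped, leaving 5 rows
```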