How to transpose data in pyspark for multiple different columns
I am trying to transpose data in pyspark. I was able to do it with a single column, but for multiple columns I am not sure how to pass the arguments to the explode function.
Input format:
Output format:
Can someone give me a hint with an example or a reference? Thanks in advance.
Transpose using stack, as below (spark >= 2.4) -
Load the test data
val data =
"""
|PersonId | Education1CollegeName | Education1Degree | Education2CollegeName | Education2Degree |Education3CollegeName | Education3Degree
| 1 | xyz | MS | abc | Phd | pqr | BS
| 2 | POR | MS | ABC | Phd | null | null
""".stripMargin
val stringDS1 = data.split(System.lineSeparator())
.map(_.split("\\|").map(_.replaceAll("""^[ \t]+|[ \t]+$""", "")).mkString("|"))
.toSeq.toDS()
val df1 = spark.read
.option("sep", "|")
.option("inferSchema", "true")
.option("header", "true")
.option("nullValue", "null")
.csv(stringDS1)
df1.show(false)
df1.printSchema()
/**
* +--------+---------------------+----------------+---------------------+----------------+---------------------+----------------+
* |PersonId|Education1CollegeName|Education1Degree|Education2CollegeName|Education2Degree|Education3CollegeName|Education3Degree|
* +--------+---------------------+----------------+---------------------+----------------+---------------------+----------------+
* |1 |xyz |MS |abc |Phd |pqr |BS |
* |2 |POR |MS |ABC |Phd |null |null |
* +--------+---------------------+----------------+---------------------+----------------+---------------------+----------------+
*
* root
* |-- PersonId: integer (nullable = true)
* |-- Education1CollegeName: string (nullable = true)
* |-- Education1Degree: string (nullable = true)
* |-- Education2CollegeName: string (nullable = true)
* |-- Education2Degree: string (nullable = true)
* |-- Education3CollegeName: string (nullable = true)
* |-- Education3Degree: string (nullable = true)
*/
Unpivot the table using stack
df1.selectExpr("PersonId",
"stack(3, Education1CollegeName, Education1Degree, Education2CollegeName, Education2Degree, " +
"Education3CollegeName, Education3Degree) as (CollegeName, EducationDegree)")
.where("CollegeName is not null and EducationDegree is not null")
.show(false)
/**
* +--------+-----------+---------------+
* |PersonId|CollegeName|EducationDegree|
* +--------+-----------+---------------+
* |1 |xyz |MS |
* |1 |abc |Phd |
* |1 |pqr |BS |
* |2 |POR |MS |
* |2 |ABC |Phd |
* +--------+-----------+---------------+
*/
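`stack(n, expr1, ..., exprk)` lays its k expressions out row by row into n rows of k / n columns each, which is why the six Education columns above collapse into three (CollegeName, Degree) rows per person. A minimal plain-Python sketch of that reshaping (no Spark needed; the helper name `stack_rows` is just for illustration):

```python
def stack_rows(n, *values):
    """Mimic Spark SQL's stack(n, expr1, ..., exprk): lay the k values
    out row by row into n rows of k // n columns each."""
    width = len(values) // n
    return [tuple(values[i * width:(i + 1) * width]) for i in range(n)]

# The six Education* values of PersonId 1 become three (CollegeName, Degree) rows
print(stack_rows(3, "xyz", "MS", "abc", "Phd", "pqr", "BS"))
# [('xyz', 'MS'), ('abc', 'Phd'), ('pqr', 'BS')]
```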
A clean PySpark version
from pyspark.sql import functions as F
df_a = spark.createDataFrame([(1,'xyz','MS','abc','Phd','pqr','BS'),(2,"POR","MS","ABC","Phd","","")],[
"id","Education1CollegeName","Education1Degree","Education2CollegeName","Education2Degree","Education3CollegeName","Education3Degree"])
+---+---------------------+----------------+---------------------+----------------+---------------------+----------------+
| id|Education1CollegeName|Education1Degree|Education2CollegeName|Education2Degree|Education3CollegeName|Education3Degree|
+---+---------------------+----------------+---------------------+----------------+---------------------+----------------+
| 1| xyz| MS| abc| Phd| pqr| BS|
| 2| POR| MS| ABC| Phd| | |
+---+---------------------+----------------+---------------------+----------------+---------------------+----------------+
Code -
df = df_a.selectExpr("id", "stack(3, Education1CollegeName, Education1Degree,Education2CollegeName, Education2Degree,Education3CollegeName, Education3Degree) as (B, C)")
+---+---+---+
| id| B| C|
+---+---+---+
| 1|xyz| MS|
| 1|abc|Phd|
| 1|pqr| BS|
| 2|POR| MS|
| 2|ABC|Phd|
| 2| | |
+---+---+---+
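Note that with this sample data the last row (`2, '', ''`) survives, because the Education3 values are empty strings rather than nulls, so an `is not null` filter alone will not drop it; in PySpark you would also compare against the empty string, e.g. `df.where("B is not null and B <> '' and C is not null and C <> ''")`. A small plain-Python sketch of that combined filter (rows hard-coded for illustration):

```python
# Unpivoted rows as in the table above; the last row has empty strings, not nulls
rows = [(1, "xyz", "MS"), (1, "abc", "Phd"), (1, "pqr", "BS"),
        (2, "POR", "MS"), (2, "ABC", "Phd"), (2, "", "")]

def has_values(row):
    # Keep only rows where both unpivoted columns are neither null nor empty
    _, b, c = row
    return b not in (None, "") and c not in (None, "")

clean = [r for r in rows if has_values(r)]
# The (2, '', '') row is dropped, leaving 5 rows
```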