How to unpivot Spark DataFrame without hardcoding column names in Scala?

Suppose you have:

val df = Seq(("Jack", 91, 86), ("Mike", 79, 85), ("Julia", 93, 70)).toDF("Name", "Maths", "Art")

which gives:

+-----+-----+---+
| Name|Maths|Art|
+-----+-----+---+
| Jack|   91| 86|
| Mike|   79| 85|
|Julia|   93| 70|
+-----+-----+---+

Now you want to unpivot it with:

df.select($"Name", expr("stack(2, 'Maths', Maths, 'Art', Art) as (Subject, Score)"))

giving:

+-----+-------+-----+
| Name|Subject|Score|
+-----+-------+-----+
| Jack|  Maths|   91|
| Jack|    Art|   86|
| Mike|  Maths|   79|
| Mike|    Art|   85|
|Julia|  Maths|   93|
|Julia|    Art|   70|
+-----+-------+-----+

So far so good! But what if you don't know the list of column names? What if the list is long, or it can change? How can we avoid hardcoding the column names like that?

Something like this would also be nice:

// hypothetical code
df.select($"Name", unpivot(df.columns.diff(Seq("Name"))) as ("Subject", "Score"))

Why don't we have an API like that?

By using the three-argument form of .mkString (start, separator, end) we can build the stack expression as a string and pass it to expr.

Example:

df.show()
//+-----+-----+---+
//| Name|Maths|Art|
//+-----+-----+---+
//| Jack|   91| 86|
//| Mike|   79| 85|
//|Julia|   93| 70|
//+-----+-----+---+

// keep every column except the id column
val cols = df.columns.filter(_.toLowerCase != "name")

// alias columns for the stack output
val alias_cols = "Subject,Score"

// mkString with (start, separator, end) builds the full stack expression
val stack_exp = cols.map(x => s"""'${x}',${x}""").mkString(s"stack(${cols.size},", ",", s""") as (${alias_cols})""")

df.select($"Name", expr(s"""${stack_exp}""")).show()
//+-----+-------+-----+
//| Name|Subject|Score|
//+-----+-------+-----+
//| Jack|  Maths|   91|
//| Jack|    Art|   86|
//| Mike|  Maths|   79|
//| Mike|    Art|   85|
//|Julia|  Maths|   93|
//|Julia|    Art|   70|
//+-----+-------+-----+
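Since the expression is built with plain Scala string operations, you can inspect it without a SparkSession at all. For the two score columns above, the generated string is:

```scala
val cols = Array("Maths", "Art")
val alias_cols = "Subject,Score"

// mkString(start, separator, end) wraps the joined ('name', name) pairs
// in the stack(...) call and appends the output aliases
val stack_exp = cols.map(x => s"'${x}',${x}")
  .mkString(s"stack(${cols.size},", ",", s") as (${alias_cols})")

println(stack_exp)
// stack(2,'Maths',Maths,'Art',Art) as (Subject,Score)
```

This is exactly the literal expression from the question, so the approach generalizes to any number of columns.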

This also works well:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{array, col, explode, lit, struct}

// Equivalent of pandas' melt: keeps `preserves` as-is and unpivots `toMelt`
// into (column, row) pairs
def melt(preserves: Seq[String], toMelt: Seq[String], column: String = "variable", row: String = "value", df: DataFrame): DataFrame = {
    // one struct per melted column: ('colName', colValue)
    val _vars_and_vals = array((for (c <- toMelt) yield { struct(lit(c).alias(column), col(c).alias(row)) }): _*)
    // explode the array so each struct becomes its own row
    val _tmp = df.withColumn("_vars_and_vals", explode(_vars_and_vals))
    val cols = preserves.map(col _) ++ { for (x <- List(column, row)) yield { col("_vars_and_vals")(x).alias(x) }}
    _tmp.select(cols: _*)
}
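A call equivalent to the stack example above might look like this. Note that `df` is the last parameter with no default, so using named arguments is the clearest way to skip past the defaulted `column` and `row` parameters:

```scala
// derive the columns to melt instead of hardcoding them
val toMelt = df.columns.toSeq.diff(Seq("Name"))

val unpivoted = melt(
  preserves = Seq("Name"),
  toMelt    = toMelt,
  column    = "Subject",
  row       = "Score",
  df        = df)

unpivoted.show()
```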

Source: thanks to @user10938362
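For completeness: if you can use Spark 3.4 or later, a built-in `Dataset.unpivot` (with alias `melt`) was added, so neither helper is needed there. A sketch, assuming Spark 3.4+:

```scala
// Spark 3.4+ only: (id columns, value columns, variable col name, value col name)
df.unpivot(
  Array($"Name"),
  Array($"Maths", $"Art"),
  "Subject",
  "Score"
).show()
```

There is also an overload without the `values` array that unpivots every non-id column, which answers the original question about avoiding a hardcoded column list.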