How to unpivot Spark DataFrame without hardcoding column names in Scala?
Suppose you have:
val df = Seq(("Jack", 91, 86), ("Mike", 79, 85), ("Julia", 93, 70)).toDF("Name", "Maths", "Art")
which gives:
+-----+-----+---+
| Name|Maths|Art|
+-----+-----+---+
| Jack| 91| 86|
| Mike| 79| 85|
|Julia| 93| 70|
+-----+-----+---+
Now you want to unpivot it with:
df.select($"Name", expr("stack(2, 'Maths', Maths, 'Art', Art) as (Subject, Score)"))
which gives:
+-----+-------+-----+
| Name|Subject|Score|
+-----+-------+-----+
| Jack| Maths| 91|
| Jack| Art| 86|
| Mike| Maths| 79|
| Mike| Art| 85|
|Julia| Maths| 93|
|Julia| Art| 70|
+-----+-------+-----+
So far so good! But what if you don't know the list of column names? What if the list is long, or if it can change? How can we avoid stupidly hardcoding the column names like that?
Something like this would also be nice:
// fake code -- a wished-for API
df.select($"Name", unpivot(df.columns.diff(Seq("Name"))) as ("Subject", "Score"))
Why don't we have an API like that?
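(Side note: newer Spark versions do add exactly this. Since Spark 3.4, Dataset has a built-in unpivot, with melt as an alias. A minimal sketch, assuming Spark 3.4+:)

import org.apache.spark.sql.functions.col

df.unpivot(
  ids = Array(col("Name")),                          // identifier columns kept as-is
  values = df.columns.filter(_ != "Name").map(col),  // everything else gets unpivoted
  variableColumnName = "Subject",
  valueColumnName = "Score"
).show()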
By using the three-separator overload of .mkString (prefix, separator, suffix) we can build the stack expression and use it inside expr. Example:
df.show()
//+-----+-----+---+
//| Name|Maths|Art|
//+-----+-----+---+
//| Jack| 91| 86|
//| Mike| 79| 85|
//|Julia| 93| 70|
//+-----+-----+---+
// filter down to the columns to be unpivoted (everything except "Name")
val cols = df.columns.filter(_.toLowerCase != "name")
// aliases for the two output columns
val aliasCols = "Subject,Score"
// mkString with 3 separators: prefix opens stack(n,, separator joins the pairs, suffix closes with the aliases
val stackExpr = cols.map(c => s"'$c',$c").mkString(s"stack(${cols.size},", ",", s") as ($aliasCols)")
df.select($"Name", expr(stackExpr)).show()
//+-----+-------+-----+
//| Name|Subject|Score|
//+-----+-------+-----+
//| Jack| Maths| 91|
//| Jack| Art| 86|
//| Mike| Maths| 79|
//| Mike| Art| 85|
//|Julia| Maths| 93|
//|Julia| Art| 70|
//+-----+-------+-----+
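Printing the built expression shows that the three separators reproduce exactly what was hardcoded in the question:

println(stackExpr)
// stack(2,'Maths',Maths,'Art',Art) as (Subject,Score)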
This works really nicely:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{array, col, explode, lit, struct}

def melt(preserves: Seq[String], toMelt: Seq[String], column: String = "variable", row: String = "value", df: DataFrame): DataFrame = {
  // one ('columnName', columnValue) struct per column to melt, collected into an array
  val _vars_and_vals = array(toMelt.map(c => struct(lit(c).alias(column), col(c).alias(row))): _*)
  // explode the array so each struct becomes its own row
  val _tmp = df.withColumn("_vars_and_vals", explode(_vars_and_vals))
  // keep the preserved columns and pull the two fields back out of the struct
  val cols = preserves.map(col(_)) ++ List(column, row).map(x => col("_vars_and_vals")(x).alias(x))
  _tmp.select(cols: _*)
}
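A usage sketch against the question's DataFrame; deriving toMelt from df.columns keeps it free of hardcoded names (note that df has no default value, so it is passed by name):

melt(
  preserves = Seq("Name"),
  toMelt = df.columns.filter(_ != "Name").toSeq,  // Maths, Art
  column = "Subject",
  row = "Score",
  df = df
).show()
// +-----+-------+-----+
// | Name|Subject|Score|
// +-----+-------+-----+
// | Jack|  Maths|   91|
// | Jack|    Art|   86|
// | Mike|  Maths|   79|
// | Mike|    Art|   85|
// |Julia|  Maths|   93|
// |Julia|    Art|   70|
// +-----+-------+-----+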
Source: thanks to @user10938362