Select a column based on another column's value in a Spark DataFrame using Scala

Question

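I have a DataFrame with 5 columns: sourceId, score_1, score_3, score_4 and score_7. The sourceId column can take the values [1, 3, 4, 7]. I want to convert it into another DataFrame with the columns sourceId and score, where the score depends on the value of the sourceId column.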

sourceId	score_1	score_3	score_4	score_7
1	0.3	0.7	0.45	0.21
4	0.15	0.66	0.73	0.47
7	0.34	0.41	0.78	0.16
3	0.77	0.1	0.93	0.67

So if sourceId = 1, we select the value of score_1 for that record; if sourceId = 3, we select the value of score_3; and so on.

The result would be:

sourceId	score
1	0.3
4	0.73
7	0.16
3	0.1

What is the best way to do this in Spark?

Answer 1

Chain multiple when expressions over the values of the id column:

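import org.apache.spark.sql.functions.{col, lit, when}

// The sourceId values that have a matching score_<id> column
val ids = Seq(1, 3, 4, 7)

// Fold the ids into one nested when/otherwise expression;
// rows whose sourceId matches none of the ids get null
val scoreCol = ids.foldLeft(lit(null)) { case (acc, id) =>
  when(col("sourceId") === id, col(s"score_$id")).otherwise(acc)
}

val df2 = df.withColumn("score", scoreCol)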
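To make the fold concrete: for ids 1, 3, 4, 7 it produces an expression equivalent to when(sourceId = 7, score_7).otherwise(when(sourceId = 4, score_4).otherwise(...)), ending in null. The following is a minimal end-to-end sketch of that approach, assuming a local SparkSession and the sample rows from the question. The object name WhenChainExample and the helpers sampleDf and selectScore are hypothetical names introduced only for this illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, when}

// Hypothetical wrapper object, not part of the original answer.
object WhenChainExample {

  // Builds the sample data from the question (illustrative helper).
  def sampleDf(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      (1, 0.3, 0.7, 0.45, 0.21),
      (4, 0.15, 0.66, 0.73, 0.47),
      (7, 0.34, 0.41, 0.78, 0.16),
      (3, 0.77, 0.1, 0.93, 0.67)
    ).toDF("sourceId", "score_1", "score_3", "score_4", "score_7")
  }

  // Picks score_<id> for each row by folding the ids into nested when/otherwise calls.
  // Rows whose sourceId is not in `ids` end up with a null score.
  def selectScore(df: DataFrame, ids: Seq[Int]): DataFrame = {
    val scoreCol = ids.foldLeft(lit(null)) { case (acc, id) =>
      when(col("sourceId") === id, col(s"score_$id")).otherwise(acc)
    }
    // Keep only the two columns the question asks for.
    df.select(col("sourceId"), scoreCol.as("score"))
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; a real job would configure this differently.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("when-chain-example")
      .getOrCreate()

    selectScore(sampleDf(spark), Seq(1, 3, 4, 7)).show()

    spark.stop()
  }
}
```

Running main should reproduce the sourceId/score table from the question. Using select instead of withColumn drops the score_* columns, which matches the two-column result the question asks for.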

Alternatively, build a map expression from the score_* columns and use it to look up the score value:

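import org.apache.spark.sql.functions.{col, lit, map}

// Map from the column-name suffix ("1", "3", ...) to that column's value
val scoreMap = map(
  df.columns
    .filter(_.startsWith("score_"))
    .flatMap(c => Seq(lit(c.split("_")(1)), col(c))): _*
)

// The map keys are strings, so cast sourceId explicitly before the lookup
val df2 = df.withColumn("score", scoreMap(col("sourceId").cast("string")))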
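The map-based lookup can be sketched end to end in the same way. This is an illustration under assumptions, not a definitive implementation: it assumes a local SparkSession and the sample rows from the question, and the names ScoreMapExample, sampleDf and withScoreFromMap are hypothetical. It also casts sourceId to a string explicitly, since the map keys are derived from column names.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, map}

// Hypothetical wrapper object, not part of the original answer.
object ScoreMapExample {

  // Builds the sample data from the question (illustrative helper).
  def sampleDf(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      (1, 0.3, 0.7, 0.45, 0.21),
      (4, 0.15, 0.66, 0.73, 0.47),
      (7, 0.34, 0.41, 0.78, 0.16),
      (3, 0.77, 0.1, 0.93, 0.67)
    ).toDF("sourceId", "score_1", "score_3", "score_4", "score_7")
  }

  // Builds a per-row map {"1" -> score_1, "3" -> score_3, ...} from every
  // score_* column and looks up the entry keyed by the row's sourceId.
  def withScoreFromMap(df: DataFrame): DataFrame = {
    val scoreMap = map(
      df.columns
        .filter(_.startsWith("score_"))
        .flatMap(c => Seq(lit(c.stripPrefix("score_")), col(c))): _*
    )
    // Keys are strings, so the lookup key is cast to string as well.
    df.select(
      col("sourceId"),
      scoreMap(col("sourceId").cast("string")).as("score")
    )
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("score-map-example")
      .getOrCreate()

    withScoreFromMap(sampleDf(spark)).show()

    spark.stop()
  }
}
```

Compared with the when chain, this variant does not need a hard-coded list of ids: any column whose name starts with score_ is picked up automatically.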

scala

dataframe

apache-spark

apache-spark-sql