how to access the column index for spark dataframe in scala for calculation

I am new to Scala programming. I have worked extensively in R, but in Scala I am finding it hard to loop over specific columns and perform calculations on their values.

Let me explain with an example:

After joining 2 DataFrames I arrive at a final DataFrame, and now I need to perform a calculation on it.

The calculation above refers to the columns, so after performing it we should get the Spark DataFrame shown below.

How can I reference column indexes in a for loop in Scala to compute new column values in a Spark DataFrame?

Here is one solution:

Input Data:
+---+---+---+---+---+---+---+---+---+
|a1 |b1 |c1 |d1 |e1 |a2 |b2 |c2 |d2 |
+---+---+---+---+---+---+---+---+---+
|24 |74 |74 |21 |66 |65 |100|27 |19 |
+---+---+---+---+---+---+---+---+---+

Zip the columns so that unmatched columns are dropped:

val oneCols = data.schema.filter(_.name.contains("1")).map(_.name).sorted
val twoCols = data.schema.filter(_.name.contains("2")).map(_.name).sorted
val cols = oneCols.zip(twoCols)

//cols: Seq[(String, String)] = List((a1,a2), (b1,b2), (c1,c2), (d1,d2))
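The pairing step can be checked in plain Scala without a Spark session; the column names below just mirror the example schema:

```scala
object ZipDemo extends App {
  // Column names from the example schema; "e1" has no "...2" partner.
  val names = Seq("a1", "b1", "c1", "d1", "e1", "a2", "b2", "c2", "d2")

  // Same filter/sort/zip steps as above, applied to plain strings.
  val oneCols = names.filter(_.contains("1")).sorted // a1, b1, c1, d1, e1
  val twoCols = names.filter(_.contains("2")).sorted // a2, b2, c2, d2

  // zip truncates to the shorter list, silently dropping the unmatched "e1".
  val cols = oneCols.zip(twoCols)
  println(cols) // List((a1,a2), (b1,b2), (c1,c2), (d1,d2))
}
```

Note that `zip` drops `e1` silently; if a missing partner column should be an error instead, compare the two lengths before zipping.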

Add the columns dynamically with foldLeft:

import org.apache.spark.sql.functions._

// Each step adds one Diff column: (colN2 - colN1) / colN2
val result = cols.foldLeft(data) { case (df, (one, two)) =>
  df.withColumn(s"Diff_$one", (col(two) - col(one)) / col(two))
}
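foldLeft threads the DataFrame through each column pair, and every step returns a new DataFrame with one extra column. The same accumulation pattern can be sketched in plain Scala with a Map standing in for the single-row DataFrame (the values are the ones from the example row):

```scala
object FoldDemo extends App {
  // The example row as a Map standing in for the one-row DataFrame.
  val row = Map("a1" -> 24.0, "b1" -> 74.0, "c1" -> 74.0, "d1" -> 21.0,
                "a2" -> 65.0, "b2" -> 100.0, "c2" -> 27.0, "d2" -> 19.0)

  val pairs = Seq(("a1", "a2"), ("b1", "b2"), ("c1", "c2"), ("d1", "d2"))

  // Each step returns a new Map with one extra "Diff_" entry, just as
  // each withColumn call returns a new DataFrame with one extra column.
  val withDiffs = pairs.foldLeft(row) { case (acc, (one, two)) =>
    acc + (s"Diff_$one" -> (acc(two) - acc(one)) / acc(two))
  }

  println(withDiffs("Diff_a1")) // (65 - 24) / 65 ≈ 0.6308
}
```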

The result looks like this:

result.show(false)  

+---+---+---+---+---+---+---+---+---+------------------+-------+-------------------+--------------------+
|a1 |b1 |c1 |d1 |e1 |a2 |b2 |c2 |d2 |Diff_a1           |Diff_b1|Diff_c1            |Diff_d1             |
+---+---+---+---+---+---+---+---+---+------------------+-------+-------------------+--------------------+
|24 |74 |74 |21 |66 |65 |100|27 |19 |0.6307692307692307|0.26   |-1.7407407407407407|-0.10526315789473684|
+---+---+---+---+---+---+---+---+---+------------------+-------+-------------------+--------------------+
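The Diff values can be sanity-checked by hand, since each is just (colN2 - colN1) / colN2 evaluated on the example row:

```scala
object DiffCheck extends App {
  // (two, one) operand pairs and the expected Diff values from the table above.
  val checks = Seq(
    ((65.0, 24.0), 0.6307692307692307),   // Diff_a1
    ((100.0, 74.0), 0.26),                 // Diff_b1
    ((27.0, 74.0), -1.7407407407407407),   // Diff_c1
    ((19.0, 21.0), -0.10526315789473684)   // Diff_d1
  )
  checks.foreach { case ((two, one), expected) =>
    assert(math.abs((two - one) / two - expected) < 1e-12)
  }
  println("all Diff values match")
}
```

One caveat worth noting: if any `...2` column can contain 0 or null, the division will produce null or infinity, so guarding with `when(col(two) =!= 0, ...)` may be needed in practice.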