Spark dataframe aggregation with reference to another array column

Question

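I am facing a performance-related issue when aggregating an array of doubles by looking up an index array. Here is what I mean. The original dataframe looks like this: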

original Dataframe 

| id | prop1        | values                  |
|----|--------------|-------------------------|
|  1 | [2,5,1,3]    |   [ 0.1, 0.5, 0.7, 0.8] |
|  2 | [2,1]        |   [ 0.2, 0.3 ]          |
|  1 | [1,5]        |   [ 0.4, 0.3 ]          |
|  2 | [3,2]        |   [ 0.0, 0.1 ]          |

Column 2, prop1, is an int array whose values lie in the range 1 to 5, in no particular order, and some numbers may be missing from the array.

The prop1 int array acts as an index for the values double array. I mean that row 1, once exploded, looks like this:

| id | prop1 | values |
|----|-------|--------|
|  1 | 2     |   0.1  |
|  1 | 5     |   0.5  |
|  1 | 1     |   0.7  |
|  1 | 3     |   0.8  |

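Now to the actual problem:

I need to aggregate the values of the double array by looking at the index array and the id column.

So the result should be: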

| id | prop1          | values                   | 
|----|----------------|--------------------------| 
|  1 | [2,5,1,3]      |   [ 0.1, 0.8, 1.1, 0.8 ] | 
|  2 | [2,1,3]        |   [ 0.3, 0.3, 0.0 ]      | 
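
To make the intended semantics concrete, here is a minimal plain-Scala sketch (no Spark involved) of the aggregation described above: pair each index with its value, then sum the values that share the same id and index. The names `IndexAggSketch`, `rows`, and `aggregateByIndex` are illustrative assumptions and do not come from the original code.

```scala
object IndexAggSketch {
  // Hypothetical in-memory stand-in for the dataframe rows: (id, prop1, values)
  val rows: Seq[(Int, Seq[Int], Seq[Double])] = Seq(
    (1, Seq(2, 5, 1, 3), Seq(0.1, 0.5, 0.7, 0.8)),
    (2, Seq(2, 1),       Seq(0.2, 0.3)),
    (1, Seq(1, 5),       Seq(0.4, 0.3)),
    (2, Seq(3, 2),       Seq(0.0, 0.1))
  )

  // For each id, sum the values that share the same prop1 index
  def aggregateByIndex(
      rows: Seq[(Int, Seq[Int], Seq[Double])]
  ): Map[Int, Map[Int, Double]] =
    rows
      // "explode": one (id, index, value) triple per array position
      .flatMap { case (id, idx, vals) => idx.zip(vals).map { case (i, v) => (id, i, v) } }
      .groupBy(_._1)
      .map { case (id, triples) =>
        id -> triples.groupBy(_._2).map { case (i, ts) => i -> ts.map(_._3).sum }
      }

  def main(args: Array[String]): Unit =
    aggregateByIndex(rows).foreach(println)
}
```

The sums are held in a map keyed by index, so the element order of the expected arrays is not reproduced here; only the index-to-sum pairing is.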


Below is the code I am using to extract the values by index and pivot them right before merging them back into an array:

// Dummy dataframe holding the index sequence 1 to 5; the upper bound is dynamic and can grow to 300k
val df = (1 to 5).toDF("prop1")

// Join the original dataframe (assumed already exploded) on the prop1 column;
// joining on the column name keeps a single, unambiguous prop1 column
val stgDf = originalDf.join(df, Seq("prop1"), "inner")

// Pivot the values by index
val pivotDf = stgDf.groupBy("id")
             .pivot("prop1").agg(first("values"))

// Now aggregate the pivoted values by id (the grouping column itself is skipped)
val expr = pivotDf.columns.filter(_ != "id").map(sum(_))
val aggDf = pivotDf.groupBy("id").agg(expr.head, expr.tail: _*)

// Then group back into an array by id

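I used a solution that explodes prop1 and values, and it does work for a few rows. In the real problem, however, the arrays in both columns can each hold more than 500k values, and the number of rows per id can exceed 30 million.

It would be great if someone could take a look and help with this. The application is built in Scala using Spark 2.4.

Thanks in advance.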

Answer 1

This is for Spark v3.x, not v2.4. It is too hard on v2.4, so upgrade.

Some serious data wrangling!

There may be a better way, but this one is scalable. It may need a lot of partitions.

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.Column
import org.apache.spark.sql.Row // needed for the Row(...) literals below

val arrayStructureData = Seq(
  Row(1, List(2, 5, 1, 3), List(0.1, 0.5, 0.7, 0.8)),
  Row(2, List(2, 1), List(0.2, 0.3)),
  Row(1, List(1, 5), List(0.4, 0.3)),
  Row(2, List(3, 2), List(0.0, 0.1))
)
// Just a single StructType for the Row
val arrayStructureSchema = new StructType()
    .add("id",IntegerType)
    .add("prop1", ArrayType(IntegerType))
    .add("values", ArrayType(DoubleType))
val df = spark.createDataFrame(spark.sparkContext.parallelize(arrayStructureData),arrayStructureSchema)
df.printSchema()
df.show()

val df2 = df.withColumn(
  "jCols",
  zip_with(
    col("prop1"),
    col("values"),
    // Should really be a struct, but an array is used (this casts prop1 to double).
    // The zip_with Scala function is not available in v2.4!
    (left: Column, right: Column) => array(left, right)
  )
).drop("prop1", "values")

df2.show(false)
df2.printSchema()

val df3 = df2.groupBy("id").agg(collect_list("jCols").as("jCols"))
df3.printSchema()
df3.show(false)

val df4 = df3.select($"id",flatten($"jCols").as("jCols"))
df4.show(false)
df4.printSchema()

val df5 = df4.withColumn("ExjCols", explode($"jCols")).drop("jCols")
df5.show(false)
df5.printSchema()

val df6 = df5.select(col("id"),col("ExjCols")(0).as("prop1"),col("ExjCols")(1).as("values"))
df6.show(false)
df6.printSchema()

val df7 = df6.groupBy("id", "prop1").sum("values").toDF("id","prop1","values") 
df7.show(false)
df7.printSchema()

val df8 = df7.withColumn("combined", array($"prop1", $"values"))
df8.show(false)
df8.printSchema()

val df9 = df8.groupBy("id").agg(collect_list("combined").as("propN"))
df9.show(false)
df9.printSchema()

val res = df9
  .withColumn("prop1", expr("transform(propN, x -> x[0])"))
  .withColumn("values", expr("transform(propN, x -> x[1])"))
  .drop("propN")
res.show(false)

returns:

+---+--------------------+-------------------------------+
|id |prop1               |values                         |
+---+--------------------+-------------------------------+
|1  |[2.0, 5.0, 1.0, 3.0]|[0.1, 0.8, 1.1, 0.8]           |
|2  |[2.0, 1.0, 3.0]     |[0.30000000000000004, 0.3, 0.0]|
+---+--------------------+-------------------------------+

Not sure why the 0.3000... precision shows up, but it does. I also corrected the example, which had some errors.
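
On the 0.30000000000000004: this is ordinary IEEE 754 double arithmetic rather than anything Spark-specific, since 0.2 and 0.1 have no exact binary representation. A minimal sketch in plain Scala follows; the object name `FloatSumSketch` and the helper `roundTo` are hypothetical names used only for illustration.

```scala
object FloatSumSketch {
  // Rounds a double to the given number of decimal places via BigDecimal
  def roundTo(x: Double, places: Int): Double =
    BigDecimal(x).setScale(places, BigDecimal.RoundingMode.HALF_UP).toDouble

  def main(args: Array[String]): Unit = {
    val raw = 0.2 + 0.1
    println(raw)              // prints 0.30000000000000004
    println(roundTo(raw, 10)) // prints 0.3
  }
}
```

If the extra digits matter in the dataframe itself, wrapping the summed column in Spark's `round(column, scale)` function should have a similar effect.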

I can only assume SO is less popular these days, since it took a while to get an answer.
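
For anyone who cannot upgrade, the steps above can probably be condensed into a form that also runs on 2.4. This is a sketch under assumptions, not a tested production solution: it relies on the SQL function `zip_with` being callable through `expr` (to my knowledge the SQL higher-order functions were added in 2.4, while the Scala wrapper used above came with 3.0), and the names `ArrayAggSketch` and `aggregateByIndex` are made up for illustration. Using `named_struct` instead of `array` also keeps prop1 as an int rather than casting it to double.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

object ArrayAggSketch {
  // Sums `values` per (id, prop1 index) and rebuilds the two arrays per id.
  // Assumes columns: id (int), prop1 (array<int>), values (array<double>).
  def aggregateByIndex(df: DataFrame): DataFrame = {
    // Pair index and value as a struct through a SQL expression, one row per pair
    val pairs = df.select(
      col("id"),
      explode(
        expr("zip_with(prop1, values, (k, v) -> named_struct('prop1', k, 'value', v))")
      ).as("kv")
    )

    // Sum the values that share the same id and index
    val summed = pairs
      .groupBy(col("id"), col("kv.prop1").as("prop1"))
      .agg(sum(col("kv.value")).as("value"))

    // Collect the pairs back into two aligned arrays per id
    summed
      .groupBy("id")
      .agg(collect_list(struct(col("prop1"), col("value"))).as("kv"))
      .select(col("id"), col("kv.prop1").as("prop1"), col("kv.value").as("values"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("array-agg-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      (1, Seq(2, 5, 1, 3), Seq(0.1, 0.5, 0.7, 0.8)),
      (2, Seq(2, 1),       Seq(0.2, 0.3)),
      (1, Seq(1, 5),       Seq(0.4, 0.3)),
      (2, Seq(3, 2),       Seq(0.0, 0.1))
    ).toDF("id", "prop1", "values")

    aggregateByIndex(df).show(false)
    spark.stop()
  }
}
```

Because `collect_list` gives no ordering guarantee, the element order inside the arrays may differ between runs; prop1 and values stay aligned with each other because both are read from the same collected structs. It still explodes the arrays, so the partitioning caveat above applies here as well.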


arrays

scala

bigdata

apache-spark