tidyr::spread() 与普通 Scala 和 Spark（行到列）

Question

我的原型（用 R 编写，使用 Scala dplyr and tidyr) is hitting a wall in terms of computational complexity - even on my powerfull working station. Therefore, I want to port the code to Spark 包。

我查找了所有 transformations, actions, functions (SparkSQL) and column operations（还有 SparkSQL）并找到了所有等价函数，除了 tidyr::spread() 函数的函数，在 R 中可用。

df %>% tidyr::spread(key = COL_KEY , value = COL_VAL) 基本上将键值对分布在多个列中。例如。 table

COL_KEY | COL_VAL
-----------------
A       | 1
B       | 1
A       | 2

将由

转换为

A       | B
------------
1       | 0
0       | 1
2       | 1

如果没有 "out-of-the-box"- 可用的解决方案：你能指出正确的方向吗？也许是用户定义的函数？

我可以自由选择哪个 Spark（和 Scala）版本（因此我会选择最新的，2.0.0）。

谢谢！

Answer 1

开箱即用但需要随机播放：

df
  // A dummy unique key to perform grouping
  .withColumn("_id", monotonically_increasing_id)
  .groupBy("_id")
  .pivot("COL_KEY")
  .agg(first("COL_VAL"))
  .drop("_id")

// +----+----+
// |   A|   B|
// +----+----+
// |   1|null|
// |null|   1|
// |   2|null|
// +----+----+

您可以选择在其后添加 .na.fill(0)。

手动不随机播放：

//  Find distinct keys
val keys = df.select($"COL_KEY").as[String].distinct.collect.sorted

// Create column expressions for each key
val exprs =  keys.map(key => 
  when($"COL_KEY" === key, $"COL_VAL").otherwise(lit(0)).alias(key)
)

df.select(exprs: _*)

// +---+---+
// |  A|  B|
// +---+---+
// |  1|  0|
// |  0|  1|
// |  2|  0|
// +---+---+

tidyr::spread() 与普通 Scala 和 Spark（行到列）

tidyr::spread() with plain Scala and Spark (rows to columns)

scala

transformation

apache-spark