How can I get output as below in Spark Scala?
I have data as below:
A | B | C | D |
---|---|---|---|
1 | A | Day | D1 |
1 | A | Tim | 1am |
1 | A | Tim | 3am |
and I need to create output like this:
A | B | Day | Tim1 | Tim2 |
---|---|---|---|---|
1 | A | D1 | 1am | 3am |
Can you help with how to do this in Spark Scala?
You can first add row numbers to the duplicates, then pivot.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{expr, first, max, row_number}

// Number each row within its (A, B, C) group, ordered by D, and record the group size
val w1 = Window.partitionBy("A", "B", "C").orderBy("D")
val w2 = Window.partitionBy("A", "B", "C")
val df1 = df0.withColumn("row_num", row_number().over(w1)).withColumn("max_num", max("row_num").over(w2))
df1.show(false)
//+---+---+---+---+-------+-------+
//|A |B |C |D |row_num|max_num|
//+---+---+---+---+-------+-------+
//|1 |A |Tim|1am|1 |2 |
//|1 |A |Tim|3am|2 |2 |
//|1 |A |Day|D1 |1 |1 |
//+---+---+---+---+-------+-------+
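// Suffix C with its row number only in groups that have duplicates, so Tim becomes Tim1 and Tim2 while Day is unchanged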
val df2 = df1.withColumn("C", expr("if(max_num != 1, concat(C, row_num), C)"))
df2.show(false)
//+---+---+----+---+-------+-------+
//|A |B |C |D |row_num|max_num|
//+---+---+----+---+-------+-------+
//|1 |A |Tim1|1am|1 |2 |
//|1 |A |Tim2|3am|2 |2 |
//|1 |A |Day |D1 |1 |1 |
//+---+---+----+---+-------+-------+
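// Pivot the adjusted C values into columns; first("D") picks the single D value per cell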
val df3 = df2.groupBy("A", "B").pivot("C").agg(first("D"))
df3.show(false)
//+---+---+---+----+----+
//|A |B |Day|Tim1|Tim2|
//+---+---+---+----+----+
//|1 |A |D1 |1am |3am |
//+---+---+---+----+----+
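For reference, here is a minimal end-to-end sketch of the same approach that you can paste into spark-shell. Building df0 from the sample data with toDF is an assumption for illustration (it relies on spark.implicits._ from the active SparkSession, named spark in spark-shell):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{expr, first, max, row_number}
import spark.implicits._ // assumes an active SparkSession named spark, as in spark-shell

// Rebuild the sample input from the question
val df0 = Seq(
  ("1", "A", "Day", "D1"),
  ("1", "A", "Tim", "1am"),
  ("1", "A", "Tim", "3am")
).toDF("A", "B", "C", "D")

val w1 = Window.partitionBy("A", "B", "C").orderBy("D")
val w2 = Window.partitionBy("A", "B", "C")

val result = df0
  .withColumn("row_num", row_number().over(w1)) // position within the (A, B, C) group
  .withColumn("max_num", max("row_num").over(w2)) // size of the group
  .withColumn("C", expr("if(max_num != 1, concat(C, row_num), C)")) // Tim -> Tim1, Tim2
  .groupBy("A", "B").pivot("C").agg(first("D"))

result.show(false)
//+---+---+---+----+----+
//|A  |B  |Day|Tim1|Tim2|
//+---+---+---+----+----+
//|1  |A  |D1 |1am |3am |
//+---+---+---+----+----+

If the distinct values of C are known up front, passing them explicitly, e.g. .pivot("C", Seq("Day", "Tim1", "Tim2")), saves Spark an extra pass over the data to discover them.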