Spark 2.2 dataframe [scala]
OrderNo  Status1    Status2      Status3
123      Completed  Pending      Pending
456      Rejected   Completed    Completed
789      Pending    In Progress  Completed
Above is the input dataset, and below is the expected output. The thing to note here is that a status should be counted by the orders in which it occurs, not by the total number of times it appears. Can this be done with Spark DataFrames in Scala? Thanks in advance for your help.
Pending      2
Rejected     1
Completed    3
In Progress  2
You can try the code below. It counts the number of distinct OrderNo values for each status. Hope it helps.
import org.apache.spark.sql.functions._  // array, explode, size, collect_set
import spark.implicits._                 // enables toDF and the $"..." column syntax (assumes a SparkSession named `spark`)

// Build the input dataset
val rawDF = Seq(
  ("123", "Completed", "Pending", "Pending"),
  ("456", "Rejected", "Completed", "Completed"),
  ("789", "Pending", "In Progress", "Completed")
).toDF("OrderNo", "Status1", "Status2", "Status3")

// Gather the three status columns into an array, explode to one row per (OrderNo, Status),
// then count the distinct OrderNo values for each status
val newDF = rawDF.withColumn("All_Status", array($"Status1", $"Status2", $"Status3"))
  .withColumn("Status", explode($"All_Status"))
  .groupBy("Status").agg(size(collect_set($"OrderNo")).as("DistOrderCnt"))
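The table below was produced by displaying newDF. This is just a minimal usage sketch; the orderBy is an assumption added here so the row order is deterministic, since show() alone does not guarantee any ordering.

// Print the aggregated counts; orderBy only fixes the row order for display
newDF.orderBy("Status").show()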
Here is the result. (Note: In Progress appears only once in the test data.)
+-----------+------------+
| Status|DistOrderCnt|
+-----------+------------+
| Completed| 3|
|In Progress| 1|
| Pending| 2|
| Rejected| 1|
+-----------+------------+
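As an alternative sketch (assuming the same rawDF as above), countDistinct from org.apache.spark.sql.functions gives the same per-status count as size(collect_set(...)):

// Equivalent aggregation: one row per (OrderNo, Status), then distinct OrderNo count per status
val altDF = rawDF
  .withColumn("Status", explode(array($"Status1", $"Status2", $"Status3")))
  .groupBy("Status")
  .agg(countDistinct($"OrderNo").as("DistOrderCnt"))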