Spark 2.2 dataframe [scala]

OrderNo    Status1      Status2      Status3
123        Completed    Pending      Pending
456        Rejected     Completed    Completed
789        Pending      In Progress  Completed

The table above is the input dataset, and below is the expected output. The key point is that each status should be counted by the orders in which it appears (distinct OrderNo), not by how many times the status occurs. Can this be done with Spark DataFrames in Scala? Thanks in advance for your help.

Pending     2
Rejected    1
Completed   3
In Progress 2

You can try the code below. It counts the number of distinct OrderNo values for every status. Hope this helps.

// Assumes a SparkSession named `spark` is already in scope
import spark.implicits._
import org.apache.spark.sql.functions._

val rawDF = Seq(
  ("123", "Completed", "Pending", "Pending"),
  ("456", "Rejected", "Completed", "Completed"),
  ("789", "Pending", "In Progress", "Completed")
).toDF("OrderNo", "Status1", "Status2", "Status3")

// Gather the three status columns into an array, explode it into one row
// per (OrderNo, Status) pair, then count distinct orders for each status.
val newDF = rawDF.withColumn("All_Status", array($"Status1", $"Status2", $"Status3"))
  .withColumn("Status", explode($"All_Status"))
  .groupBy("Status").agg(size(collect_set($"OrderNo")).as("DistOrderCnt"))

Here is the result. (Note: "In Progress" appears in only one order in the test data.)

+-----------+------------+
|     Status|DistOrderCnt|
+-----------+------------+
|  Completed|           3|
|In Progress|           1|
|    Pending|           2|
|   Rejected|           1|
+-----------+------------+
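As a variation, the `size(collect_set(...))` aggregation can be replaced with Spark's built-in `countDistinct`, which expresses the intent more directly. This is a sketch assuming the same `rawDF` and imports as above:

```scala
// Equivalent per-status count using countDistinct instead of
// size(collect_set(...)); assumes rawDF and the imports from above.
val altDF = rawDF
  .select($"OrderNo", explode(array($"Status1", $"Status2", $"Status3")).as("Status"))
  .groupBy("Status")
  .agg(countDistinct($"OrderNo").as("DistOrderCnt"))
```

Both versions deduplicate order numbers per status before counting, so an order that repeats a status (like order 123 with "Pending" twice) contributes only once.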