Spark DataFrame - Remove Rows with logic

I need help with the following case:

I have the following DataFrame:

I need to remove the rows where CNTA_TIPODOCUMENTOS and CNTA_NRODOCUMENTO are duplicated, keeping only the row with the latest CNTA_FECHA_FORMULARIO. For example, for CNTA_NRODOCUMENTO 35468731 I should get this row:

|                  1|         35468731| 2012-08-25 00:00:...|              MARIA| 

Do you have any ideas on how to do this? Thanks.

One approach is to use the window function row_number, partitioning by the two key columns and ordering by the date in descending order, then keeping only the first row of each partition:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import spark.implicits._  // spark is the SparkSession (already in scope in spark-shell)

// Toy DataFrame with the same structure as the one in the question
val df = Seq(
  (1, 80025709, "2010-07-19 00:00:00", "JUAN"),
  (1, 35468731, "2010-07-28 00:00:00", "PEDRO"),
  (1, 51714038, "2010-08-02 00:00:00", "ALEX"),
  (1, 35468731, "2011-09-28 00:00:00", "KAREN"),
  (1, 35468731, "2012-08-25 00:00:00", "MARIA")
).toDF("c1", "c2", "date", "name")

// Number the rows within each (c1, c2) partition, newest date first,
// keep only the first row of each partition, then drop the helper column
df.withColumn(
    "rownum",
    row_number().over(Window.partitionBy($"c1", $"c2").orderBy($"date".desc))
  ).
  where($"rownum" === 1).
  select($"c1", $"c2", $"date", $"name").
  show

// +---+--------+-------------------+-----+
// | c1|      c2|               date| name|
// +---+--------+-------------------+-----+
// |  1|51714038|2010-08-02 00:00:00| ALEX|
// |  1|80025709|2010-07-19 00:00:00| JUAN|
// |  1|35468731|2012-08-25 00:00:00|MARIA|
// +---+--------+-------------------+-----+
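If the window is not needed for anything else, an equivalent aggregation also works: take the max of a struct whose first field is the date, then unpack it. This is only a sketch against the same toy df and column names used above:

// Equivalent approach: max over a struct that puts the date first,
// so the struct comparison picks the most recent row per (c1, c2) group
df.
  groupBy($"c1", $"c2").
  agg(max(struct($"date", $"name")).as("latest")).
  select($"c1", $"c2", $"latest.date", $"latest.name").
  show

// Returns the same three rows as above (the output order may differ)

This works here because the dates are strings in yyyy-MM-dd HH:mm:ss format, so lexicographic ordering matches chronological ordering; with a proper timestamp column the comparison is chronological anyway.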