Spark DataFrame - Remove Rows with logic
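I need help with the following case:
I have a DataFrame, and I need to remove the rows that are duplicated on CNTA_TIPODOCUMENTOS and CNTA_NRODOCUMENTO, keeping only the row with the latest CNTA_FECHA_FORMULARIO. For example, for CNTA_NRODOCUMENTO 35468731 I should get this row:
| 1| 35468731| 2012-08-25 00:00:...| MARIA|
Do you have any ideas on how to do this?
Thanks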
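One approach is to use the Window function row_number: partition the rows by the key columns, order each partition by date in descending order, and keep only the first row of each partition: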
// Run in spark-shell, where spark.implicits._ (needed for toDF and $"...") is already imported.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

val df = Seq(
  (1, 80025709, "2010-07-19 00:00:00", "JUAN"),
  (1, 35468731, "2010-07-28 00:00:00", "PEDRO"),
  (1, 51714038, "2010-08-02 00:00:00", "ALEX"),
  (1, 35468731, "2011-09-28 00:00:00", "KAREN"),
  (1, 35468731, "2012-08-25 00:00:00", "MARIA")
).toDF("c1", "c2", "date", "name")

// Number the rows within each (c1, c2) group, newest date first.
df.withColumn(
    "rownum",
    row_number.over(Window.partitionBy($"c1", $"c2").orderBy($"date".desc))
  ).
  where($"rownum" === 1).                  // keep only the newest row per group
  select($"c1", $"c2", $"date", $"name").  // drop the helper column
  show
// +---+--------+-------------------+-----+
// | c1|      c2|               date| name|
// +---+--------+-------------------+-----+
// |  1|51714038|2010-08-02 00:00:00| ALEX|
// |  1|80025709|2010-07-19 00:00:00| JUAN|
// |  1|35468731|2012-08-25 00:00:00|MARIA|
// +---+--------+-------------------+-----+
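As a possible alternative to the window function (this is a different technique, not the one used above), the same result can usually be obtained with a plain aggregation: group by the key columns and take the maximum of a struct whose first field is the date. The sketch below assumes the same sample column names (c1, c2, date, name); the object name LatestPerKey and the function latestPerKey are made up for illustration. It also assumes the date strings all share the yyyy-MM-dd HH:mm:ss format, so that string comparison matches chronological order.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, max, struct}

// Hypothetical helper object, not part of Spark.
object LatestPerKey {

  // For each (c1, c2) group, keep the row with the greatest date.
  // Structs compare field by field, so max(struct(date, name)) picks the
  // latest date and carries the matching name along with it.
  def latestPerKey(df: DataFrame): DataFrame =
    df.groupBy(col("c1"), col("c2"))
      .agg(max(struct(col("date"), col("name"))).as("latest"))
      .select(col("c1"), col("c2"), col("latest.date"), col("latest.name"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("latest-per-key")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      (1, 80025709, "2010-07-19 00:00:00", "JUAN"),
      (1, 35468731, "2010-07-28 00:00:00", "PEDRO"),
      (1, 51714038, "2010-08-02 00:00:00", "ALEX"),
      (1, 35468731, "2011-09-28 00:00:00", "KAREN"),
      (1, 35468731, "2012-08-25 00:00:00", "MARIA")
    ).toDF("c1", "c2", "date", "name")

    // One row per (c1, c2); row order in the output is not guaranteed.
    latestPerKey(df).show()

    spark.stop()
  }
}
```

One difference worth noting: if two rows in a group share the same latest date, row_number picks one of them arbitrarily, while the struct comparison falls back to the next field (here, name). Every additional column you want to keep has to be added to the struct.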