Spark - how to check if dates are not consecutive

I have a dataframe with the columns start_time (timestamp) and ev_date (integer). I want to figure out whether the dates in these two columns are consecutive; I only need to compare dates within each column, checking whether they are consecutive. I was thinking of using the lag function, but I'm not sure how to implement it. How can I find the missing dates? Thanks a lot.

Input:

ev_date  | start_time
---------|------------------------------
20220301 | 2022-02-28 10:09:21.356782917
20220301 | 2022-02-28 03:09:21.756982919
20220302 | 2022-03-01 03:09:21.756982919
20220303 | 2022-03-02 03:09:21.756982919
20220305 | 2022-03-02 03:09:21.756982919  -- ev_date is wrong here: 20220304 is missing
20220306 | 2022-03-06 03:09:21.756982919  -- start_time is wrong: it jumps from 03-02 to 03-06


You can add a new column containing the difference, then filter for rows where the difference is greater than 1 day. Something like this (Python, but Scala would be similar; not sure which language you need from the tags):

from pyspark.sql.functions import col, datediff, lag, to_date
from pyspark.sql.window import Window

# ev_date is an integer like 20220301, so turn it into a real date first;
# "some other column" is a placeholder for whatever key you partition by
w = Window.partitionBy("some other column").orderBy("ev_date")
df1 = (df
    .withColumn("ev_dt", to_date(col("ev_date").cast("string"), "yyyyMMdd"))
    .withColumn("diff", datediff(col("ev_dt"), lag(col("ev_dt"), 1).over(w))))
df1.filter(col("diff") > 1).show()
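
Since the question asks about gaps in start_time as well, the same lag-over-window pattern applies to the timestamp column once it is truncated to a date. Below is a minimal, self-contained sketch that rebuilds the sample data from the question (fractional seconds shortened for brevity) and flags rows where either column jumps by more than one day. The single-constant partition lit(1) and the names w and flagged are illustrative assumptions, not part of the original answer; with real data you would partition by an actual key column.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, datediff, lag, lit, to_date
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

data = [
    (20220301, "2022-02-28 10:09:21"),
    (20220301, "2022-02-28 03:09:21"),
    (20220302, "2022-03-01 03:09:21"),
    (20220303, "2022-03-02 03:09:21"),
    (20220305, "2022-03-02 03:09:21"),  # ev_date gap: 20220304 is missing
    (20220306, "2022-03-06 03:09:21"),  # start_time gap: 03-02 jumps to 03-06
]
df = (spark.createDataFrame(data, ["ev_date", "start_time"])
      .withColumn("start_time", col("start_time").cast("timestamp")))

# A single constant partition for this toy data; use a real key in practice
w = Window.partitionBy(lit(1)).orderBy("ev_date", "start_time")

flagged = (df
    .withColumn("ev_dt", to_date(col("ev_date").cast("string"), "yyyyMMdd"))
    .withColumn("ev_gap", datediff(col("ev_dt"), lag(col("ev_dt"), 1).over(w)))
    .withColumn("st_gap", datediff(to_date(col("start_time")),
                                   lag(to_date(col("start_time")), 1).over(w))))

# Keep only the rows where either column skips more than one day
flagged.filter((col("ev_gap") > 1) | (col("st_gap") > 1)).show(truncate=False)

With the sample above, this should return exactly the two annotated rows: 20220305 (ev_gap = 2) and 20220306 (st_gap = 4).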