Spark - how to check if dates are not consecutive
I have a dataframe with start_time (timestamp) and ev_date (integer) columns. I want to figure out whether the dates in these two columns are consecutive; I only need to compare dates within each column. I'm thinking of using the lag function, but I'm not sure how to implement it. How can I find the missing dates? Many thanks.
Input:
ev_date | start_time
---------------------
20220301| 2022-02-28 10:09:21.356782917
20220301| 2022-02-28 03:09:21.756982919
20220302| 2022-03-01 03:09:21.756982919
20220303| 2022-03-02 03:09:21.756982919
20220305| 2022-03-02 03:09:21.756982919 --ev_date is not right here as 20220304 is missing
20220306| 2022-03-06 03:09:21.756982919 --start_time is not right as it jumped from 03-02 to 03-06
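For reference, here is a minimal sketch that recreates this input for testing (assuming an active SparkSession named spark; note that Spark timestamps carry microsecond precision, so the nanosecond digits above are truncated):
from pyspark.sql.functions import col

# sample rows copied from the question
data = [
    (20220301, "2022-02-28 10:09:21.356782917"),
    (20220301, "2022-02-28 03:09:21.756982919"),
    (20220302, "2022-03-01 03:09:21.756982919"),
    (20220303, "2022-03-02 03:09:21.756982919"),
    (20220305, "2022-03-02 03:09:21.756982919"),
    (20220306, "2022-03-06 03:09:21.756982919"),
]
df = spark.createDataFrame(data, ["ev_date", "start_time"]) \
    .withColumn("start_time", col("start_time").cast("timestamp"))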
You can add a new column containing the difference, then filter for rows where the difference is greater than 1 day. Something like this (Python, but it's similar in Scala - not sure which language you need from the tags):
from pyspark.sql.functions import col, to_date, lag, datediff
from pyspark.sql.window import Window

# ev_date is an integer like 20220301, so convert it to a real date first;
# replace "some other column" with whatever column your data is grouped by
w = Window.partitionBy("some other column").orderBy("ev_date")
df1 = df.withColumn("ev_dt", to_date(col("ev_date").cast("string"), "yyyyMMdd")) \
        .withColumn("diff", datediff(col("ev_dt"), lag(col("ev_dt"), 1).over(w)))
df1.filter(col("diff") > 1)
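The question asks about gaps in start_time as well; the same lag pattern applies there once the timestamp is truncated to a date. A sketch, keeping the answer's placeholder partition column:
from pyspark.sql.functions import col, to_date, lag, datediff
from pyspark.sql.window import Window

# truncate the timestamp to a date, then compare each row to the previous one
w = Window.partitionBy("some other column").orderBy("start_time")
df2 = df.withColumn("st_dt", to_date(col("start_time"))) \
        .withColumn("st_diff", datediff(col("st_dt"), lag(col("st_dt"), 1).over(w)))
df2.filter(col("st_diff") > 1).show()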