Partitioning and selecting a value in Snowflake based on distance from date

I have a 1-billion-row dataset that keeps growing and contains more and more repeated records about each customer.

ID   creation_date          report_date             status
001  2021-01-20T00:22:06Z   2021-02-02T00:22:06Z    ACTIVE
002  2021-01-30T00:22:06Z   2021-02-02T00:22:06Z    ACTIVE
003  2021-02-01T00:22:06Z   2021-02-02T00:22:06Z    ACTIVE
001  2021-01-20T00:22:06Z   2021-02-02T00:23:06Z    ACTIVE
002  2021-01-30T00:22:06Z   2021-02-02T00:23:06Z    ACTIVE
003  2021-02-01T00:22:06Z   2021-02-02T00:23:06Z    ACTIVE
001  2021-01-20T00:22:06Z   2021-02-19T00:22:06Z    ACTIVE
002  2021-01-30T00:22:06Z   2021-02-19T00:22:06Z    ACTIVE
003  2021-02-01T00:22:06Z   2021-02-19T00:22:06Z    ACTIVE
001  2021-01-20T00:22:06Z   2021-02-20T00:22:06Z    ACTIVE
002  2021-01-30T00:22:06Z   2021-02-20T00:22:06Z    EXPIRED
003  2021-02-01T00:22:06Z   2021-02-20T00:22:06Z    EXPIRED
001  2021-01-20T00:22:06Z   2021-02-21T00:22:06Z    ACTIVE
002  2021-01-30T00:22:06Z   2021-02-21T00:22:06Z    EXPIRED
003  2021-02-01T00:22:06Z   2021-02-21T00:22:06Z    EXPIRED
001  2021-01-20T00:22:06Z   2021-02-30T00:22:06Z    ACTIVE
002  2021-01-30T00:22:06Z   2021-02-30T00:22:06Z    EXPIRED
003  2021-02-01T00:22:06Z   2021-02-30T00:22:06Z    EXPIRED
001  2021-01-20T00:22:06Z   2021-03-01T00:22:06Z    ACTIVE
002  2021-01-30T00:22:06Z   2021-03-01T00:22:06Z    EXPIRED
003  2021-02-01T00:22:06Z   2021-03-01T00:22:06Z    ACTIVE
001  2021-01-20T00:22:06Z   2021-03-22T00:22:06Z    EXPIRED
002  2021-01-30T00:22:06Z   2021-03-22T00:22:06Z    EXPIRED
003  2021-02-01T00:22:06Z   2021-03-22T00:22:06Z    EXPIRED

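Each report_date is the date on which every record was refreshed with its current status, like checking a pulse.

I only want the user's last status within the week that follows one month after the creation date (week 5).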

For example: ID = 001.

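Here we see that their creation date is 2021-01-20, which means one month from that date is 2021-02-20. I want to know their last status in the week that starts on that date.

You can see in the data above that they were ACTIVE in every listed report between 2021-02-20 and 2021-02-27.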

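For simplicity, we only want to know the last status change within this time range. Note that for ID=003, though they were EXPIRED in the preceding reports, they switched back to ACTIVE within the boundary (the 2021-03-01 report above).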

Anything later than one week past the one-month mark (anything after 5 weeks) is irrelevant.

You may also notice that one month from 2021-01-30 is 2021-02-30, which does not exist. In those cases, use the last day of the month, 2021-02-28.
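A minimal sketch of that rule in Snowflake: `DATEADD(month, 1, ...)` clamps to the last day of the target month when the original day does not exist there, so the window bounds can be derived straight from the creation date. The aliases `window_start` and `window_end` are illustrative names that do not appear in the original data.

```sql
-- Window = [creation + 1 month, creation + 1 month + 1 week].
-- DATEADD(month, ...) falls back to the last day of a shorter month,
-- so 2021-01-30 + 1 month resolves to 2021-02-28.
SELECT column1                                            AS id,
       column2::date                                      AS creation_date,
       DATEADD(month, 1, column2::date)                   AS window_start,
       DATEADD(week, 1, DATEADD(month, 1, column2::date)) AS window_end
FROM VALUES
    (1, '2021-01-20'),
    (2, '2021-01-30'),
    (3, '2021-02-01');
```

For the three sample IDs the windows should start on 2021-02-20, 2021-02-28 and 2021-03-01 respectively.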

Final output:

ID    week_5_status
001   ACTIVE
002   EXPIRED
003   ACTIVE

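First, convert the (presumably) text values into valid datetime values. Then filter the rows so that report_datetime is less than 6 weeks after creation_datetime. Take the maximum of that filtered list, then join back to the original data to get the status of the row that has that maximum.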

CREATE TABLE t (id int, creation_date VARCHAR(19), report_date VARCHAR(19), status text);
INSERT INTO t (id,creation_date,report_date,status) VALUES 
(1,'2021-01-20T00:22:06','2021-02-02T00:22:06','ACTIVE'),
(2,'2021-01-30T00:22:06','2021-02-02T00:22:06','ACTIVE'),
(3,'2021-02-01T00:22:06','2021-02-02T00:22:06','ACTIVE'),
(1,'2021-01-20T00:22:06','2021-02-02T00:23:06','ACTIVE'),
(2,'2021-01-30T00:22:06','2021-02-02T00:23:06','ACTIVE'),
(3,'2021-02-01T00:22:06','2021-02-02T00:23:06','ACTIVE'),
(1,'2021-01-20T00:22:06','2021-02-19T00:22:06','ACTIVE'),
(2,'2021-01-30T00:22:06','2021-02-19T00:22:06','ACTIVE'),
(3,'2021-02-01T00:22:06','2021-02-19T00:22:06','ACTIVE'),
(1,'2021-01-20T00:22:06','2021-02-20T00:22:06','ACTIVE'),
(2,'2021-01-30T00:22:06','2021-02-20T00:22:06','EXPIRED'),
(3,'2021-02-01T00:22:06','2021-02-20T00:22:06','EXPIRED'),
(1,'2021-01-20T00:22:06','2021-02-21T00:22:06','ACTIVE'),
(2,'2021-01-30T00:22:06','2021-02-21T00:22:06','EXPIRED'),
(3,'2021-02-01T00:22:06','2021-02-21T00:22:06','EXPIRED'),
(1,'2021-01-20T00:22:06','2021-02-30T00:22:06','ACTIVE'),
(2,'2021-01-30T00:22:06','2021-02-30T00:22:06','EXPIRED'),
(3,'2021-02-01T00:22:06','2021-02-30T00:22:06','EXPIRED'),
(1,'2021-01-20T00:22:06','2021-03-01T00:22:06','ACTIVE'),
(2,'2021-01-30T00:22:06','2021-03-01T00:22:06','EXPIRED'),
(3,'2021-02-01T00:22:06','2021-03-01T00:22:06','ACTIVE'),
(1,'2021-01-20T00:22:06','2021-03-22T00:22:06','EXPIRED'),
(2,'2021-01-30T00:22:06','2021-03-22T00:22:06','EXPIRED'),
(3,'2021-02-01T00:22:06','2021-03-22T00:22:06','EXPIRED');


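-- Parse the text columns into datetimes; the invalid '02-30' reports are mapped to '02-28'.
WITH dat AS
(
    SELECT id
         , CAST(creation_date AS datetime) AS creation_datetime
         , CAST(REPLACE(report_date,'02-30','02-28') AS datetime) AS report_datetime
         , status
    FROM t
),
-- Latest report per id that is less than 6 weeks after creation.
dat2 AS
(
    SELECT id
         , MAX(report_datetime) AS max_report_datetime
    FROM dat
    WHERE DATEDIFF(week,creation_datetime,report_datetime) < 6
    GROUP BY id
)
-- Join back to pick up the status of that latest report.
SELECT dat.*
FROM dat
     INNER JOIN dat2
             ON dat.id = dat2.id
            AND dat.report_datetime = dat2.max_report_datetime;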

dbfiddle.uk
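The `REPLACE` above only repairs the one bad value that shows up in the sample. As a hedged sketch for Snowflake, `TRY_TO_TIMESTAMP` could be used to find any report_date string that is not a real timestamp, since it returns NULL instead of raising an error. The aliases `raw_report_date` and `parsed_report_date` are illustrative.

```sql
-- TRY_TO_TIMESTAMP returns NULL when the string cannot be converted,
-- so invalid dates can be listed before deciding how to repair them.
SELECT column1                   AS raw_report_date,
       TRY_TO_TIMESTAMP(column1) AS parsed_report_date
FROM VALUES
    ('2021-02-28T00:22:06'),
    ('2021-02-30T00:22:06');   -- not a real calendar date
```

Filtering on `parsed_report_date IS NULL` would then list the rows that need fixing.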

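QUALIFY with ROW_NUMBER looks like what you want for selecting the "last status in the window".

For the data CTE, I changed the few invalid report_date values to real dates. I may have moved them in the wrong direction, but that does not affect the SQL.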

 WITH data(id,creation_date,report_date,status)AS (
     SELECT column1
        ,to_date(column2, 'YYYY-MM-DDThh:mi:ss')
        ,to_date(column3, 'YYYY-MM-DDThh:mi:ss')
        ,column4 
     FROM VALUES 
    (1,'2021-01-20T00:22:06','2021-02-02T00:22:06','ACTIVE'),
    (2,'2021-01-30T00:22:06','2021-02-02T00:22:06','ACTIVE'),
    (3,'2021-02-01T00:22:06','2021-02-02T00:22:06','ACTIVE'),
    (1,'2021-01-20T00:22:06','2021-02-02T00:23:06','ACTIVE'),
    (2,'2021-01-30T00:22:06','2021-02-02T00:23:06','ACTIVE'),
    (3,'2021-02-01T00:22:06','2021-02-02T00:23:06','ACTIVE'),
    (1,'2021-01-20T00:22:06','2021-02-19T00:22:06','ACTIVE'),
    (2,'2021-01-30T00:22:06','2021-02-19T00:22:06','ACTIVE'),
    (3,'2021-02-01T00:22:06','2021-02-19T00:22:06','ACTIVE'),
    (1,'2021-01-20T00:22:06','2021-02-20T00:22:06','ACTIVE'),
    (2,'2021-01-30T00:22:06','2021-02-20T00:22:06','EXPIRED'),
    (3,'2021-02-01T00:22:06','2021-02-20T00:22:06','EXPIRED'),
    (1,'2021-01-20T00:22:06','2021-02-21T00:22:06','ACTIVE'),
    (2,'2021-01-30T00:22:06','2021-02-21T00:22:06','EXPIRED'),
    (3,'2021-02-01T00:22:06','2021-02-21T00:22:06','EXPIRED'),
    (1,'2021-01-20T00:22:06','2021-02-28T00:22:06','ACTIVE'),
    (2,'2021-01-30T00:22:06','2021-02-28T00:22:06','EXPIRED'),
    (3,'2021-02-01T00:22:06','2021-02-28T00:22:06','EXPIRED'),
    (1,'2021-01-20T00:22:06','2021-03-01T00:22:06','ACTIVE'),
    (2,'2021-01-30T00:22:06','2021-03-01T00:22:06','EXPIRED'),
    (3,'2021-02-01T00:22:06','2021-03-01T00:22:06','ACTIVE'),
    (1,'2021-01-20T00:22:06','2021-03-22T00:22:06','EXPIRED'),
    (2,'2021-01-30T00:22:06','2021-03-22T00:22:06','EXPIRED'),
    (3,'2021-02-01T00:22:06','2021-03-22T00:22:06','EXPIRED')
)

The main SQL then becomes:

SELECT d.id
    ,d.creation_date
    ,d.report_date
    ,d.status
FROM data AS d
WHERE dateadd(week,5,d.creation_date) >= d.report_date
QUALIFY row_number() over (partition by id order by report_date desc) = 1 ;

which gives:

ID  CREATION_DATE  REPORT_DATE  STATUS
1   2021-01-20     2021-02-21   ACTIVE
2   2021-01-30     2021-03-01   EXPIRED
3   2021-02-01     2021-03-01   ACTIVE

Or, if you really only need the two columns:

SELECT d.id
    ,d.status
FROM data AS d
WHERE dateadd(week,5,d.creation_date) >= d.report_date
QUALIFY row_number() over (partition by id order by report_date desc) = 1 ;

which gives:

ID  STATUS
1   ACTIVE
2   EXPIRED
3   ACTIVE
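The queries above use a flat 5-week cutoff. If the cutoff should instead follow the question's wording (one calendar month plus one week, with month-end clamping), a variant might look like the sketch below. It assumes a reduced copy of the sample rows, and the alias `week_5_status` is borrowed from the question's desired output; the rest follows the same QUALIFY pattern.

```sql
-- Reduced sample: a few reports per id around the cutoff.
WITH data(id, creation_date, report_date, status) AS (
    SELECT column1, column2::date, column3::date, column4
    FROM VALUES
        (1, '2021-01-20', '2021-02-21', 'ACTIVE'),
        (1, '2021-01-20', '2021-02-28', 'ACTIVE'),
        (1, '2021-01-20', '2021-03-22', 'EXPIRED'),
        (2, '2021-01-30', '2021-02-21', 'EXPIRED'),
        (2, '2021-01-30', '2021-03-01', 'EXPIRED'),
        (2, '2021-01-30', '2021-03-22', 'EXPIRED'),
        (3, '2021-02-01', '2021-02-28', 'EXPIRED'),
        (3, '2021-02-01', '2021-03-01', 'ACTIVE'),
        (3, '2021-02-01', '2021-03-22', 'EXPIRED')
)
SELECT d.id,
       d.status AS week_5_status
FROM data AS d
-- Keep reports up to one week after the one-month mark.
WHERE d.report_date <= DATEADD(week, 1, DATEADD(month, 1, d.creation_date))
-- Per id, keep only the most recent remaining report.
QUALIFY ROW_NUMBER() OVER (PARTITION BY d.id ORDER BY d.report_date DESC) = 1;
```

This only applies an upper bound, so an id with no report inside the final week falls back to its most recent earlier status. Adding `d.report_date >= DATEADD(month, 1, d.creation_date)` to the WHERE clause would drop such ids instead.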