Partitioning and selecting a value in Snowflake based on distance from date
I have a 1-billion-row dataset that keeps growing with repeated records about customers.
ID creation_date report_date status
001 2021-01-20T00:22:06Z 2021-02-02T00:22:06Z ACTIVE
002 2021-01-30T00:22:06Z 2021-02-02T00:22:06Z ACTIVE
003 2021-02-01T00:22:06Z 2021-02-02T00:22:06Z ACTIVE
001 2021-01-20T00:22:06Z 2021-02-02T00:23:06Z ACTIVE
002 2021-01-30T00:22:06Z 2021-02-02T00:23:06Z ACTIVE
003 2021-02-01T00:22:06Z 2021-02-02T00:23:06Z ACTIVE
001 2021-01-20T00:22:06Z 2021-02-19T00:22:06Z ACTIVE
002 2021-01-30T00:22:06Z 2021-02-19T00:22:06Z ACTIVE
003 2021-02-01T00:22:06Z 2021-02-19T00:22:06Z ACTIVE
001 2021-01-20T00:22:06Z 2021-02-20T00:22:06Z ACTIVE
002 2021-01-30T00:22:06Z 2021-02-20T00:22:06Z EXPIRED
003 2021-02-01T00:22:06Z 2021-02-20T00:22:06Z EXPIRED
001 2021-01-20T00:22:06Z 2021-02-21T00:22:06Z ACTIVE
002 2021-01-30T00:22:06Z 2021-02-21T00:22:06Z EXPIRED
003 2021-02-01T00:22:06Z 2021-02-21T00:22:06Z EXPIRED
001 2021-01-20T00:22:06Z 2021-02-30T00:22:06Z ACTIVE
002 2021-01-30T00:22:06Z 2021-02-30T00:22:06Z EXPIRED
003 2021-02-01T00:22:06Z 2021-02-30T00:22:06Z EXPIRED
001 2021-01-20T00:22:06Z 2021-03-01T00:22:06Z ACTIVE
002 2021-01-30T00:22:06Z 2021-03-01T00:22:06Z EXPIRED
003 2021-02-01T00:22:06Z 2021-03-01T00:22:06Z ACTIVE
001 2021-01-20T00:22:06Z 2021-03-22T00:22:06Z EXPIRED
002 2021-01-30T00:22:06Z 2021-03-22T00:22:06Z EXPIRED
003 2021-02-01T00:22:06Z 2021-03-22T00:22:06Z EXPIRED
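Each report_date is a date on which all records are updated with their current status, like a pulse check.
I only want the user's last status during the week that follows one month after the creation date (week 5).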
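Example: ID = 001.
Here we see that their creation date is 2021-01-20, which means one month from this date is 2021-02-20. I would like to know:
- What is the final status of this user across the report dates between 2021-02-20 and 2021-02-27?
You can see in the data above that this user remains ACTIVE in every listed report between 2021-02-20 and 2021-02-27.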
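For simplicity, we only want to know the last status change in this time frame. Notice that ID=003 switched from EXPIRED to ACTIVE a day before 2021-02-22, so though they were EXPIRED, they switched back to active within the boundary.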
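Anything later than one week after the one-month mark (anything after 5 weeks) is irrelevant.
You might also notice that 1 month from 2021-01-30 is 2021-02-30, which does not make sense. In these cases, use the last day of the month, i.e. 2021-02-28.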
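The window rule described above (one month after creation, clamped to the last day of the month, then one more week) can be sanity-checked outside Snowflake. The following is a minimal Python sketch, not Snowflake SQL; the helper names `add_one_month` and `week_5_window` are hypothetical, and it assumes times of day are ignored.

```python
import calendar
from datetime import date, timedelta
from typing import Tuple


def add_one_month(d: date) -> date:
    """Same day in the next month, clamped to that month's last day."""
    year = d.year + (d.month // 12)   # roll over to the next year after December
    month = d.month % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return d.replace(year=year, month=month, day=min(d.day, last_day))


def week_5_window(creation: date) -> Tuple[date, date]:
    """Start and end of the week that follows one month after creation."""
    start = add_one_month(creation)
    return start, start + timedelta(days=7)


# Creation dates taken from the sample data (times dropped).
for created in (date(2021, 1, 20), date(2021, 1, 30), date(2021, 2, 1)):
    start, end = week_5_window(created)
    print(created, "->", start, "to", end)
# 2021-01-20 -> 2021-02-20 to 2021-02-27
# 2021-01-30 -> 2021-02-28 to 2021-03-07
# 2021-02-01 -> 2021-03-01 to 2021-03-08
```

The 2021-01-30 case shows the clamp: February 2021 has 28 days, so the window starts on 2021-02-28 rather than the non-existent 2021-02-30.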
Final output:
ID week_5_status
001 ACTIVE
002 EXPIRED
003 ACTIVE
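First, convert the (presumably) text values into valid datetime values. Then filter the rows so that report_datetime is less than 6 weeks after creation_datetime. Take the maximum of that filtered list, then join back to the original data to get the status of the row with the maximum value.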
CREATE TABLE t (id int, creation_date VARCHAR(19), report_date VARCHAR(19), status text);
INSERT INTO t (id,creation_date,report_date,status) VALUES
(1,'2021-01-20T00:22:06','2021-02-02T00:22:06','ACTIVE'),
(2,'2021-01-30T00:22:06','2021-02-02T00:22:06','ACTIVE'),
(3,'2021-02-01T00:22:06','2021-02-02T00:22:06','ACTIVE'),
(1,'2021-01-20T00:22:06','2021-02-02T00:23:06','ACTIVE'),
(2,'2021-01-30T00:22:06','2021-02-02T00:23:06','ACTIVE'),
(3,'2021-02-01T00:22:06','2021-02-02T00:23:06','ACTIVE'),
(1,'2021-01-20T00:22:06','2021-02-19T00:22:06','ACTIVE'),
(2,'2021-01-30T00:22:06','2021-02-19T00:22:06','ACTIVE'),
(3,'2021-02-01T00:22:06','2021-02-19T00:22:06','ACTIVE'),
(1,'2021-01-20T00:22:06','2021-02-20T00:22:06','ACTIVE'),
(2,'2021-01-30T00:22:06','2021-02-20T00:22:06','EXPIRED'),
(3,'2021-02-01T00:22:06','2021-02-20T00:22:06','EXPIRED'),
(1,'2021-01-20T00:22:06','2021-02-21T00:22:06','ACTIVE'),
(2,'2021-01-30T00:22:06','2021-02-21T00:22:06','EXPIRED'),
(3,'2021-02-01T00:22:06','2021-02-21T00:22:06','EXPIRED'),
(1,'2021-01-20T00:22:06','2021-02-30T00:22:06','ACTIVE'),
(2,'2021-01-30T00:22:06','2021-02-30T00:22:06','EXPIRED'),
(3,'2021-02-01T00:22:06','2021-02-30T00:22:06','EXPIRED'),
(1,'2021-01-20T00:22:06','2021-03-01T00:22:06','ACTIVE'),
(2,'2021-01-30T00:22:06','2021-03-01T00:22:06','EXPIRED'),
(3,'2021-02-01T00:22:06','2021-03-01T00:22:06','ACTIVE'),
(1,'2021-01-20T00:22:06','2021-03-22T00:22:06','EXPIRED'),
(2,'2021-01-30T00:22:06','2021-03-22T00:22:06','EXPIRED'),
(3,'2021-02-01T00:22:06','2021-03-22T00:22:06','EXPIRED');
WITH dat AS (
    -- Cast the text columns to datetimes; the invalid 02-30 is mapped to 02-28
    SELECT id
         , CAST(creation_date AS datetime) AS creation_datetime
         , CAST(REPLACE(report_date, '02-30', '02-28') AS datetime) AS report_datetime
         , status
    FROM t
),
dat2 AS (
    -- Latest report per id that falls less than 6 weeks after creation
    SELECT id
         , MAX(report_datetime) AS max_report_datetime
    FROM dat
    WHERE DATEDIFF(week, creation_datetime, report_datetime) < 6
    GROUP BY id
)
-- Join back to pick up the status of that latest row
SELECT dat.*
FROM dat
INNER JOIN dat2
    ON dat.id = dat2.id
   AND dat.report_datetime = dat2.max_report_datetime;
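QUALIFY and ROW_NUMBER seem to be what you want for the "last status" window in the select.
For the data CTE, I changed the few invalid report_date values to real dates. I may have moved them in the wrong direction, but that does not affect the SQL.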
WITH data(id,creation_date,report_date,status)AS (
SELECT column1
,to_date(column2, 'YYYY-MM-DDThh:mi:ss')
,to_date(column3, 'YYYY-MM-DDThh:mi:ss')
,column4
FROM VALUES
(1,'2021-01-20T00:22:06','2021-02-02T00:22:06','ACTIVE'),
(2,'2021-01-30T00:22:06','2021-02-02T00:22:06','ACTIVE'),
(3,'2021-02-01T00:22:06','2021-02-02T00:22:06','ACTIVE'),
(1,'2021-01-20T00:22:06','2021-02-02T00:23:06','ACTIVE'),
(2,'2021-01-30T00:22:06','2021-02-02T00:23:06','ACTIVE'),
(3,'2021-02-01T00:22:06','2021-02-02T00:23:06','ACTIVE'),
(1,'2021-01-20T00:22:06','2021-02-19T00:22:06','ACTIVE'),
(2,'2021-01-30T00:22:06','2021-02-19T00:22:06','ACTIVE'),
(3,'2021-02-01T00:22:06','2021-02-19T00:22:06','ACTIVE'),
(1,'2021-01-20T00:22:06','2021-02-20T00:22:06','ACTIVE'),
(2,'2021-01-30T00:22:06','2021-02-20T00:22:06','EXPIRED'),
(3,'2021-02-01T00:22:06','2021-02-20T00:22:06','EXPIRED'),
(1,'2021-01-20T00:22:06','2021-02-21T00:22:06','ACTIVE'),
(2,'2021-01-30T00:22:06','2021-02-21T00:22:06','EXPIRED'),
(3,'2021-02-01T00:22:06','2021-02-21T00:22:06','EXPIRED'),
(1,'2021-01-20T00:22:06','2021-02-28T00:22:06','ACTIVE'),
(2,'2021-01-30T00:22:06','2021-02-28T00:22:06','EXPIRED'),
(3,'2021-02-01T00:22:06','2021-02-28T00:22:06','EXPIRED'),
(1,'2021-01-20T00:22:06','2021-03-01T00:22:06','ACTIVE'),
(2,'2021-01-30T00:22:06','2021-03-01T00:22:06','EXPIRED'),
(3,'2021-02-01T00:22:06','2021-03-01T00:22:06','ACTIVE'),
(1,'2021-01-20T00:22:06','2021-03-22T00:22:06','EXPIRED'),
(2,'2021-01-30T00:22:06','2021-03-22T00:22:06','EXPIRED'),
(3,'2021-02-01T00:22:06','2021-03-22T00:22:06','EXPIRED')
)
The main SQL becomes:
SELECT d.id
,d.creation_date
,d.report_date
,d.status
FROM data AS d
WHERE dateadd(week,5,d.creation_date) >= d.report_date
QUALIFY row_number() over (partition by id order by report_date desc) = 1 ;
which gives:
ID | CREATION_DATE | REPORT_DATE | STATUS |
---|---|---|---|
1 | 2021-01-20 | 2021-02-21 | ACTIVE |
2 | 2021-01-30 | 2021-03-01 | EXPIRED |
3 | 2021-02-01 | 2021-03-01 | ACTIVE |
Or, if you really only need the two columns:
SELECT d.id
,d.status
FROM data AS d
WHERE dateadd(week,5,d.creation_date) >= d.report_date
QUALIFY row_number() over (partition by id order by report_date desc) = 1 ;
which gives:
ID | STATUS |
---|---|
1 | ACTIVE |
2 | EXPIRED |
3 | ACTIVE |
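Both queries above filter on a week-based offset from the creation date rather than on the literal "one month, then one week" window from the question. As a cross-check of the expected output, here is a small Python sketch of that literal window applied to the sample data. It is only a model of the logic, not Snowflake SQL: the helper names are hypothetical, times of day are dropped, both window bounds are assumed inclusive, and the invalid 2021-02-30 is read as 2021-02-28.

```python
import calendar
from datetime import date, timedelta

CREATED = {1: date(2021, 1, 20), 2: date(2021, 1, 30), 3: date(2021, 2, 1)}

# (report_date, status of ID 1, ID 2, ID 3), transcribed from the sample data.
# Assumption: times are dropped and 2021-02-30 is read as 2021-02-28.
REPORTS = [
    (date(2021, 2, 2), "ACTIVE", "ACTIVE", "ACTIVE"),
    (date(2021, 2, 19), "ACTIVE", "ACTIVE", "ACTIVE"),
    (date(2021, 2, 20), "ACTIVE", "EXPIRED", "EXPIRED"),
    (date(2021, 2, 21), "ACTIVE", "EXPIRED", "EXPIRED"),
    (date(2021, 2, 28), "ACTIVE", "EXPIRED", "EXPIRED"),
    (date(2021, 3, 1), "ACTIVE", "EXPIRED", "ACTIVE"),
    (date(2021, 3, 22), "EXPIRED", "EXPIRED", "EXPIRED"),
]


def add_one_month(d):
    """Same day in the next month, clamped to that month's last day."""
    year = d.year + (d.month // 12)
    month = d.month % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return d.replace(year=year, month=month, day=min(d.day, last_day))


def week_5_status(created, reports):
    """Status of the latest report inside [created + 1 month, + 7 days]."""
    start = add_one_month(created)
    end = start + timedelta(days=7)
    in_window = [(r, s) for r, s in reports if start <= r <= end]
    if not in_window:
        return None  # no report fell inside the window
    return max(in_window, key=lambda rs: rs[0])[1]


result = {
    cid: week_5_status(created, [(row[0], row[cid]) for row in REPORTS])
    for cid, created in CREATED.items()
}
print(result)  # {1: 'ACTIVE', 2: 'EXPIRED', 3: 'ACTIVE'}
```

On this sample the literal window and the week-based filters agree, but they can diverge on other data because one month is not always four weeks.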