How to form groups of consecutive dates allowing for a given maximum gap?
Given a table like:
person_id | contact_day | days_last_contact | dash_group |
---|---|---|---|
1 | 2015-02-09 | | 1 |
1 | 2015-05-01 | 81 | 2 |
1 | 2015-05-02 | 1 | 2 |
1 | 2015-05-03 | 1 | 2 |
1 | 2015-06-01 | 29 | 3 |
1 | 2015-08-01 | 61 | 4 |
1 | 2015-08-04 | 3 | 4 |
1 | 2015-09-01 | 28 | 5 |
2 | 2015-05-01 | | 1 |
2 | 2015-06-01 | 31 | 2 |
2 | 2015-07-01 | 30 | 3 |
3 | 2015-05-01 | | 1 |
3 | 2015-05-02 | 1 | 1 |
3 | 2015-05-04 | 2 | 1 |
3 | 2015-06-01 | 28 | 2 |
3 | 2015-06-02 | 1 | 2 |
3 | 2015-06-06 | 4 | 3 |
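How can I identify streaks of days that are contiguous but allow for a maximum gap?
The original columns in the data are person_id and contact_day. I want to partition by person_id and "streak" (a group of days that are close together). My approach so far is to first calculate the number of days since the last contact (days_last_contact), and then try to use that to compute the column dash_group, which flags rows that are within a maximum threshold, for example 3 days.
How do I compute dash_group? I compute days_last_contact by subtracting from contact_day its own value lagged by 1 (partitioned by person_id and ordered by date):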
SELECT
contact_day - lag(contact_day, 1, NULL)
OVER (PARTITION BY person_id ORDER BY contact_day ASC)
AS days_last_contact
FROM mydata
;
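But how can I then use this to group together rows where days_last_contact does not exceed some threshold (3 days in this example)? In the example, dash_group 2 for person_id 1 recognizes that May 1, 2 and 3 are close together, but that person's next date is June 1, which is too far away (29 days since the last contact, greater than the threshold of 3), so it gets a new dash_group. Likewise, dash_group 4 puts August 1 and August 4 together because they are 3 days apart, whereas June 2 and June 6 (person 3) are 4 days apart and therefore fall into different groups.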
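To pin down the rule described above, here is a minimal sketch in plain Python rather than SQL. The function name assign_dash_groups and the parameter max_gap are made up for illustration; the only assumption is that the days of one person are already sorted ascending.

```python
from datetime import date


def assign_dash_groups(days, max_gap=3):
    """Return one group number per day; `days` must be sorted ascending."""
    groups = []
    current = 0
    previous = None
    for day in days:
        # Open a new group on the first row, or when the gap to the
        # previous contact is larger than the allowed maximum.
        if previous is None or (day - previous).days > max_gap:
            current += 1
        groups.append(current)
        previous = day
    return groups


# contact_day values of person_id 3 from the table in the question
person_3 = [date(2015, 5, 1), date(2015, 5, 2), date(2015, 5, 4),
            date(2015, 6, 1), date(2015, 6, 2), date(2015, 6, 6)]
print(assign_dash_groups(person_3))  # [1, 1, 1, 2, 2, 3]
```

A gap of exactly 3 days stays in the same group, while a gap of 4 days starts a new one, which matches the dash_group column of the sample table.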
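After looking around, I found for example this SO question where they point to the 'trick' #4 here, which is neat, but it only works for consecutive dates / gap-free series, and I need to allow arbitrary gaps.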
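Define your partition/pattern. Based on the date data you provided, the pattern is: same person_id, ranked by month.
So the window clause should order by person_id and year-month, and the partition should be person_id.
The key point is to get/compute the pattern. Here I deduced (guessed) that the pattern is year/month.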
alter table mydata add column dateym text;
update mydata set dateym = to_char(contact_day,'YYYY-MM');
SELECT
person_id,
contact_day,
dateym,
rank() OVER (PARTITION BY (person_id, dateym) ORDER BY person_id,
contact_day),
dense_rank() OVER (PARTITION BY person_id ORDER BY person_id, dateym)
FROM
mydata;
The gap can be arbitrary, but it has to be expressed as a conditional expression.
Using a recursive query:
WITH RECURSIVE zzz AS (
SELECT person_id
, contact_day
, md.days_last_contact
, row_number() OVER(PARTITION BY person_id ORDER BY contact_day)
AS dash_group
FROM mydata md
WHERE NOT EXISTS ( -- only the group *leaders*
SELECT * FROM mydata nx
WHERE nx.person_id = md.person_id
AND nx.contact_day < md.contact_day
AND nx.contact_day >= md.contact_day -3
)
UNION ALL
SELECT md.person_id
, md.contact_day
, md.days_last_contact
, zzz.dash_group
FROM zzz
JOIN mydata md ON md.person_id = zzz.person_id
AND md.contact_day > zzz.contact_day
AND md.contact_day <= zzz.contact_day +3
AND NOT EXISTS ( SELECT * -- eliminate the middle men ...
FROM mydata nx
WHERE nx.person_id = md.person_id
AND nx.contact_day > zzz.contact_day
AND nx.contact_day < md.contact_day
)
)
SELECT * FROM zzz
ORDER BY person_id,contact_day
;
There may be a shorter solution using window functions.
Result:
DROP SCHEMA
CREATE SCHEMA
SET
CREATE TABLE
INSERT 0 14
person_id | contact_day | days_last_contact | dash_group
-----------+-------------+-------------------+------------
1 | 2015-02-09 | | 1
1 | 2015-05-01 | 81 | 2
1 | 2015-05-02 | 1 | 2
1 | 2015-05-03 | 1 | 2
1 | 2015-06-01 | 29 | 3
1 | 2015-08-01 | 61 | 4
1 | 2015-08-04 | 3 | 4
1 | 2015-09-01 | 28 | 5
2 | 2015-05-01 | | 1
2 | 2015-06-01 | 31 | 2
2 | 2015-07-01 | 30 | 3
3 | 2015-05-01 | | 1
3 | 2015-05-02 | 1 | 1
3 | 2015-05-04 | 2 | 1
(14 rows)
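If I understand correctly, you can try using a condition inside the SUM window function.
If we create a suitable index on the mydata table (on the person_id and contact_day columns), we may get better performance for this query.
Query #1
So the query might look like this: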
SELECT
person_id,
contact_day,
days_last_contact,
SUM(CASE WHEN days_last_contact <= 3 THEN 0 ELSE 1 END) OVER(PARTITION BY person_id ORDER BY contact_day)
FROM mydata
ORDER BY person_id, contact_day
;
If days_last_contact needs to be computed, we can calculate it in a subquery.
SELECT
person_id,
contact_day,
days_last_contact,
SUM(CASE WHEN days_last_contact <= 3 THEN 0 ELSE 1 END) OVER(PARTITION BY person_id ORDER BY contact_day)
FROM (
SELECT person_id,
contact_day,
contact_day - lag(contact_day)
OVER (PARTITION BY person_id ORDER BY contact_day ASC)
AS days_last_contact
FROM mydata
) t1
ORDER BY person_id, contact_day
;
person_id | contact_day | days_last_contact | sum |
---|---|---|---|
1 | 2015-02-09T00:00:00.000Z | | 1 |
1 | 2015-05-01T00:00:00.000Z | 81 | 2 |
1 | 2015-05-02T00:00:00.000Z | 1 | 2 |
1 | 2015-05-03T00:00:00.000Z | 1 | 2 |
1 | 2015-06-01T00:00:00.000Z | 29 | 3 |
1 | 2015-08-01T00:00:00.000Z | 61 | 4 |
1 | 2015-08-04T00:00:00.000Z | 3 | 4 |
1 | 2015-09-01T00:00:00.000Z | 28 | 5 |
2 | 2015-05-01T00:00:00.000Z | | 1 |
2 | 2015-06-01T00:00:00.000Z | 31 | 2 |
2 | 2015-07-01T00:00:00.000Z | 30 | 3 |
3 | 2015-05-01T00:00:00.000Z | | 1 |
3 | 2015-05-02T00:00:00.000Z | 1 | 1 |
3 | 2015-05-04T00:00:00.000Z | 2 | 1 |
3 | 2015-06-01T00:00:00.000Z | 28 | 2 |
3 | 2015-06-02T00:00:00.000Z | 1 | 2 |
3 | 2015-06-06T00:00:00.000Z | 4 | 3 |
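To try the SUM-over-CASE idea without a PostgreSQL instance, the following sketch runs essentially the same query through Python's built-in sqlite3 module. This is an adaptation, not the query above verbatim: it assumes an SQLite build with window function support (3.25 or later), stores contact_day as ISO text, and replaces PostgreSQL's date subtraction with a difference of julianday() values. The alias dash_group on the running sum is added here for readability.

```python
import sqlite3

# Sample rows (person_id, contact_day) from the question
rows = [
    (1, "2015-02-09"), (1, "2015-05-01"), (1, "2015-05-02"), (1, "2015-05-03"),
    (1, "2015-06-01"), (1, "2015-08-01"), (1, "2015-08-04"), (1, "2015-09-01"),
    (2, "2015-05-01"), (2, "2015-06-01"), (2, "2015-07-01"),
    (3, "2015-05-01"), (3, "2015-05-02"), (3, "2015-05-04"),
    (3, "2015-06-01"), (3, "2015-06-02"), (3, "2015-06-06"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mydata (person_id INTEGER, contact_day TEXT)")
conn.executemany("INSERT INTO mydata VALUES (?, ?)", rows)

query = """
SELECT person_id, contact_day, days_last_contact,
       SUM(CASE WHEN days_last_contact <= 3 THEN 0 ELSE 1 END)
           OVER (PARTITION BY person_id ORDER BY contact_day) AS dash_group
FROM (
    SELECT person_id, contact_day,
           -- SQLite stand-in for PostgreSQL's contact_day - lag(contact_day)
           CAST(julianday(contact_day)
                - julianday(lag(contact_day)
                            OVER (PARTITION BY person_id ORDER BY contact_day))
                AS INTEGER) AS days_last_contact
    FROM mydata
) t1
ORDER BY person_id, contact_day
"""

result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

The first row of each person has a NULL days_last_contact, so the CASE expression falls through to ELSE 1 and the numbering starts at 1 for every person.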
Count the gaps (greater than the given tolerance) in a second window function to form the group numbers you are after:
SELECT person_id, contact_day
, count(*) FILTER (WHERE gap > 3) OVER (PARTITION BY person_id ORDER BY contact_day) AS dash_group
FROM (
SELECT person_id, contact_day
, contact_day - lag(contact_day) OVER (PARTITION BY person_id ORDER BY contact_day) AS gap
FROM mydata
) sub
ORDER BY person_id, contact_day; -- optional
db<>fiddle here
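One detail of the count-based variant may be worth checking by hand: the first row of each person has a NULL gap, which the filter does not count, so the running count should start at 0 rather than 1. The short Python sketch below mimics the two steps (gap per row, then running count of large gaps) for person_id 3; the variable names are made up for illustration.

```python
from datetime import date
from itertools import accumulate

# contact_day values of person_id 3 from the question
days = [date(2015, 5, 1), date(2015, 5, 2), date(2015, 5, 4),
        date(2015, 6, 1), date(2015, 6, 2), date(2015, 6, 6)]

# Step 1: gap to the previous row; None for the first row,
# in the same way lag() yields NULL there.
gaps = [None] + [(b - a).days for a, b in zip(days, days[1:])]

# Step 2: running count of rows whose gap exceeds the tolerance of 3 days.
flags = [1 if gap is not None and gap > 3 else 0 for gap in gaps]
zero_based = list(accumulate(flags))

# Adding 1 reproduces the numbering used in the question's sample table.
one_based = [g + 1 for g in zero_based]

print(zero_based)  # [0, 0, 0, 1, 1, 2]
print(one_based)   # [1, 1, 1, 2, 2, 3]
```

If the numbers have to match the dash_group column of the question exactly, adding 1 to the counted value in the query should give the same result; the grouping itself is identical either way.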
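About the aggregate FILTER clause:
- Aggregate columns with additional (distinct) filters
It's short and intuitive, and typically fastest. See:
This is the classic topic of "gaps and islands". Once you know to look for the tag gaps-and-islands, you'll find plenty of related or nearly identical questions and answers, such as:
Etc.
I have now tagged the question accordingly.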