How to form groups of consecutive dates allowing for a given maximum gap?

Given a table like this:

 person_id | contact_day | days_last_contact | dash_group
-----------+-------------+-------------------+------------
         1 | 2015-02-09  |                   |          1
         1 | 2015-05-01  |                81 |          2
         1 | 2015-05-02  |                 1 |          2
         1 | 2015-05-03  |                 1 |          2
         1 | 2015-06-01  |                29 |          3
         1 | 2015-08-01  |                61 |          4
         1 | 2015-08-04  |                 3 |          4
         1 | 2015-09-01  |                28 |          5
         2 | 2015-05-01  |                   |          1
         2 | 2015-06-01  |                31 |          2
         2 | 2015-07-01  |                30 |          3
         3 | 2015-05-01  |                   |          1
         3 | 2015-05-02  |                 1 |          1
         3 | 2015-05-04  |                 2 |          1
         3 | 2015-06-01  |                28 |          2
         3 | 2015-06-02  |                 1 |          2
         3 | 2015-06-06  |                 4 |          3

See also this DB Fiddle example.
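The fiddle itself is not reproduced here. As a minimal sketch for following along, the sample data can be recreated roughly like this. The column types (integer, date) are assumptions, and days_last_contact and dash_group are left out because they are derived columns; any query further down that reads days_last_contact directly from mydata would need it added or computed first.

```sql
-- Hypothetical setup script; the column types are assumptions.
CREATE TABLE mydata (
    person_id   integer NOT NULL,
    contact_day date    NOT NULL
);

-- The 17 sample rows from the table above (original columns only).
INSERT INTO mydata (person_id, contact_day) VALUES
    (1, '2015-02-09'),
    (1, '2015-05-01'),
    (1, '2015-05-02'),
    (1, '2015-05-03'),
    (1, '2015-06-01'),
    (1, '2015-08-01'),
    (1, '2015-08-04'),
    (1, '2015-09-01'),
    (2, '2015-05-01'),
    (2, '2015-06-01'),
    (2, '2015-07-01'),
    (3, '2015-05-01'),
    (3, '2015-05-02'),
    (3, '2015-05-04'),
    (3, '2015-06-01'),
    (3, '2015-06-02'),
    (3, '2015-06-06');
```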

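How can I identify streaks of days that are consecutive, while allowing for a maximum gap?

The original columns in the data are person_id and contact_day. I want to partition by person_id and by "streak" (a group of nearby days). My approach so far is to first compute the number of days since the last contact (days_last_contact), and then try to use it to compute the column dash_group, which flags rows that fall within a maximum threshold, for example 3 days.

How can I calculate dash_group? I calculate days_last_contact by subtracting from contact_day its value lagged by 1 (partitioned by person_id and ordered by date):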

SELECT 
  contact_day - lag(contact_day, 1, NULL) 
    OVER (PARTITION BY person_id ORDER BY contact_day ASC) 
    AS days_last_contact
FROM mydata
;

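But how do I then use this to group together rows where days_last_contact is within some threshold (3 days in this example)? In this example, dash_group 2 for person_id 1 identifies that May 1, 2 and 3 are close together, but the next date for that person is June 1, which is too far away (29 days since the last contact, greater than the threshold of 3), so it gets a new dash_group. Likewise, dash_group 4 groups August 1 and August 4 together because the difference is 3, but June 2 and June 6 (person 3) are 4 days apart, so they end up in different groups.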

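After looking around, I found for example this SO question where they point to the 'trick' #4 here, which is a neat hack, but it only works for consecutive dates / gapless series, and I need to allow for arbitrary gaps.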

demo

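Define your partition/pattern. Based on the date data you provided, the pattern is: same person_id, ranked by month.
So the window's ORDER BY should be person_id, year-and-month, and the partition should be person_id.
The key point is to get/compute the pattern. Here I deduced/guessed that the pattern is a year/month pattern.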

ALTER TABLE mydata ADD COLUMN dateym text;
UPDATE mydata SET dateym = to_char(contact_day, 'YYYY-MM');

SELECT
    person_id,
    contact_day,
    dateym,
    rank() OVER (PARTITION BY (person_id, dateym) ORDER BY person_id, contact_day),
    dense_rank() OVER (PARTITION BY person_id ORDER BY person_id, dateym)
FROM
    mydata;

The gap can be arbitrary, but it has to be expressed as a conditional expression.

Using a recursive query:


WITH RECURSIVE zzz AS (
    SELECT person_id
    , contact_day
    , md.days_last_contact
    , row_number() OVER(PARTITION BY person_id ORDER BY contact_day)
        AS dash_group
    FROM mydata md
    WHERE NOT EXISTS ( -- only the group *leaders*
            SELECT * FROM mydata nx
            WHERE nx.person_id = md.person_id
            AND nx.contact_day < md.contact_day
            AND nx.contact_day >= md.contact_day -3
            )
UNION ALL
    SELECT md.person_id
    , md.contact_day
    , md.days_last_contact
    , zzz.dash_group
    FROM zzz
    JOIN mydata md ON md.person_id = zzz.person_id
            AND md.contact_day > zzz.contact_day
            AND md.contact_day <= zzz.contact_day +3
            AND NOT EXISTS ( SELECT * -- eliminate the middle men ...
                    FROM mydata nx
                    WHERE nx.person_id = md.person_id
                    AND nx.contact_day > zzz.contact_day
                    AND nx.contact_day < md.contact_day
            )
    )
SELECT * FROM zzz
ORDER BY person_id,contact_day
    ;

There may be a shorter solution using window functions.

Result:


DROP SCHEMA
CREATE SCHEMA
SET
CREATE TABLE
INSERT 0 14
 person_id | contact_day | days_last_contact | dash_group 
-----------+-------------+-------------------+------------
         1 | 2015-02-09  |                   |          1
         1 | 2015-05-01  |                81 |          2
         1 | 2015-05-02  |                 1 |          2
         1 | 2015-05-03  |                 1 |          2
         1 | 2015-06-01  |                29 |          3
         1 | 2015-08-01  |                61 |          4
         1 | 2015-08-04  |                 3 |          4
         1 | 2015-09-01  |                28 |          5
         2 | 2015-05-01  |                   |          1
         2 | 2015-06-01  |                31 |          2
         2 | 2015-07-01  |                30 |          3
         3 | 2015-05-01  |                   |          1
         3 | 2015-05-02  |                 1 |          1
         3 | 2015-05-04  |                 2 |          1
(14 rows)

If I understand correctly, you can try using a condition inside the SUM window function.

If we create a suitable index on the mydata table (on the person_id and contact_day columns), we may get better performance for this query.
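As a rough sketch of what such an index could look like (the index name is made up for illustration, and whether the planner actually uses it depends on table size and statistics):

```sql
-- Hypothetical index name; the column order matches
-- PARTITION BY person_id ORDER BY contact_day in the window definition.
CREATE INDEX mydata_person_id_contact_day_idx
    ON mydata (person_id, contact_day);
```

Running the query under EXPLAIN (ANALYZE) before and after creating the index is one way to check whether it makes a difference for your data.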

Query #1

So the query might look like this:

SELECT 
  person_id, 
  contact_day, 
  days_last_contact,
  SUM(CASE WHEN days_last_contact <= 3 THEN 0 ELSE 1 END) OVER(PARTITION BY person_id ORDER BY contact_day) 
FROM mydata
ORDER BY person_id, contact_day
;

If days_last_contact still needs to be calculated, we can compute it in a subquery.

SELECT 
  person_id, 
  contact_day, 
  days_last_contact,
  SUM(CASE WHEN days_last_contact <= 3 THEN 0 ELSE 1 END) OVER(PARTITION BY person_id ORDER BY contact_day) 
FROM (
    SELECT person_id,
           contact_day,
           contact_day - lag(contact_day) 
        OVER (PARTITION BY person_id ORDER BY contact_day ASC) 
        AS days_last_contact
    FROM mydata
) t1
ORDER BY person_id, contact_day
;

 person_id | contact_day              | days_last_contact | sum
-----------+--------------------------+-------------------+-----
         1 | 2015-02-09T00:00:00.000Z |                   |   1
         1 | 2015-05-01T00:00:00.000Z |                81 |   2
         1 | 2015-05-02T00:00:00.000Z |                 1 |   2
         1 | 2015-05-03T00:00:00.000Z |                 1 |   2
         1 | 2015-06-01T00:00:00.000Z |                29 |   3
         1 | 2015-08-01T00:00:00.000Z |                61 |   4
         1 | 2015-08-04T00:00:00.000Z |                 3 |   4
         1 | 2015-09-01T00:00:00.000Z |                28 |   5
         2 | 2015-05-01T00:00:00.000Z |                   |   1
         2 | 2015-06-01T00:00:00.000Z |                31 |   2
         2 | 2015-07-01T00:00:00.000Z |                30 |   3
         3 | 2015-05-01T00:00:00.000Z |                   |   1
         3 | 2015-05-02T00:00:00.000Z |                 1 |   1
         3 | 2015-05-04T00:00:00.000Z |                 2 |   1
         3 | 2015-06-01T00:00:00.000Z |                28 |   2
         3 | 2015-06-02T00:00:00.000Z |                 1 |   2
         3 | 2015-06-06T00:00:00.000Z |                 4 |   3

View on DB Fiddle

Counting the gaps (greater than the given tolerance) in a second window function forms the group numbers you are after:

SELECT person_id, contact_day
     , count(*) FILTER (WHERE gap > 3) OVER (PARTITION BY person_id ORDER BY contact_day) AS dash_group
FROM  (
   SELECT person_id, contact_day
        , contact_day - lag(contact_day) OVER (PARTITION BY person_id ORDER BY contact_day) AS gap
   FROM   mydata
   ) sub
ORDER  BY person_id, contact_day;  -- optional

db<>fiddle here
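Note that a running count of gaps starts at 0 for each person, because the first row has a NULL gap and is not counted. If 1-based group numbers like those in the question are wanted, or one summary row per group, something along these lines might do. This is only a sketch: the NULL check, the derived-table alias grouped, and the output column names first_day, last_day and contacts are my additions for illustration.

```sql
SELECT person_id, dash_group
     , min(contact_day) AS first_day   -- start of the island
     , max(contact_day) AS last_day    -- end of the island
     , count(*)         AS contacts    -- rows in the island
FROM  (
   SELECT person_id, contact_day
          -- also counting the first row (gap IS NULL) makes numbering start at 1
        , count(*) FILTER (WHERE gap IS NULL OR gap > 3)
             OVER (PARTITION BY person_id ORDER BY contact_day) AS dash_group
   FROM  (
      SELECT person_id, contact_day
           , contact_day - lag(contact_day)
                OVER (PARTITION BY person_id ORDER BY contact_day) AS gap
      FROM   mydata
      ) sub
   ) grouped
GROUP  BY person_id, dash_group
ORDER  BY person_id, dash_group;
```

The tolerance (3 here) appears only once, so it is easy to change or to pass in as a parameter.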

About the aggregate FILTER clause:

  • Aggregate columns with additional (distinct) filters

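It's short and intuitive, and typically fastest. See:

This is a classic "gaps and islands" problem. Once you know to look for the tag, you'll find plenty of related or nearly identical questions and answers, such as:

Etc.

I have now tagged the question accordingly.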