复杂 'Gaps and Islands' 问题

Question

我在 Postgres 数据库中有一个 table，如下所示：

person    |     eventdate     |  type 
--------------------------------------
<uuid-1>  |   2016-05-14      |   300
<uuid-3>  |   2016-05-14      |   300
<uuid-1>  |   2016-05-15      |   301
<uuid-1>  |   2016-05-16      |   301
<uuid-1>  |   2016-05-18      |   304
<uuid-1>  |   2016-05-22      |   300
<uuid-2>  |   2016-05-22      |   304
<uuid-2>  |   2016-05-27      |   301
<uuid-1>  |   2016-05-30      |   300
<uuid-1>  |   2016-06-01      |   300
<uuid-2>  |   2016-06-15      |   501
<uuid-2>  |   2016-06-16      |   301
<uuid-4>  |   2016-06-16      |   300
<uuid-5>  |   2016-06-20      |   300
<uuid-1>  |   2016-06-21      |   300
<uuid-2>  |   2016-06-21      |   300
<uuid-2>  |   2016-06-23      |   301
<uuid-2>  |   2016-06-30      |   300
<uuid-3>  |   2016-06-30      |   300
<uuid-4>  |   2016-06-30      |   300

table 包含员工因不同缺勤原因（类型）缺勤的每一天的非连续日条目。但是，缺勤期可能会跨越这些天中的几天，并且在之前相同类型的缺勤后 5 天内的任何缺勤条目仍被视为同一缺勤的一部分 'period'。

我需要获取每个员工缺勤期间的输出，包括这些期间的开始和结束日期，以及该多日期跨度期间内的总天数。

出于本报告的目的，不同类型的缺勤被视为相同的事实使情况更加复杂。因此在上面的示例中，类型 300、301、304 将被视为相同。

所以从我上面的例子来看，下面就是我想要的...

person    |     startdate     |       enddate     |  days   |  type 
--------------------------------------------------------------------
<uuid-1>  |     2016-05-14    |     2016-05-22    |   5     |   300
<uuid-3>  |     2016-05-14    |     2016-04-14    |   1     |   300
<uuid-2>  |     2016-05-22    |     2016-04-27    |   2     |   304
<uuid-1>  |     2016-05-30    |     2016-06-01    |   2     |   300
<uuid-2>  |     2016-06-15    |     2016-06-15    |   1     |   501
<uuid-2>  |     2016-06-16    |     2016-06-16    |   1     |   301
<uuid-4>  |     2016-06-16    |     2016-06-16    |   1     |   300
<uuid-5>  |     2016-06-20    |     2016-06-20    |   1     |   300
<uuid-1>  |     2016-06-21    |     2016-06-21    |   1     |   300
<uuid-2>  |     2016-06-21    |     2016-06-23    |   2     |   300
<uuid-2>  |     2016-06-30    |     2016-06-30    |   1     |   300
<uuid-3>  |     2016-06-30    |     2016-06-30    |   1     |   300
<uuid-4>  |     2016-06-30    |     2016-06-30    |   1     |   300

如何查询此 table 到此输出？

Answer 1

不清楚您如何确定每个时期的 type。我选择了最小的数字。

假设这个基本的 table 定义：

CREATE TABLE tbl (person text, eventdate date, type int);

基本上，我建议在两个嵌套子查询中使用window functions来识别同一时期（岛）的成员。然后汇总：

SELECT person, period
     , min(eventdate) AS startdate
     , max(eventdate) AS enddate
     , count(*)       AS days
     , min(type)      AS type
FROM  (
   SELECT person, eventdate, type
        , count(gap) OVER (PARTITION BY person ORDER BY eventdate) AS period
   FROM  (
      SELECT person, eventdate, type
           , CASE WHEN lag(eventdate) OVER (PARTITION BY person ORDER BY eventdate)
                     > eventdate - 6  -- within 5 days
                  THEN NULL           -- same period
                  ELSE TRUE           -- next period
             END AS gap
      FROM   tbl
      ) sub
   ) sub
GROUP  BY person, period
ORDER  BY person, period;

结果（基于您的示例数据）：

  person  | period | startdate  |  enddate   | days | type
----------+--------+------------+------------+------+------
 <uuid-1> |      1 | 2016-05-14 | 2016-05-22 |    5 |  300
 <uuid-1> |      2 | 2016-05-30 | 2016-06-01 |    2 |  300
 <uuid-1> |      3 | 2016-06-21 | 2016-06-21 |    1 |  300
 <uuid-2> |      1 | 2016-05-22 | 2016-05-27 |    2 |  301
 <uuid-2> |      2 | 2016-06-15 | 2016-06-23 |    4 |  300
 <uuid-2> |      3 | 2016-06-30 | 2016-06-30 |    1 |  300
 <uuid-3> |      1 | 2016-05-14 | 2016-05-14 |    1 |  300
 <uuid-3> |      2 | 2016-06-30 | 2016-06-30 |    1 |  300
 <uuid-4> |      1 | 2016-06-16 | 2016-06-16 |    1 |  300
 <uuid-4> |      2 | 2016-06-30 | 2016-06-30 |    1 |  300
 <uuid-5> |      1 | 2016-06-20 | 2016-06-20 |    1 |  300

如果同一个人的同一天可以用不同的类型多次输入，而您只想计算 distinct 天，则设为：count(DISTINCT eventdate) AS days。

复杂 'Gaps and Islands' 问题

Complex 'Gaps and Islands' issue

sql

postgresql

window-functions

gaps-and-islands