How to add a running count to rows in a 'streak' of consecutive days

Question

Thanks to Mike for the suggestion to add the create/insert statements.

create table test (
  pid integer not null,
  date date not null,
  primary key (pid, date)
);

insert into test values
  (1,'2014-10-1')
, (1,'2014-10-2')
, (1,'2014-10-3')
, (1,'2014-10-5')
, (1,'2014-10-7')
, (2,'2014-10-1')
, (2,'2014-10-2')
, (2,'2014-10-3')
, (2,'2014-10-5')
, (2,'2014-10-7');

I want to add a new column, 'days in current streak', so that the result looks like this:

pid    | date      | in_streak
-------|-----------|----------
1      | 2014-10-1 | 1
1      | 2014-10-2 | 2
1      | 2014-10-3 | 3
1      | 2014-10-5 | 1
1      | 2014-10-7 | 1
2      | 2014-10-1 | 1
2      | 2014-10-2 | 2
2      | 2014-10-3 | 3
2      | 2014-10-5 | 1
2      | 2014-10-7 | 1

I have been trying to use the answers from

PostgreSQL: find number of consecutive days up until now
Return rows of the latest 'streak' of data

but I can't work out how to use the dense_rank() trick together with other window functions to get the right result.

Answer 1

You'll get more attention if you include CREATE TABLE statements and INSERT statements in your question.

create table test (
  pid integer not null,
  date date not null,
  primary key (pid, date)
);

insert into test values
(1,'2014-10-1'), (1,'2014-10-2'), (1,'2014-10-3'), (1,'2014-10-5'),
(1,'2014-10-7'), (2,'2014-10-1'), (2,'2014-10-2'), (2,'2014-10-3'),
(2,'2014-10-5'), (2,'2014-10-7');

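The principle is simple. Within a streak of distinct, consecutive dates, each date minus its row_number() is a constant. You can group by that constant and take dense_rank() over the result.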

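with grouped_dates as (
  select pid, date,
         -- date minus its row number within the pid is constant for a run of consecutive dates
         (date - (row_number() over (partition by pid order by date) || ' days')::interval)::date as grouping_date
  from test
)
-- partition by pid as well, so streaks of different pids are never mixed
select * , dense_rank() over (partition by pid, grouping_date order by date) as in_streak
from grouped_dates
order by pid, date;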

pid  date         grouping_date  in_streak
------------------------------------------
1    2014-10-01   2014-09-30     1
1    2014-10-02   2014-09-30     2
1    2014-10-03   2014-09-30     3
1    2014-10-05   2014-10-01     1
1    2014-10-07   2014-10-02     1
2    2014-10-01   2014-09-30     1
2    2014-10-02   2014-09-30     2
2    2014-10-03   2014-09-30     3
2    2014-10-05   2014-10-01     1
2    2014-10-07   2014-10-02     1
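
To make the "date minus row_number() is a constant" idea visible, the row number can be shown next to the difference. This is only a sketch that assumes the test table and rows from above; the column alias rn and the restriction to pid = 1 are added here for illustration. It uses date minus integer arithmetic instead of the interval cast, which should give the same grouping_date.

```sql
-- Assumes the "test" table and sample rows defined above.
-- "rn" is an illustrative alias, not part of the original query.
select pid, date,
       row_number() over (partition by pid order by date) as rn,
       -- date - integer yields a date; row_number() returns bigint, hence the cast
       date - (row_number() over (partition by pid order by date))::int as grouping_date
from test
where pid = 1
order by date;

-- pid  date         rn  grouping_date
-- 1    2014-10-01   1   2014-09-30
-- 1    2014-10-02   2   2014-09-30
-- 1    2014-10-03   3   2014-09-30
-- 1    2014-10-05   4   2014-10-01
-- 1    2014-10-07   5   2014-10-02
```

Every gap in the dates makes the date jump by more than the row number does, so grouping_date moves to a new value and a new streak starts.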

Answer 2

Building on this table (not using the SQL keyword "date" as a column name):

CREATE TABLE tbl(
  pid int
, the_date date
, PRIMARY KEY (pid, the_date)
);

Query:

SELECT pid, the_date
     , row_number() OVER (PARTITION BY pid, grp ORDER BY the_date) AS in_streak
FROM  (
   SELECT *
        , the_date - '2000-01-01'::date
        - row_number() OVER (PARTITION BY pid ORDER BY the_date) AS grp
   FROM   tbl
) sub
ORDER  BY pid, the_date;

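Subtracting one date from another yields an integer. Since you are looking for consecutive days, every next row in a streak is greater by one. If we subtract row_number() from that, the whole streak ends up in the same group (grp) per pid. Then it is simple to deal out a number per group.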
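
The intermediate values can be inspected with a sketch like the following. It does not need tbl to be populated, since it feeds a few sample rows through a VALUES list; the aliases day_no, rn and t are made up for this illustration.

```sql
-- Self-contained sketch: sample rows come from VALUES instead of tbl.
-- day_no, rn and t are illustrative names only.
SELECT pid, the_date
     , the_date - '2000-01-01'::date AS day_no   -- date - date = integer
     , row_number() OVER (PARTITION BY pid ORDER BY the_date) AS rn
     , the_date - '2000-01-01'::date
     - row_number() OVER (PARTITION BY pid ORDER BY the_date) AS grp
FROM  (
   VALUES (1, '2014-10-01'::date)
        , (1, '2014-10-02')
        , (1, '2014-10-03')
        , (1, '2014-10-05')
        , (1, '2014-10-07')
   ) t(pid, the_date)
ORDER  BY pid, the_date;

-- pid  the_date    day_no  rn  grp
-- 1    2014-10-01  5387    1   5386
-- 1    2014-10-02  5388    2   5386
-- 1    2014-10-03  5389    3   5386
-- 1    2014-10-05  5391    4   5387
-- 1    2014-10-07  5393    5   5388
```

The absolute value of grp has no meaning. It only matters that it stays the same within a streak and changes after a gap.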

grp is computed with two subtractions, which should be fastest. An equally fast alternative might be:

the_date - row_number() OVER (PARTITION BY pid ORDER BY the_date) * interval '1d' AS grp

One multiplication and one subtraction. String concatenation and casting are more expensive. Test with EXPLAIN ANALYZE.
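
A comparison could be set up roughly like this. It is a sketch only: it assumes tbl has been filled with a realistic number of rows, and it makes no claim about which variant wins on your data or PostgreSQL version.

```sql
-- Assumes tbl is populated with a representative volume of data.
-- Run each statement several times and compare the reported execution times.

-- Variant 1: two subtractions (integer arithmetic)
EXPLAIN ANALYZE
SELECT the_date - '2000-01-01'::date
     - row_number() OVER (PARTITION BY pid ORDER BY the_date) AS grp
FROM   tbl;

-- Variant 2: one multiplication, one subtraction (interval arithmetic)
EXPLAIN ANALYZE
SELECT the_date
     - row_number() OVER (PARTITION BY pid ORDER BY the_date) * interval '1d' AS grp
FROM   tbl;

-- Variant 3: string concatenation plus casts
EXPLAIN ANALYZE
SELECT (the_date
     - (row_number() OVER (PARTITION BY pid ORDER BY the_date) || ' days')::interval)::date AS grp
FROM   tbl;
```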

Don't forget to partition by pid in both steps, or you will inadvertently mix groups that should be kept separate.
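
A small counterexample may show what goes wrong. The sample rows below are invented for this sketch (they are not the data from the question), and the aliases without_pid, with_pid and t are illustrative. Here pid 2 happens to get the same grp value as the second streak of pid 1.

```sql
-- Self-contained sketch with made-up rows, chosen so that two pids share a grp value.
SELECT pid, the_date
     , row_number() OVER (PARTITION BY grp      ORDER BY the_date) AS without_pid  -- mixes pids
     , row_number() OVER (PARTITION BY pid, grp ORDER BY the_date) AS with_pid     -- correct
FROM  (
   SELECT *
        , the_date - '2000-01-01'::date
        - row_number() OVER (PARTITION BY pid ORDER BY the_date) AS grp
   FROM  (
      VALUES (1, '2014-10-01'::date)
           , (1, '2014-10-03')
           , (1, '2014-10-04')
           , (2, '2014-10-02')
      ) t(pid, the_date)
) sub
ORDER  BY pid, the_date;

-- pid  the_date    without_pid  with_pid
-- 1    2014-10-01  1            1
-- 1    2014-10-03  2            1
-- 1    2014-10-04  3            2
-- 2    2014-10-02  1            1
```

Without pid in the outer PARTITION BY, the row of pid 2 is counted as the first day of a streak that belongs to pid 1.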

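Using a subquery, since that is typically faster than a CTE (before PostgreSQL 12 a CTE was always materialized and acted as an optimization fence). There is nothing here that a plain subquery could not do.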

And since you mentioned it: dense_rank() is obviously not necessary here. Basic row_number() does the job.

Tags: sql, postgresql, date-arithmetic, window-functions, gaps-and-islands
