Redshift SUM() window function behavior

Question

I'm trying to use Redshift SUM() as a window function to compute a cumulative sum. My data looks like this:

ID	item_date	item_count
12	01/01/2019	11
12	02/01/2019	8
12	03/01/2019	0
12	04/01/2019	5
12	05/01/2019	21
12	06/01/2019	0

Currently, my sum is written like this:

SUM(item_count) over (partition by ID order by item_date rows unbounded preceding) as cumulative_count

But it produces this output:

ID	item_date	item_count	cumulative_count
12	01/01/2019	11	11
12	02/01/2019	8	19
12	03/01/2019	0	0
12	04/01/2019	5	24
12	05/01/2019	21	45
12	06/01/2019	0	0

The behavior is correct except on rows where item_count = 0. The output I want is:

ID	item_date	item_count	cumulative_count
12	01/01/2019	11	11
12	02/01/2019	8	19
12	03/01/2019	0	19
12	04/01/2019	5	24
12	05/01/2019	21	45
12	06/01/2019	0	45

I've looked into using the LAST_VALUE() function as a way to backfill the zero values, but in Redshift you can't nest window functions.

Has anyone seen this before?

Answer 1

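Redshift is a tried-and-true database, so a bug in such a basic function seems unlikely, but it should be checked. I threw together this test case SQL and ran it on my cluster, and it produced the expected results.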

create table test (ID int,  item_date date, item_count int);

insert into test values 
(12, '01/01/2019', 11),
(12, '02/01/2019', 8),
(12, '03/01/2019', 0),
(12, '04/01/2019', 5),
(12, '05/01/2019', 21),
(12, '06/01/2019', 0);

select *, SUM(item_count) over (partition by ID order by item_date rows unbounded preceding) as cumulative_count
from test;

It produced:

id | item_date  | item_count | cumulative_count
---+------------+------------+-----------------
12 | 2019-01-01 |         11 |               11
12 | 2019-02-01 |          8 |               19
12 | 2019-03-01 |          0 |               19
12 | 2019-04-01 |          5 |               24
12 | 2019-05-01 |         21 |               45
12 | 2019-06-01 |          0 |               45

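My cluster version is Redshift 1.0.34272.

Does this test code produce the correct answer on your cluster? If it does, then something subtle is going on with your query, your data, or your situation. If it doesn't, then I would package this up and submit a support ticket.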

====================================================

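Thinking about this some more, I came up with one way this could happen. If your ID is text and some values contain non-printing characters, those values will be treated as different partitions, so the running sum restarts on those rows. For example: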

drop table if exists test;
create table test (ID varchar(8),   item_date date, item_count int);

insert into test values 
('12', '01/01/2019', 11),
('12', '02/01/2019', 8),
('12    ', '03/01/2019', 0),
('12', '04/01/2019', 5),
('12', '05/01/2019', 21),
('12    ', '06/01/2019', 0);

select *, SUM(item_count) over (partition by ID order by item_date rows unbounded preceding) as cumulative_count
from test
order by item_date;

Now, this is just one way this could happen. I'm sure there are others.
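If hidden characters in ID are the suspect, a quick check along the following lines may help confirm it. This is only a sketch against the test table above, and it assumes that every legitimate ID consists of digits only; the '[^0-9]' pattern and the column aliases are my own choices for illustration, so adjust them to your real key format. The first query lists the rows whose ID contains anything other than digits, and the second partitions on the cleaned key so that every variant of '12' should share one running total.

```sql
-- Diagnostic sketch: find rows whose ID holds anything besides digits.
-- Assumption: a valid ID is made of digits only.
select ID,
       item_date,
       len(ID)          as id_char_len,   -- length in characters
       octet_length(ID) as id_byte_len    -- length in bytes
from test
where len(ID) <> len(regexp_replace(ID, '[^0-9]', ''))
order by item_date;

-- Possible workaround: partition on the cleaned key instead of the raw ID.
select ID,
       item_date,
       item_count,
       SUM(item_count) over (
           partition by regexp_replace(ID, '[^0-9]', '')
           order by item_date
           rows unbounded preceding
       ) as cumulative_count
from test
order by item_date;
```

If the first query returns the same rows that show the wrong cumulative_count, the partition key is the likely cause, and cleaning the column at load time is probably better than cleaning it in every query.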


sql

amazon-redshift