Redshift duplicated rows count mismatch using CTE due to table primary key configuration

It looks like I've run into a Redshift bug/inconsistency. I'll first explain my original problem, and then include a reproducible example below.

Original problem

I have a table in Redshift with many columns and some duplicated rows. I tried to determine the number of unique rows using a CTE and two different approaches: DISTINCT and GROUP BY.
The GROUP BY approach looks like this:

WITH duplicated_rows as 
(SELECT *, COUNT(*) AS q
FROM my_schema.my_table
GROUP BY  1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 
16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 
31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 
46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 
61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 
76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 
91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104)
---
SELECT COUNT(*) count_unique_rows, SUM(q) count_total_rows
FROM duplicated_rows

With this query I got the following result:

count_unique_rows | count_total_rows
------------------------------------
      27          |        83

Then I used the DISTINCT approach:

WITH unique_rows as 
(SELECT DISTINCT *
FROM my_schema.my_table)
---
SELECT COUNT(*) as count_unique_rows
FROM unique_rows

And I got this result:

count_unique_rows 
-----------------
      63

So the CTE with GROUP BY seems to indicate 27 unique rows, while the CTE with DISTINCT shows 63 unique rows.
As a next troubleshooting step, I executed the GROUP BY outside of the CTE, and it produced 63 rows!
I also exported the 83 original rows to Excel and applied the Remove Duplicates function, which left 63 rows, so that seems to be the correct number.
What I can't understand is where the number 27 comes from when I combine the CTE with GROUP BY.
Is there some limitation of CTEs in Redshift that I'm not aware of? Is it a bug in my code? Is it a bug in Redshift?
Any help clarifying this mystery will be greatly appreciated!!

Reproducible example

Create and populate the table

create table my_schema.students
(name VARCHAR(100),
day DATE,
course VARCHAR(100),
country VARCHAR(100),
address VARCHAR(100),
age INTEGER,
PRIMARY KEY (name));

INSERT INTO my_schema.students 
VALUES
('Alan', '2000-07-15', 'Physics', 'CA', '12th Street', NULL),
('Alan', '2021-01-15', 'Math', 'USA', '8th Avenue', 21),
('Jane', '2021-01-16', 'Chemistry', 'USA', NULL, 21),
('Jane', '2021-01-16', 'Chemistry', 'USA', NULL, 21),
('Patrick', '2021-07-16', 'Chemistry', NULL, NULL, 21),
('Patrick', '2021-07-16', 'Chemistry', NULL, NULL, 21),
('Kate', '2018-07-20', 'Literature', 'AR', '8th and 23th', 18),
('Kate', '2021-10-20', 'Philosophy', 'ES', NULL, 30);

Count the unique rows using a CTE with GROUP BY

WITH duplicated_rows as 
(SELECT *, COUNT(*) AS q
FROM my_schema.students
GROUP BY  1, 2, 3, 4, 5, 6)
---
SELECT COUNT(*) count_unique_rows, SUM(q) count_total_rows
FROM duplicated_rows

The result is incorrect!

count_unique_rows | count_total_rows
-------------------------------------
      4           |         8

Count the unique rows using a CTE with DISTINCT

WITH unique_rows as 
(SELECT DISTINCT *
FROM my_schema.students)
---
SELECT COUNT(*) as count_unique_rows 
FROM unique_rows

The result is correct!

count_unique_rows
-----------------
      6
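Consistent with the troubleshooting step from the original problem, running the same aggregation directly rather than inside a CTE returns the correct number of groups. A sketch against the example table:

```sql
-- Same GROUP BY over all columns, but executed outside a CTE:
SELECT name, day, course, country, address, age, COUNT(*) AS q
FROM my_schema.students
GROUP BY 1, 2, 3, 4, 5, 6;
-- Returns 6 rows (one per distinct combination of all columns),
-- matching the DISTINCT count above.
```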

The heart of the problem seems to be the primary key, which Redshift does not enforce but instead appears to use as a shortcut when evaluating row distinctness inside the CTE, leading to the inconsistent results.

The strange behavior is caused by this line:

PRIMARY KEY (name)

From Defining table constraints - Amazon Redshift:

Uniqueness, primary key, and foreign key constraints are informational only; they are not enforced by Amazon Redshift. Nonetheless, primary keys and foreign keys are used as planning hints and they should be declared if your ETL process or some other process in your application enforces their integrity.

For example, the query planner uses primary and foreign keys in certain statistical computations. It does this to infer uniqueness and referential relationships that affect subquery decorrelation techniques. By doing this, it can order large numbers of joins and eliminate redundant joins.

The planner leverages these key relationships, but it assumes that all keys in Amazon Redshift tables are valid as loaded. If your application allows invalid foreign keys or primary keys, some queries could return incorrect results. For example, a SELECT DISTINCT query might return duplicate rows if the primary key is not unique. Do not define key constraints for your tables if you doubt their validity. On the other hand, you should always declare primary and foreign keys and uniqueness constraints when you know that they are valid.

In your sample data, the primary key clearly cannot be name, because multiple rows share the same name. This violates the assumption Redshift makes and can lead to incorrect results.
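The violation is easy to see directly. A quick check against the sample data:

```sql
-- Names that appear in more than one row, contradicting PRIMARY KEY (name):
SELECT name, COUNT(*) AS occurrences
FROM my_schema.students
GROUP BY name
HAVING COUNT(*) > 1;
```

On a database that actually enforces constraints, this would list the offending names; note that on Redshift itself, given the planner behavior described in the quoted documentation, even a check like this could in principle be skewed by the invalid key hint, which underscores the point.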

If you remove the PRIMARY KEY (name) line, the results are correct.
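A sketch of the corrected DDL, with the misleading constraint simply omitted:

```sql
CREATE TABLE my_schema.students
(name VARCHAR(100),   -- no PRIMARY KEY: Redshift would treat it as a
 day DATE,            -- uniqueness hint that the data does not satisfy
 course VARCHAR(100),
 country VARCHAR(100),
 address VARCHAR(100),
 age INTEGER);
```

For an existing table, the constraint can alternatively be dropped with ALTER TABLE ... DROP CONSTRAINT, looking up the autogenerated constraint name in the system catalogs first.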

(FYI, I discovered this by running your commands against a PostgreSQL database on sqlfiddle.com. It refused to insert the data because it violates the PRIMARY KEY constraint.)