Correct way to eliminate duplicates from a window query
I am trying to build a version history for software installations from diagnostic data received on specific dates. The data is in a PostgreSQL database:
SELECT version();
version
-------------------------------------------------------------------------------------------------------
PostgreSQL 10.14 on x86_64-pc-linux-gnu, compiled by x86_64-unknown-linux-gnu-gcc (GCC) 4.9.4, 64-bit
The table schema looks like this:
CREATE TABLE cluster_info (
    cluster_id uuid,
    date       timestamp,
    version    text,
    PRIMARY KEY (cluster_id, date)
);
The relevant data looks like this:
select cluster_id, date, version
from cluster_info
where cluster_id = 'e2865aec-0ce1-11ec-afda-0242c0a8a003'
order by date;
cluster_id | date | version
--------------------------------------+---------------------+--------------
e2865aec-0ce1-11ec-afda-0242c0a8a003 | 2019-03-15 10:30:47 | 6.0.5
e2865aec-0ce1-11ec-afda-0242c0a8a003 | 2019-05-03 20:32:33 | 6.0.5
e2865aec-0ce1-11ec-afda-0242c0a8a003 | 2019-05-08 14:57:05 | 6.0.7
e2865aec-0ce1-11ec-afda-0242c0a8a003 | 2019-05-20 16:59:45 | 6.0.7
e2865aec-0ce1-11ec-afda-0242c0a8a003 | 2019-05-21 00:21:43 | 6.0.5, 6.0.7
e2865aec-0ce1-11ec-afda-0242c0a8a003 | 2019-05-21 18:45:45 | 6.0.5, 6.0.7
e2865aec-0ce1-11ec-afda-0242c0a8a003 | 2019-05-22 20:05:10 | 6.0.5, 6.0.6
e2865aec-0ce1-11ec-afda-0242c0a8a003 | 2019-05-23 11:54:39 | 6.0.5, 6.0.6
e2865aec-0ce1-11ec-afda-0242c0a8a003 | 2019-05-24 15:01:09 | 6.0.7
e2865aec-0ce1-11ec-afda-0242c0a8a003 | 2019-05-24 19:21:14 | 6.0.7
e2865aec-0ce1-11ec-afda-0242c0a8a003 | 2019-05-28 20:06:29 | 6.0.6
e2865aec-0ce1-11ec-afda-0242c0a8a003 | 2019-07-09 05:20:32 | 6.0.6
e2865aec-0ce1-11ec-afda-0242c0a8a003 | 2019-07-11 12:05:03 | 6.0.8
e2865aec-0ce1-11ec-afda-0242c0a8a003 | 2019-07-17 17:46:10 | 6.0.8
e2865aec-0ce1-11ec-afda-0242c0a8a003 | 2019-07-24 14:44:55 | 6.0.6
e2865aec-0ce1-11ec-afda-0242c0a8a003 | 2019-07-26 14:54:33 | 6.0.6
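My first instinct was to use min and max with group by, but a cluster can be downgraded to a previous version after an upgrade. In that case I want a separate time span for each period the cluster spent on a given version, and group by cannot do that.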
I also tried the min and max window functions partitioned by version, but they did not work as I expected either:
select distinct * from (select
version,
min(date) over (partition by version),
max(date) over (partition by version)
from cluster_info
where cluster_id = 'e2865aec-0ce1-11ec-afda-0242c0a8a003'
order by date) x;
version | min | max
--------------+---------------------+---------------------
6.0.5 | 2019-03-15 10:30:47 | 2019-05-03 20:32:33
6.0.5, 6.0.6 | 2019-05-22 20:05:10 | 2019-05-23 11:54:39
6.0.5, 6.0.7 | 2019-05-21 00:21:43 | 2019-05-21 18:45:45
6.0.6 | 2019-05-28 20:06:29 | 2019-07-26 14:54:33
6.0.7 | 2019-05-08 14:57:05 | 2019-05-24 19:21:14
6.0.8 | 2019-07-11 12:05:03 | 2019-07-17 17:46:10
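What is the correct way to do this?
Edit: updated to include the PostgreSQL version and the schema, and to use a sample data set that demonstrates the downgrade problem and shows that my initial solution is incorrect.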
If version downgrades (or NULL values?) are possible, you need something more sophisticated:
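SELECT min(version) AS version, min(date), max(date)
FROM  (
   SELECT version, date
          -- running count of period starts: all rows of one uninterrupted period share a grp
        , count(*) FILTER (WHERE step IS NOT FALSE) OVER (ORDER BY date) AS grp
   FROM  (
      SELECT version, date
             -- true when the version differs from the previous row;
             -- NULL for the first row or when either version is NULL
           , lag(version) OVER (ORDER BY date) <> version AS step
      FROM   cluster_info
      WHERE  cluster_id = '0f4ce21e-0d08-11ec-b209-0242c0a8c004'
      ORDER  BY date
      ) sub1
   ) sub2
GROUP  BY grp
ORDER  BY grp;  -- list the periods in chronological order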
db<>fiddle here (sample data extended with a version downgrade and an unknown version)
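To see why this works, it can help to look at the two intermediate columns on their own. The following is only a sketch: the CTE name sample and its inline VALUES list are assumptions for illustration (six rows copied from the question's data so the query runs without the cluster_info table); the step and grp expressions are the same as in the query above.

```sql
-- Hypothetical inline sample: six rows taken from the question's data,
-- covering 6.0.6, then 6.0.8, then the downgrade back to 6.0.6.
WITH sample(date, version) AS (
   VALUES
     (timestamp '2019-05-28 20:06:29', '6.0.6')
   , (timestamp '2019-07-09 05:20:32', '6.0.6')
   , (timestamp '2019-07-11 12:05:03', '6.0.8')
   , (timestamp '2019-07-17 17:46:10', '6.0.8')
   , (timestamp '2019-07-24 14:44:55', '6.0.6')
   , (timestamp '2019-07-26 14:54:33', '6.0.6')
)
SELECT date, version, step
       -- counts every row seen so far whose step is true or NULL
     , count(*) FILTER (WHERE step IS NOT FALSE) OVER (ORDER BY date) AS grp
FROM  (
   SELECT date, version
          -- compares each row's version with the previous row's version
        , lag(version) OVER (ORDER BY date) <> version AS step
   FROM   sample
   ) sub
ORDER  BY date;
```

For these rows step comes out as NULL, false, true, false, true, false, so grp is 1, 1, 2, 2, 3, 3. The two 6.0.6 periods land in groups 1 and 3, which is why grouping by grp keeps them apart where grouping or partitioning by version merges them.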
See (with detailed explanation and more links):