SQLite Window functions
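Here is a simplified ER diagram of my database:
[image: simplified ER diagram]
What I want to retrieve, for each vendor_item, is:
- the highest price (excluding the latest capture)
- the lowest price (excluding the latest capture)
- the current price (i.e. the latest capture)
Here is some sample data from the PRICE_DATA table to give you an idea: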
vendor_item_id | capture_ts | price |
---|---|---|
124 | 2022-03-02 09:00:12.851043 | 46.78 |
124 | 2022-03-02 14:07:49.423343 | 42.99 |
124 | 2022-03-04 08:20:07.636140 | 43.99 |
124 | 2022-03-05 08:29:20.421764 | 42.99 |
124 | 2022-03-08 08:33:59.043372 | 42.99 |
129 | 2022-03-02 08:55:14.401816 | 21.52 |
129 | 2022-03-02 14:11:20.544427 | 25.54 |
129 | 2022-03-04 08:24:06.976667 | 25.72 |
129 | 2022-03-08 08:22:46.734662 | 30.83 |
132 | 2022-03-02 09:04:18.144494 | 41.99 |
132 | 2022-03-03 08:29:15.981712 | 42.99 |
132 | 2022-03-04 08:27:39.327779 | 41.99 |
132 | 2022-03-07 08:29:41.236009 | 42.99 |
132 | 2022-03-08 08:27:44.318570 | 40.99 |
Here is my current SQL statement:
select distinct vendor_item_id
,last_value(price) over win as curr_price
,min(price) over win as low_price
,max(price) over win as high_price
from price_data
window win as (partition by vendor_item_id
order by capture_ts
rows between unbounded preceding
and unbounded following);
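While this more or less gives me what I'm looking for, there are a couple of issues:
- The high and low prices take all records into account instead of excluding the latest capture.
- If I don't add distinct to the query, I end up with duplicate records (this is probably my fault, since I haven't fully grasped window functions).
Desired result: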
vendor_item_id | curr_price | low_price | high_price |
---|---|---|---|
124 | 42.99 | 42.99 | 46.78 |
129 | 30.83 | 21.52 | 25.72 |
132 | 40.99 | 41.99 | 42.99 |
Thanks for your help!
Use a CTE that returns the maximum capture_ts for each vendor_item_id, then get low_price and high_price with conditional aggregation:
WITH cte AS (
SELECT *, MAX(capture_ts) OVER (PARTITION BY vendor_item_id) max_capture_ts
FROM price_data
)
SELECT DISTINCT vendor_item_id,
FIRST_VALUE(price) OVER (PARTITION BY vendor_item_id ORDER BY capture_ts DESC) curr_price,
MIN(CASE WHEN capture_ts < max_capture_ts THEN price END) OVER (PARTITION BY vendor_item_id) low_price,
MAX(CASE WHEN capture_ts < max_capture_ts THEN price END) OVER (PARTITION BY vendor_item_id) high_price
FROM cte;
See the demo.
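To try the query above without a database file, here is a minimal sketch that uses Python's built-in sqlite3 module and an in-memory table loaded with the sample rows from the question. The table definition (an id primary key plus the three columns shown) and the function name price_summary are assumptions made for illustration, and the bundled SQLite must be 3.25 or newer for window functions.

```python
import sqlite3

# Sample rows copied from the question: (vendor_item_id, capture_ts, price).
ROWS = [
    (124, "2022-03-02 09:00:12.851043", 46.78),
    (124, "2022-03-02 14:07:49.423343", 42.99),
    (124, "2022-03-04 08:20:07.636140", 43.99),
    (124, "2022-03-05 08:29:20.421764", 42.99),
    (124, "2022-03-08 08:33:59.043372", 42.99),
    (129, "2022-03-02 08:55:14.401816", 21.52),
    (129, "2022-03-02 14:11:20.544427", 25.54),
    (129, "2022-03-04 08:24:06.976667", 25.72),
    (129, "2022-03-08 08:22:46.734662", 30.83),
    (132, "2022-03-02 09:04:18.144494", 41.99),
    (132, "2022-03-03 08:29:15.981712", 42.99),
    (132, "2022-03-04 08:27:39.327779", 41.99),
    (132, "2022-03-07 08:29:41.236009", 42.99),
    (132, "2022-03-08 08:27:44.318570", 40.99),
]

# The CTE tags every row with its partition's latest timestamp; the CASE
# expressions then hide the latest row from MIN/MAX.
QUERY = """
WITH cte AS (
  SELECT *, MAX(capture_ts) OVER (PARTITION BY vendor_item_id) AS max_capture_ts
  FROM price_data
)
SELECT DISTINCT vendor_item_id,
       FIRST_VALUE(price) OVER (PARTITION BY vendor_item_id ORDER BY capture_ts DESC) AS curr_price,
       MIN(CASE WHEN capture_ts < max_capture_ts THEN price END) OVER (PARTITION BY vendor_item_id) AS low_price,
       MAX(CASE WHEN capture_ts < max_capture_ts THEN price END) OVER (PARTITION BY vendor_item_id) AS high_price
FROM cte
ORDER BY vendor_item_id;
"""


def price_summary():
    """Load the sample data into an in-memory database and run the query."""
    conn = sqlite3.connect(":memory:")
    # Assumed schema: only the columns visible in the question plus an id.
    conn.execute(
        "CREATE TABLE price_data ("
        "id INTEGER PRIMARY KEY, vendor_item_id INTEGER, capture_ts TEXT, price REAL)"
    )
    conn.executemany(
        "INSERT INTO price_data (vendor_item_id, capture_ts, price) VALUES (?, ?, ?)",
        ROWS,
    )
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows


for row in price_summary():
    print(row)
```

The timestamps are stored as ISO-8601 text, so plain string comparison orders them chronologically, which is what both MAX(capture_ts) and the < comparison rely on.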
In the end I solved the problem using CTEs and regular aggregate functions:
with v_last_capture as (
select vendor_item_id
,max(capture_ts) last_capture_ts
from price_data pd
group by vendor_item_id
)
, v_curr_price as (
select pd.*
from price_data pd
inner join v_last_capture vc
on (pd.vendor_item_id = vc.vendor_item_id and
pd.capture_ts = vc.last_capture_ts)
)
, v_other_prices as (
select vendor_item_id
,min(pd.price) as min_price
,max(pd.price) as max_price
from price_data pd
where id not in (select id from v_curr_price)
group by vendor_item_id
)
select vc.id
,vc.vendor_item_id
,vc.price as curr_price
,vc.stock
,vo.min_price
,vo.max_price
from v_curr_price vc
left join v_other_prices vo on (vc.vendor_item_id = vo.vendor_item_id)
Query plan:
QUERY PLAN
|--MATERIALIZE 4
| |--SCAN TABLE price_data AS pd
| `--USE TEMP B-TREE FOR GROUP BY
|--MATERIALIZE 5
| |--SCAN TABLE price_data AS pd
| |--LIST SUBQUERY 6
| | |--MATERIALIZE 8
| | | |--SCAN TABLE price_data AS pd
| | | `--USE TEMP B-TREE FOR GROUP BY
| | |--SCAN SUBQUERY 8 AS vc
| | `--SEARCH TABLE price_data AS pd USING AUTOMATIC COVERING INDEX (vendor_item_id=? AND capture_ts=?)
| `--USE TEMP B-TREE FOR GROUP BY
|--SCAN TABLE price_data AS pd
|--SEARCH SUBQUERY 4 AS vc USING AUTOMATIC COVERING INDEX (vendor_item_id=?)
`--SEARCH SUBQUERY 5 AS vo USING AUTOMATIC COVERING INDEX (vendor_item_id=?)
The other answer works just as well (and its query is more concise). Here is the query plan for that query:
QUERY PLAN
|--CO-ROUTINE 3
| |--CO-ROUTINE 4
| | |--CO-ROUTINE 1
| | | |--CO-ROUTINE 5
| | | | |--SCAN TABLE price_data
| | | | `--USE TEMP B-TREE FOR ORDER BY
| | | `--SCAN SUBQUERY 5
| | |--SCAN SUBQUERY 1
| | `--USE TEMP B-TREE FOR ORDER BY
| |--SCAN SUBQUERY 4
| `--USE TEMP B-TREE FOR ORDER BY
|--SCAN SUBQUERY 3
`--USE TEMP B-TREE FOR DISTINCT
You can use a FILTER clause on the window aggregates to drop the last row, which satisfies the "excluding the latest capture" requirement:
select distinct
p.vendor_item_id
,last_value(p.price) over vendor_item as curr_price
,min(price) filter (where p.capture_ts < latest.capture_ts) over vendor_item as low_price
,max(price) filter (where p.capture_ts < latest.capture_ts) over vendor_item as high_price
from
price_data p
inner join (
select vendor_item_id, max(capture_ts) capture_ts from price_data group by vendor_item_id
) latest on latest.vendor_item_id = p.vendor_item_id
window
vendor_item as (
partition by p.vendor_item_id
order by p.capture_ts
rows between unbounded preceding and unbounded following
);
Result:
124 42.99 42.99 46.78
129 30.83 21.52 25.72
132 40.99 41.99 42.99
I assume capture_ts is unique per vendor_item_id; otherwise you will have to come up with a smarter filter.
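If capture_ts can repeat within a vendor_item_id, one possible "smarter filter" is to rank the rows with ROW_NUMBER() and break ties on another column. The sketch below is an assumption rather than part of the answer: it uses the id column as the tie-breaker, the function name tie_safe_summary and the demo rows are made up for illustration, and it swaps the windowed FILTER aggregates for a plain GROUP BY with CASE expressions, which also makes DISTINCT unnecessary.

```python
import sqlite3

# rn = 1 marks exactly one "latest" row per vendor_item_id, even when two
# rows share the same capture_ts (the higher id wins the tie).
TIE_SAFE_QUERY = """
WITH ranked AS (
  SELECT vendor_item_id, price,
         ROW_NUMBER() OVER (PARTITION BY vendor_item_id
                            ORDER BY capture_ts DESC, id DESC) AS rn
  FROM price_data
)
SELECT vendor_item_id,
       MAX(CASE WHEN rn = 1 THEN price END) AS curr_price,
       MIN(CASE WHEN rn > 1 THEN price END) AS low_price,
       MAX(CASE WHEN rn > 1 THEN price END) AS high_price
FROM ranked
GROUP BY vendor_item_id
ORDER BY vendor_item_id;
"""


def tie_safe_summary(rows):
    """rows: iterable of (id, vendor_item_id, capture_ts, price) tuples."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE price_data ("
        "id INTEGER PRIMARY KEY, vendor_item_id INTEGER, capture_ts TEXT, price REAL)"
    )
    conn.executemany("INSERT INTO price_data VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(TIE_SAFE_QUERY).fetchall()
    conn.close()
    return result


# Hypothetical data: ids 2 and 3 were captured at the same instant.
DEMO_ROWS = [
    (1, 1, "2022-03-02 09:00:00", 10.0),
    (2, 1, "2022-03-08 08:00:00", 12.0),
    (3, 1, "2022-03-08 08:00:00", 11.0),
]

print(tie_safe_summary(DEMO_ROWS))  # [(1, 11.0, 10.0, 12.0)]
```

With the capture_ts < latest.capture_ts filter, both rows sharing the latest timestamp would be excluded from low_price and high_price; here only the row with id 3 is treated as the current capture.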
Query plan against the bare price_data table, with no indexes defined:
QUERY PLAN
|--CO-ROUTINE 3
| |--MATERIALIZE 1
| | |--SCAN TABLE price_data
| | `--USE TEMP B-TREE FOR GROUP BY
| |--SCAN TABLE price_data AS p
| |--SEARCH SUBQUERY 1 AS latest USING AUTOMATIC COVERING INDEX (vendor_item_id=?)
| `--USE TEMP B-TREE FOR ORDER BY
|--SCAN SUBQUERY 3
`--USE TEMP B-TREE FOR DISTINCT
After defining a covering index (create index ix_price_data on price_data (vendor_item_id, capture_ts, price)), things get a bit simpler:
QUERY PLAN
|--CO-ROUTINE 3
| |--MATERIALIZE 1
| | `--SCAN TABLE price_data USING COVERING INDEX ix_price_data
| |--SCAN SUBQUERY 1 AS latest
| |--SEARCH TABLE price_data AS p USING COVERING INDEX ix_price_data (vendor_item_id=?)
| `--USE TEMP B-TREE FOR ORDER BY
|--SCAN SUBQUERY 3
`--USE TEMP B-TREE FOR DISTINCT
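To reproduce a plan like the one above on your own machine, here is a small sketch using Python's sqlite3 module. The table definition (including the stock column) is an assumption, since the thread does not show the schema, and explain_with_index is a made-up helper name. The exact wording of the plan differs between SQLite versions, so treat the output as indicative only.

```python
import sqlite3

# The FILTER-based query from this answer, unchanged apart from layout.
FILTER_QUERY = """
SELECT DISTINCT
       p.vendor_item_id,
       last_value(p.price) OVER vendor_item AS curr_price,
       min(price) FILTER (WHERE p.capture_ts < latest.capture_ts) OVER vendor_item AS low_price,
       max(price) FILTER (WHERE p.capture_ts < latest.capture_ts) OVER vendor_item AS high_price
FROM price_data p
INNER JOIN (
  SELECT vendor_item_id, max(capture_ts) AS capture_ts
  FROM price_data
  GROUP BY vendor_item_id
) latest ON latest.vendor_item_id = p.vendor_item_id
WINDOW vendor_item AS (
  PARTITION BY p.vendor_item_id
  ORDER BY p.capture_ts
  ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
);
"""


def explain_with_index():
    """Return the EXPLAIN QUERY PLAN detail lines with the covering index in place."""
    conn = sqlite3.connect(":memory:")
    # Assumed schema; only vendor_item_id, capture_ts and price matter here.
    conn.execute(
        "CREATE TABLE price_data ("
        "id INTEGER PRIMARY KEY, vendor_item_id INTEGER, "
        "capture_ts TEXT, price REAL, stock INTEGER)"
    )
    conn.execute(
        "CREATE INDEX ix_price_data ON price_data (vendor_item_id, capture_ts, price)"
    )
    plan = conn.execute("EXPLAIN QUERY PLAN " + FILTER_QUERY).fetchall()
    conn.close()
    # Each plan row is (id, parent, notused, detail); keep the readable part.
    return [row[3] for row in plan]


for line in explain_with_index():
    print(line)
```

Dropping the CREATE INDEX statement and running the helper again is a quick way to compare the two plans side by side.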
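Since a covering index increases the size of the database (after all, all of the data exists a second time as a copy inside the index), you may decide to re-create price_data as a clustered index instead, i.e. create the table WITHOUT ROWID and declare vendor_item_id, capture_ts as the primary key. You can then also drop the id column, which is useless at that point.
That way you get the same performance as with the explicit index, but without increasing the size of the database (in fact the table should become noticeably smaller, because the rowid is gone). The query plan stays the same.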
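As a rough sketch of that layout, the snippet below creates price_data as a WITHOUT ROWID table with a composite primary key and runs the FILTER query against it. The column types and the helper name build_clustered_table are assumptions; only the sample rows for vendor_item_id 129 are loaded to keep it short.

```python
import sqlite3

# Assumed column types; the primary key doubles as the clustered index.
CREATE_SQL = """
CREATE TABLE price_data (
  vendor_item_id INTEGER NOT NULL,
  capture_ts     TEXT    NOT NULL,
  price          REAL    NOT NULL,
  PRIMARY KEY (vendor_item_id, capture_ts)
) WITHOUT ROWID;
"""

SUMMARY_SQL = """
SELECT DISTINCT
       p.vendor_item_id,
       last_value(p.price) OVER w AS curr_price,
       min(p.price) FILTER (WHERE p.capture_ts < latest.capture_ts) OVER w AS low_price,
       max(p.price) FILTER (WHERE p.capture_ts < latest.capture_ts) OVER w AS high_price
FROM price_data p
INNER JOIN (
  SELECT vendor_item_id, max(capture_ts) AS capture_ts
  FROM price_data
  GROUP BY vendor_item_id
) latest ON latest.vendor_item_id = p.vendor_item_id
WINDOW w AS (
  PARTITION BY p.vendor_item_id
  ORDER BY p.capture_ts
  ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
);
"""

# Sample rows for vendor_item_id 129, copied from the question.
ROWS_129 = [
    (129, "2022-03-02 08:55:14.401816", 21.52),
    (129, "2022-03-02 14:11:20.544427", 25.54),
    (129, "2022-03-04 08:24:06.976667", 25.72),
    (129, "2022-03-08 08:22:46.734662", 30.83),
]


def build_clustered_table(rows):
    """Create the WITHOUT ROWID table in memory and load the given rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(CREATE_SQL)
    conn.executemany("INSERT INTO price_data VALUES (?, ?, ?)", rows)
    return conn


conn = build_clustered_table(ROWS_129)
print(conn.execute(SUMMARY_SQL).fetchall())  # [(129, 30.83, 21.52, 25.72)]
```

A side effect of this design is that the primary key enforces the uniqueness of capture_ts per vendor_item_id that the filter relies on: inserting a second row with the same pair raises sqlite3.IntegrityError.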