How to find peaks and valleys in a running sequence in SQL
I have a dataset in Athena, so for this purpose you can treat it as a Postgres database. A sample of the data can be seen in a SQL Fiddle.
Here is an example:
create table vals (
timestamp int,
type varchar(25),
val int
);
insert into vals(timestamp,type, val)
values (10, null, 1),
(20, null, 2),
(39, null, 1),
(40,'p',1),
(50,'p',2),
(60,'p',1),
(70,'v',5),
(80,'v',6),
(90,'v',6),
(100,'v',3),
(110,null,3),
(120,'v',6),
(130,null,3),
(140,'p',10),
(150,'p',8),
(160,null,3),
(170,'p',1),
(180,'p',2),
(190,'p',2),
(200,'p',1),
(210,null,3),
(220,'v',1),
(230,'v',1),
(240,'v',3),
(250,'v',41)
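What I want is a dataset that contains all the values but highlights the highest value in each consecutive run of 'p' and the lowest value in each consecutive run of 'v'.
So in the end I would get: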
timestamp, type, value, is_peak
(10, null, 1, null),
(20, null, 2, null),
(39, null, 1, null),
(40,'p',1, null),
(50,'p',2, 1),
(60,'p',1, null),
(70,'v',5, null),
(80,'v',6, null),
(90,'v',6, null),
(100,'v',3, 1),
(110,null,3, null),
(120,'v',6, 1),
(130,null,3, null),
(140,'p',10, 1),
(150,'p',8, null),
(160,null,3, null),
(170,'p',1, null),
(180,'p',2, 1),
(190,'p',2, null), -- either this record or 180 would be fine
(200,'p',1, null),
(210,null,3, null),
(220,'v',1, 1), -- again either this or 230
(230,'v',1, null),
(240,'v',3, null),
(250,'v',41, null)
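There are many options for the type of is_peak; it would be nice if it were some kind of dense rank or incrementing number. That way I could be sure that the 'marked' row is the highest or lowest value within its consecutive set.
Good luck, and thanks for the help.
Note: the maximum of a peak or the minimum of a valley can be anywhere in the consecutive set, but once the type changes, we start over.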
You can use a subquery in a case statement to achieve this:
create table #vals
(
[timestamp] int,
[type] varchar(25),
val int
);
insert into #vals ([timestamp], [type], val)
values (10, null, 1),
(20, null, 2),
(30, null, 1),
(40,'p',1),
(50,'p',2),
(60,'p',1),
(70,'v',5),
(80,'v',6),
(90,'v',6),
(100,'v',3),
(110,null,3)
select
r.*,
case
when r.[type] = 'p' and not exists (select * from #vals c where c.[type] = r.[type] and c.val > r.val) then 1
when r.[type] = 'v' and not exists (select * from #vals c where c.[type] = r.[type] and c.val < r.val) then 1
else null
end as is_peak
from #vals r
drop table #vals
Result:
/----------------------------------\
| timestamp | type | val | is_peak |
|-----------|------|-----|---------|
| 10        | NULL | 1   | NULL    |
| 20        | NULL | 2   | NULL    |
| 30        | NULL | 1   | NULL    |
| 40        | p    | 1   | NULL    |
| 50        | p    | 2   | 1       |
| 60        | p    | 1   | NULL    |
| 70        | v    | 5   | NULL    |
| 80        | v    | 6   | NULL    |
| 90        | v    | 6   | NULL    |
| 100       | v    | 3   | 1       |
| 110       | NULL | 3   | NULL    |
\----------------------------------/
Note: if multiple records share the same (peak) val, each of them will be marked with 1 in the is_peak column.
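One thing worth checking about the subquery approach is its scope: the NOT EXISTS test compares a row against every row of the same type, not only against rows in the same consecutive run. The sketch below is a rough check under stated assumptions: it loads the question's full sample into an in-memory SQLite database through Python's sqlite3 module (a stand-in for Athena, Postgres or SQL Server), uses plain column names instead of the bracketed ones, and the helper name marked_timestamps is made up for illustration.

```python
import sqlite3

# Full sample data from the question: (timestamp, type, val).
ROWS = [
    (10, None, 1), (20, None, 2), (39, None, 1),
    (40, "p", 1), (50, "p", 2), (60, "p", 1),
    (70, "v", 5), (80, "v", 6), (90, "v", 6), (100, "v", 3),
    (110, None, 3), (120, "v", 6), (130, None, 3),
    (140, "p", 10), (150, "p", 8), (160, None, 3),
    (170, "p", 1), (180, "p", 2), (190, "p", 2), (200, "p", 1),
    (210, None, 3),
    (220, "v", 1), (230, "v", 1), (240, "v", 3), (250, "v", 41),
]

# Same logic as the answer's query, written without SQL Server brackets.
QUERY = """
select r.timestamp,
       case
         when r.type = 'p' and not exists
              (select 1 from vals c where c.type = r.type and c.val > r.val) then 1
         when r.type = 'v' and not exists
              (select 1 from vals c where c.type = r.type and c.val < r.val) then 1
       end as is_peak
from vals r
order by r.timestamp
"""


def marked_timestamps():
    """Return the timestamps that the subquery approach flags with is_peak = 1."""
    con = sqlite3.connect(":memory:")
    con.execute("create table vals (timestamp int, type varchar(25), val int)")
    con.executemany("insert into vals values (?, ?, ?)", ROWS)
    flagged = [ts for ts, is_peak in con.execute(QUERY) if is_peak == 1]
    con.close()
    return flagged


print(marked_timestamps())  # [140, 220, 230]
```

On the full sample only the overall maximum of 'p' (timestamp 140) and the overall minimum of 'v' (220 and 230) are flagged, so this query matches the expected output only when each type appears in a single run, as in the shortened data set used in the answer.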
You can use the LEAD/LAG window functions:
SELECT *,
CASE WHEN type = 'p' AND val>LAG(val) OVER(PARTITION BY type ORDER BY timestamp)
AND val > LEAD(val) OVER(PARTITION BY type ORDER BY timestamp) THEN 1
WHEN type = 'v' AND val<LAG(val) OVER(PARTITION BY type ORDER BY timestamp)
AND val < LEAD(val) OVER(PARTITION BY type ORDER BY timestamp) THEN 1
END AS is_peak
FROM vals
ORDER BY timestamp;
Output:
┌───────────┬──────┬─────┬─────────┐
│ timestamp │ type │ val │ is_peak │
├───────────┼──────┼─────┼─────────┤
│        10 │      │   1 │         │
│        20 │      │   2 │         │
│        39 │      │   1 │         │
│        40 │ p    │   1 │         │
│        50 │ p    │   2 │       1 │
│        60 │ p    │   1 │         │
│        70 │ v    │   5 │         │
│        80 │ v    │   6 │         │
│        90 │ v    │   6 │         │
│       100 │ v    │   3 │       1 │
│       110 │      │   3 │         │
│       120 │ v    │   6 │         │
│       130 │      │   3 │         │
│       140 │ p    │  10 │       1 │
│       150 │ p    │   8 │         │
└───────────┴──────┴─────┴─────────┘
A version with a WINDOW clause:
SELECT *, CASE WHEN type = 'p' AND val > LAG(val) OVER s
AND val > LEAD(val) OVER s THEN 1
WHEN type = 'v' AND val < LAG(val) OVER s
AND val < LEAD(val) OVER s THEN 1
END AS is_peak
FROM vals
WINDOW s AS (PARTITION BY type ORDER BY timestamp)
ORDER BY timestamp;
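To see what the strict LEAD/LAG comparison does on the complete sample (the output above stops at timestamp 150), here is a small check. It is only a sketch under assumptions: the query is run in an in-memory SQLite database via Python's sqlite3 module (SQLite 3.25 or later is assumed for window functions and the WINDOW clause), and the helper name lead_lag_marks is hypothetical.

```python
import sqlite3

# Full sample data from the question: (timestamp, type, val).
ROWS = [
    (10, None, 1), (20, None, 2), (39, None, 1),
    (40, "p", 1), (50, "p", 2), (60, "p", 1),
    (70, "v", 5), (80, "v", 6), (90, "v", 6), (100, "v", 3),
    (110, None, 3), (120, "v", 6), (130, None, 3),
    (140, "p", 10), (150, "p", 8), (160, None, 3),
    (170, "p", 1), (180, "p", 2), (190, "p", 2), (200, "p", 1),
    (210, None, 3),
    (220, "v", 1), (230, "v", 1), (240, "v", 3), (250, "v", 41),
]

# The WINDOW-clause version of the query, selecting only what the check needs.
QUERY = """
SELECT timestamp,
       CASE WHEN type = 'p' AND val > LAG(val) OVER s
                            AND val > LEAD(val) OVER s THEN 1
            WHEN type = 'v' AND val < LAG(val) OVER s
                            AND val < LEAD(val) OVER s THEN 1
       END AS is_peak
FROM vals
WINDOW s AS (PARTITION BY type ORDER BY timestamp)
ORDER BY timestamp
"""


def lead_lag_marks():
    """Return the timestamps flagged by the strict LEAD/LAG comparison."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE vals (timestamp int, type varchar(25), val int)")
    con.executemany("INSERT INTO vals VALUES (?, ?, ?)", ROWS)
    flagged = [ts for ts, is_peak in con.execute(QUERY) if is_peak == 1]
    con.close()
    return flagged


print(lead_lag_marks())  # [50, 100, 140]
```

Timestamps 50, 100 and 140 are flagged. The single-row 'v' run at 120 and the plateaus at 180/190 and 220/230 are not: a strict comparison never holds between equal neighbours, and the window is partitioned by type only, so the neighbours of 120 are the 'v' rows at 100 and 220 rather than the rows on either side of it in time.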
Edit:
I think with a hopefully small change we can get timestamp 120 also, then that'll be it
SELECT *,CASE
WHEN type IN ('p','v') AND val > LAG(val,1,0) OVER(PARTITION BY type ORDER BY timestamp)
AND val > LEAD(val,1,0) OVER(PARTITION BY type ORDER BY timestamp) THEN 1
WHEN type IN ('v') AND val < LAG(val,1,0) OVER(PARTITION BY type ORDER BY timestamp)
AND val < LEAD(val,1,0) OVER(PARTITION BY type ORDER BY timestamp) THEN 1
END AS is_peak
FROM vals
ORDER BY timestamp;
Edit 2:
Final solution with gaps-and-islands detection (handles plateaus):
WITH cte AS (
SELECT *, LEAD(val,1,0) OVER(PARTITION BY type ORDER BY timestamp) AS l
FROM vals
), cte2 AS (
SELECT *, SUM(CASE WHEN val = l THEN 1 ELSE 0 END) OVER(PARTITION BY type ORDER BY timestamp) AS dr
FROM cte
), cte3 AS (
SELECT *, CASE WHEN type IN ('p') AND val > LAG(val,1) OVER(PARTITION BY type ORDER BY timestamp)
AND val >= LEAD(val,1) OVER(PARTITION BY type ORDER BY timestamp) THEN 1
WHEN type IN ('v') AND val < LAG(val,1) OVER(PARTITION BY type ORDER BY timestamp)
AND val <= LEAD(val,1) OVER(PARTITION BY type ORDER BY timestamp) THEN 1
END AS is_peak
FROM cte2
)
SELECT timestamp, type, val,
CASE WHEN is_peak = 1 THEN 1
WHEN EXISTS (SELECT 1 FROM cte3 cx
WHERE cx.is_peak = 1
AND cx.val = cte3.val
AND cx.type = cte3.type
AND cx.dr = cte3.dr)
THEN 1
END is_peak
FROM cte3
ORDER BY timestamp;
Output:
┌───────────┬──────┬─────┬─────────┐
│ timestamp │ type │ val │ is_peak │
├───────────┼──────┼─────┼─────────┤
│        10 │      │   1 │         │
│        20 │      │   2 │         │
│        39 │      │   1 │         │
│        40 │ p    │   1 │         │
│        50 │ p    │   2 │       1 │
│        60 │ p    │   1 │         │
│        70 │ v    │   5 │         │
│        80 │ v    │   6 │         │
│        90 │ v    │   6 │         │
│       100 │ v    │   3 │       1 │
│       110 │      │   3 │         │
│       120 │ v    │   6 │         │
│       130 │      │   3 │         │
│       140 │ p    │  10 │       1 │
│       150 │ p    │   8 │         │
│       160 │      │   3 │         │
│       170 │ p    │   1 │         │
│       180 │ p    │   2 │       1 │
│       190 │ p    │   2 │       1 │
│       200 │ p    │   1 │         │
│       210 │      │   3 │         │
│       220 │ v    │   1 │       1 │
│       230 │ v    │   1 │       1 │
│       240 │ v    │   3 │         │
│       250 │ v    │  41 │         │
└───────────┴──────┴─────┴─────────┘
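Addendum:
ISO SQL:2016 added row pattern matching (MATCH_RECOGNIZE) for this kind of problem. You can define a regular expression for a peak, such as PATTERN (STRT UP+ FLAT* DOWN+), but when this was written only Oracle supported it; some other engines, such as Trino and Snowflake, have added it since.
Related article: Modern SQL - match_recognize Regular Expressions Over Rows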
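There is a trick that can be used for gaps-and-islands problems like this one.
By subtracting a row_number partitioned by a value from an overall row_number, you get a kind of ranking: the difference stays constant within each consecutive run of that value.
For some purposes this method has drawbacks.
But it works for this case.
Once the ranking is calculated, it can be used by other window functions in the outer query.
For that we can use row_number again.
But depending on what is needed, you could use DENSE_RANK or the window versions of MIN & MAX instead.
Then we wrap them in a CASE, depending on the type, to apply the different logic.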
select timestamp, type, val,
(case
when type = 'v' and row_number() over (partition by (rn1-rn2), type order by val, rn1) = 1 then 1
when type = 'p' and row_number() over (partition by (rn1-rn2), type order by val desc, rn1) = 1 then 1
end) is_peak
-- , rn1, rn2, (rn1-rn2) as rnk
from
(
select timestamp, type, val,
row_number() over (order by timestamp) as rn1,
row_number() over (partition by type order by timestamp) as rn2
from vals
) q
order by timestamp;
You can test it on SQL Fiddle here.
Returns:
timestamp  type  val  is_peak
---------  ----  ---  -------
10         null  1    null
20         null  2    null
39         null  1    null
40         p     1    null
50         p     2    1
60         p     1    null
70         v     5    null
80         v     6    null
90         v     6    null
100        v     3    1
110        null  3    null
120        v     6    1
130        null  3    null
140        p     10   1
150        p     8    null
160        null  3    null
170        p     1    null
180        p     2    1
190        p     2    null
200        p     1    null
210        null  3    null
220        v     1    1
230        v     1    null
240        v     3    null
250        v     41   null
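To make the row_number difference more concrete, and to illustrate the MIN & MAX alternative mentioned above, here is a rough sketch. Assumptions: the SQL is run in an in-memory SQLite database through Python's sqlite3 module (SQLite 3.25 or later for window functions) rather than Athena or Postgres, the column alias grp and the helper name run_islands are made up for illustration, and the MAX/MIN variant is one possible reading of that remark, not the query from this answer.

```python
import sqlite3

# Full sample data from the question: (timestamp, type, val).
ROWS = [
    (10, None, 1), (20, None, 2), (39, None, 1),
    (40, "p", 1), (50, "p", 2), (60, "p", 1),
    (70, "v", 5), (80, "v", 6), (90, "v", 6), (100, "v", 3),
    (110, None, 3), (120, "v", 6), (130, None, 3),
    (140, "p", 10), (150, "p", 8), (160, None, 3),
    (170, "p", 1), (180, "p", 2), (190, "p", 2), (200, "p", 1),
    (210, None, 3),
    (220, "v", 1), (230, "v", 1), (240, "v", 3), (250, "v", 41),
]

# rn1 - rn2 is constant within a consecutive run of one type, so together
# with type it identifies the run. MAX/MIN over that run flag every row that
# ties for the extreme value (unlike row_number, which picks a single row).
QUERY = """
select timestamp, type, val, rn1 - rn2 as grp,
       case
         when type = 'p' and val = max(val) over (partition by type, rn1 - rn2) then 1
         when type = 'v' and val = min(val) over (partition by type, rn1 - rn2) then 1
       end as is_peak
from (
  select timestamp, type, val,
         row_number() over (order by timestamp) as rn1,
         row_number() over (partition by type order by timestamp) as rn2
  from vals
) q
order by timestamp
"""


def run_islands():
    """Return (timestamp, type, val, grp, is_peak) rows ordered by timestamp."""
    con = sqlite3.connect(":memory:")
    con.execute("create table vals (timestamp int, type varchar(25), val int)")
    con.executemany("insert into vals values (?, ?, ?)", ROWS)
    rows = con.execute(QUERY).fetchall()
    con.close()
    return rows


rows = run_islands()
islands = sorted({(typ, grp) for _, typ, _, grp, _ in rows if typ is not None})
marked = [ts for ts, _, _, _, is_peak in rows if is_peak == 1]
print(islands)  # [('p', 3), ('p', 10), ('p', 11), ('v', 6), ('v', 7), ('v', 16)]
print(marked)   # [50, 100, 120, 140, 180, 190, 220, 230]
```

The six (type, grp) pairs correspond to the six runs of 'p' and 'v' in the sample. With MAX and MIN both rows of a tie are flagged (180 and 190, 220 and 230), whereas the row_number version flags only the earlier one; which behaviour is preferable depends on how is_peak is consumed downstream.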