How to find peaks and valleys in a running sequence in SQL

I have a dataset in Athena, so for this purpose you can treat it as a Postgres database. A sample of the data can be seen in a SQL Fiddle.

Here is a sample:

  create table vals (
  timestamp int,
  type varchar(25),
  val int
  );

  insert into vals(timestamp,type, val) 
  values      (10, null, 1),
              (20, null, 2),
              (39, null, 1),
              (40,'p',1),
              (50,'p',2),
              (60,'p',1),
              (70,'v',5),
              (80,'v',6),
              (90,'v',6),
              (100,'v',3),
              (110,null,3),
              (120,'v',6),
              (130,null,3),
              (140,'p',10),
              (150,'p',8),
              (160,null,3),
              (170,'p',1),
              (180,'p',2),
              (190,'p',2),
              (200,'p',1),
              (210,null,3),
              (220,'v',1),
              (230,'v',1),
              (240,'v',3),
              (250,'v',41)               

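What I want is a dataset that contains all the values but highlights the highest value in each run of consecutive 'p' rows and the lowest value in each run of consecutive 'v' rows.

So in the end I would get: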

   timestamp, type, value, is_peak
    (10, null, 1, null),
    (20, null, 2, null),
    (39, null, 1, null),
    (40,'p',1, null),
    (50,'p',2, 1),
    (60,'p',1, null),
    (70,'v',5, null),
    (80,'v',6, null),
    (90,'v',6, null),
    (100,'v',3, 1),
    (110,null,3, null),
    (120,'v',6, 1),
    (130,null,3, null),
    (140,'p',10, 1),
    (150,'p',8, null),
    (160,null,3, null),
    (170,'p',1, null),
    (180,'p',2, 1),
    (190,'p',2, null), -- either this record or 180 would be fine
    (200,'p',1, null),
    (210,null,3, null),
    (220,'v',1, 1), -- again either this or 230
    (230,'v',1, null),
    (240,'v',3, null),
    (250,'v',41, null) 

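There are many options for the type of is_peak; it would be nice if it were some kind of dense rank or an increasing number. That way I can be confident that the 'marked' row is the highest or lowest value within its consecutive set.

Good luck, and thanks for the assistance.

Note: the maximum of a peak run or the minimum of a valley run can be anywhere in the consecutive set, but as soon as the type changes we start over.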

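You can use subqueries in a CASE expression to achieve this (the example uses SQL Server syntax, with a temporary table and bracketed identifiers):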

create table #vals 
(
    [timestamp] int,
    [type] varchar(25),
    val int
);

insert into #vals ([timestamp], [type], val) 
values  (10, null, 1),
        (20, null, 2),
        (30, null, 1),
        (40,'p',1),
        (50,'p',2),
        (60,'p',1),
        (70,'v',5),
        (80,'v',6),
        (90,'v',6),
        (100,'v',3),
        (110,null,3)

select 
    r.*,
    case 
        when r.[type] = 'p' and not exists (select * from #vals c where c.[type] = r.[type] and c.val > r.val) then 1
        when r.[type] = 'v' and not exists (select * from #vals c where c.[type] = r.[type] and c.val < r.val) then 1
        else null
    end as is_peak
from #vals r

drop table #vals

Result:

/----------------------------------\
| timestamp | type | val | is_peak |
|-----------|------|-----|---------|
| 10        | NULL | 1   | NULL    |
| 20        | NULL | 2   | NULL    |
| 30        | NULL | 1   | NULL    |
| 40        | p    | 1   | NULL    |
| 50        | p    | 2   | 1       |
| 60        | p    | 1   | NULL    |
| 70        | v    | 5   | NULL    |
| 80        | v    | 6   | NULL    |
| 90        | v    | 6   | NULL    |
| 100       | v    | 3   | 1       |
| 110       | NULL | 3   | NULL    |
\----------------------------------/

Note: if several records share the same (peak) val, each of them is marked with 1 in the is_peak column.
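
One thing to check before using this on the full data: the subqueries compare each row with every row of the same type, not only with the rows of its consecutive run. The following is a minimal sketch of that behavior, assuming Python's built-in sqlite3 module as a stand-in engine (it is not part of the answer above) and the full sample data from the question:

```python
import sqlite3

# Full sample data from the question: (timestamp, type, val).
rows = [
    (10, None, 1), (20, None, 2), (39, None, 1),
    (40, "p", 1), (50, "p", 2), (60, "p", 1),
    (70, "v", 5), (80, "v", 6), (90, "v", 6), (100, "v", 3),
    (110, None, 3), (120, "v", 6), (130, None, 3),
    (140, "p", 10), (150, "p", 8), (160, None, 3),
    (170, "p", 1), (180, "p", 2), (190, "p", 2), (200, "p", 1),
    (210, None, 3),
    (220, "v", 1), (230, "v", 1), (240, "v", 3), (250, "v", 41),
]

conn = sqlite3.connect(":memory:")
conn.execute("create table vals (timestamp int, type varchar(25), val int)")
conn.executemany("insert into vals values (?, ?, ?)", rows)

# Same CASE logic as above, rewritten without temp-table syntax.
query = """
select r.timestamp,
       case
         when r.type = 'p' and not exists
              (select 1 from vals c where c.type = r.type and c.val > r.val) then 1
         when r.type = 'v' and not exists
              (select 1 from vals c where c.type = r.type and c.val < r.val) then 1
       end as is_peak
from vals r
order by r.timestamp
"""

# Timestamps of the rows that end up marked.
marked = [ts for ts, is_peak in conn.execute(query) if is_peak == 1]
print(marked)
```

In this sketch only the overall maximum of 'p' and the overall minimum of 'v' are marked, so to meet the "start over when the type changes" requirement the comparison would also have to be limited to each consecutive run.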

You can use the LEAD/LAG window functions:

SELECT *,
  CASE WHEN type = 'p' AND val>LAG(val) OVER(PARTITION BY type ORDER BY timestamp)
        AND val > LEAD(val) OVER(PARTITION BY type ORDER BY timestamp) THEN 1 
       WHEN type = 'v' AND val<LAG(val) OVER(PARTITION BY type ORDER BY timestamp)
       AND val < LEAD(val) OVER(PARTITION BY type ORDER BY timestamp) THEN 1 
  END AS is_peak
FROM vals
ORDER BY timestamp;

db<>fiddle demo

Output:

┌───────────┬───────┬──────┬─────────┐
│ timestamp │ type  │ val  │ is_peak │
├───────────┼───────┼──────┼─────────┤
│       10  │       │   1  │         │
│       20  │       │   2  │         │
│       39  │       │   1  │         │
│       40  │ p     │   1  │         │
│       50  │ p     │   2  │       1 │
│       60  │ p     │   1  │         │
│       70  │ v     │   5  │         │
│       80  │ v     │   6  │         │
│       90  │ v     │   6  │         │
│      100  │ v     │   3  │       1 │
│      110  │       │   3  │         │
│      120  │ v     │   6  │         │
│      130  │       │   3  │         │
│      140  │ p     │  10  │       1 │
│      150  │ p     │   8  │         │
└───────────┴───────┴──────┴─────────┘

Version with a WINDOW clause:

SELECT *, CASE WHEN type = 'p' AND val > LAG(val) OVER s
                AND val > LEAD(val) OVER s THEN 1 
               WHEN type = 'v' AND val < LAG(val) OVER s
                AND val < LEAD(val) OVER s THEN 1 
          END AS is_peak
FROM vals
WINDOW s AS (PARTITION BY type ORDER BY timestamp)
ORDER BY timestamp;

db<>fiddle demo2

EDIT

I think with a hopefully small change we can get timestamp 120 also, then that'll be it

SELECT *,CASE
  WHEN type IN ('p','v') AND val > LAG(val,1,0) OVER(PARTITION BY type ORDER BY timestamp)
  AND val > LEAD(val,1,0) OVER(PARTITION BY type ORDER BY timestamp) THEN 1 
  WHEN type IN ('v') AND val < LAG(val,1,0) OVER(PARTITION BY type ORDER BY timestamp)
  AND val < LEAD(val,1,0) OVER(PARTITION BY type ORDER BY timestamp) THEN 1 
 END AS is_peak
FROM vals
ORDER BY timestamp;

db<>fiddle demo3


EDIT 2:

Final solution with gaps-and-islands detection (handles plateaus):

WITH cte AS (
  SELECT *, LEAD(val,1,0) OVER(PARTITION BY type ORDER BY timestamp) AS l
  FROM vals
), cte2 AS (
  SELECT *, SUM(CASE WHEN val = l THEN 1 ELSE 0 END) OVER(PARTITION BY type ORDER BY timestamp) AS dr
  FROM cte
), cte3 AS (
  SELECT *, CASE WHEN type IN ('p') AND val > LAG(val,1) OVER(PARTITION BY type ORDER BY timestamp)
                AND val >= LEAD(val,1) OVER(PARTITION BY type ORDER BY timestamp) THEN 1 
               WHEN type IN ('v') AND val < LAG(val,1) OVER(PARTITION BY type ORDER BY timestamp)
                AND val <= LEAD(val,1) OVER(PARTITION BY type ORDER BY timestamp) THEN 1 
          END AS is_peak
  FROM cte2
)
SELECT timestamp, type, val,
     CASE WHEN is_peak = 1 THEN 1 
          WHEN EXISTS (SELECT 1 FROM cte3 cx
                       WHERE cx.is_peak = 1
                         AND cx.val = cte3.val
                         AND cx.type = cte3.type
                         AND cx.dr = cte3.dr)
              THEN 1
     END is_peak
FROM cte3
ORDER BY timestamp;

db<>fiddle demo final

Output:

┌────────────┬───────┬──────┬─────────┐
│ timestamp  │ type  │ val  │ is_peak │
├────────────┼───────┼──────┼─────────┤
│        10  │       │   1  │         │
│        20  │       │   2  │         │
│        39  │       │   1  │         │
│        40  │ p     │   1  │         │
│        50  │ p     │   2  │       1 │
│        60  │ p     │   1  │         │
│        70  │ v     │   5  │         │
│        80  │ v     │   6  │         │
│        90  │ v     │   6  │         │
│       100  │ v     │   3  │       1 │
│       110  │       │   3  │         │
│       120  │ v     │   6  │         │
│       130  │       │   3  │         │
│       140  │ p     │  10  │       1 │
│       150  │ p     │   8  │         │
│       160  │       │   3  │         │
│       170  │ p     │   1  │         │
│       180  │ p     │   2  │       1 │
│       190  │ p     │   2  │       1 │
│       200  │ p     │   1  │         │
│       210  │       │   3  │         │
│       220  │ v     │   1  │       1 │
│       230  │ v     │   1  │       1 │
│       240  │ v     │   3  │         │
│       250  │ v     │  41  │         │
└────────────┴───────┴──────┴─────────┘

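Additional note:

ISO SQL:2016 added pattern matching with MATCH_RECOGNIZE for this kind of scenario. It lets you define a regular expression over rows for a peak, such as PATTERN (STRT UP+ FLAT* DOWN+), but at the time of writing only Oracle supported it.

Related article: Modern SQL - match_recognize Regular Expressions Over Rows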
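
To give a feel for what a row pattern like STRT UP+ FLAT* DOWN+ expresses, here is a rough sketch in Python. It is not MATCH_RECOGNIZE and not part of the answer: the helper name peak_rows and the U/F/D step encoding are assumptions made purely for illustration, and it ignores the type column.

```python
import re

def peak_rows(vals):
    """Return (first, last) row indexes of each peak or peak plateau.

    Hypothetical helper: each step between neighbouring rows is encoded
    as U (up), F (flat) or D (down), then matched with a regex.
    """
    steps = "".join(
        "U" if b > a else "D" if b < a else "F"
        for a, b in zip(vals, vals[1:])
    )
    peaks = []
    # (U+)(F*)D+ plays the role of UP+ FLAT* DOWN+; STRT is the row
    # just before the first U step.
    for m in re.finditer(r"(U+)(F*)D+", steps):
        first = m.end(1)                # row reached by the last UP step
        last = first + len(m.group(2))  # rows reached by the FLAT steps
        peaks.append((first, last))
    return peaks

# The 'p' run at timestamps 170..200 has vals 1, 2, 2, 1.
print(peak_rows([1, 2, 2, 1]))
```

Like the pattern itself, this only finds peaks that are preceded by a rise and followed by a fall, so a run such as 10, 8 (timestamps 140 and 150) would not match.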

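There is a small trick that can be used for gaps-and-islands problems like this one.

By subtracting a row_number partitioned by a value from the overall row_number, you get a number that stays the same within each consecutive run of that value, which can serve as a kind of ranking.

For some purposes this method has drawbacks, but it works for this case.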
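
To see what the subtraction produces, here is a minimal sketch that prints only the row-number difference for the sample data. It assumes Python's built-in sqlite3 module as a stand-in engine (not used in the answer itself), and the alias grp is an illustrative name:

```python
import sqlite3

# Full sample data from the question: (timestamp, type, val).
rows = [
    (10, None, 1), (20, None, 2), (39, None, 1),
    (40, "p", 1), (50, "p", 2), (60, "p", 1),
    (70, "v", 5), (80, "v", 6), (90, "v", 6), (100, "v", 3),
    (110, None, 3), (120, "v", 6), (130, None, 3),
    (140, "p", 10), (150, "p", 8), (160, None, 3),
    (170, "p", 1), (180, "p", 2), (190, "p", 2), (200, "p", 1),
    (210, None, 3),
    (220, "v", 1), (230, "v", 1), (240, "v", 3), (250, "v", 41),
]

conn = sqlite3.connect(":memory:")
conn.execute("create table vals (timestamp int, type varchar(25), val int)")
conn.executemany("insert into vals values (?, ?, ?)", rows)

# rn1 numbers all rows; rn2 numbers rows within each type.
# Their difference is constant inside one consecutive run of a type.
islands = conn.execute("""
select timestamp, type, rn1 - rn2 as grp
from (
  select timestamp, type,
         row_number() over (order by timestamp) as rn1,
         row_number() over (partition by type order by timestamp) as rn2
  from vals
) q
order by timestamp
""").fetchall()

for ts, typ, grp in islands:
    print(ts, typ, grp)

# Number of distinct (type, grp) pairs, i.e. consecutive runs.
run_count = len({(typ, grp) for _, typ, grp in islands})
```

In this data the difference alone is not unique: timestamps 140 and 150 ('p') get the same number as 160 (null). That is why the difference has to be combined with the type when it is used as a partition key.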

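Once the ranking is calculated, it can be used by other window functions in the outer query.
For that we can use row_number again. Depending on what you need, you could use DENSE_RANK or the MIN and MAX window functions instead.

Then we wrap them in a CASE to apply different logic depending on the type.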

select timestamp, type, val, 
(case 
 when type = 'v' and row_number() over (partition by (rn1-rn2), type order by val, rn1) = 1 then 1
 when type = 'p' and row_number() over (partition by (rn1-rn2), type order by val desc, rn1) = 1 then 1
 end) is_peak
-- , rn1, rn2, (rn1-rn2) as rnk
from
(
  select timestamp, type, val,
   row_number() over (order by timestamp) as rn1,
   row_number() over (partition by type order by timestamp) as rn2
  from vals
) q
order by timestamp;

You can test it in a SQL Fiddle here.

Returns:

timestamp   type    val     is_peak
---------   ----    ----    -------
10          null    1       null
20          null    2       null
39          null    1       null
40          p       1       null
50          p       2       1
60          p       1       null
70          v       5       null
80          v       6       null
90          v       6       null
100         v       3       1
110         null    3       null
120         v       6       1
130         null    3       null
140         p       10      1
150         p       8       null
160         null    3       null
170         p       1       null
180         p       2       1
190         p       2       null
200         p       1       null
210         null    3       null
220         v       1       1
230         v       1       null
240         v       3       null
250         v       41      null
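
If you would rather mark every tied row in a run instead of just one, the DENSE_RANK alternative mentioned above could look roughly like this. It is a sketch under the same assumptions as before (Python's built-in sqlite3 module as a stand-in engine, sample data from the question), not a tested Athena query:

```python
import sqlite3

# Full sample data from the question: (timestamp, type, val).
rows = [
    (10, None, 1), (20, None, 2), (39, None, 1),
    (40, "p", 1), (50, "p", 2), (60, "p", 1),
    (70, "v", 5), (80, "v", 6), (90, "v", 6), (100, "v", 3),
    (110, None, 3), (120, "v", 6), (130, None, 3),
    (140, "p", 10), (150, "p", 8), (160, None, 3),
    (170, "p", 1), (180, "p", 2), (190, "p", 2), (200, "p", 1),
    (210, None, 3),
    (220, "v", 1), (230, "v", 1), (240, "v", 3), (250, "v", 41),
]

conn = sqlite3.connect(":memory:")
conn.execute("create table vals (timestamp int, type varchar(25), val int)")
conn.executemany("insert into vals values (?, ?, ?)", rows)

# dense_rank gives tied values the same rank, so every row that shares
# the extreme value of its run gets rank 1.
query = """
select timestamp, type, val,
       case
         when type = 'v' and dense_rank() over
              (partition by rn1 - rn2, type order by val) = 1 then 1
         when type = 'p' and dense_rank() over
              (partition by rn1 - rn2, type order by val desc) = 1 then 1
       end as is_peak
from (
  select timestamp, type, val,
         row_number() over (order by timestamp) as rn1,
         row_number() over (partition by type order by timestamp) as rn2
  from vals
) q
order by timestamp
"""

# Timestamps of the rows that end up marked.
marked = [ts for ts, _, _, is_peak in conn.execute(query) if is_peak == 1]
print(marked)
```

Compared with the row_number version, the only difference in the result is that both 180 and 190, and both 220 and 230, are marked, which the question says is acceptable.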