How to identify rows per group before a certain value gap?

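I want to update a specific column in a table based on the difference in another column's value between neighboring rows in PostgreSQL.

Here is a test setup: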

CREATE TABLE test(
   main INTEGER,
   sub_id INTEGER,
   value_t INTEGER);

INSERT INTO test (main, sub_id, value_t)
VALUES
    (1,1,8),
    (1,2,7),
    (1,3,3),
    (1,4,85),
    (1,5,40),
    (2,1,3),
    (2,2,1),
    (2,3,1),
    (2,4,8),
    (2,5,41);

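My goal is to determine which value of diff per group main, starting from sub_id 1 and checking in ascending order of sub_id, exceeds a certain threshold (e.g. is no longer <10 and >-10). Until the threshold is reached I would like to flag every passed row AND the one row where the condition is FALSE by filling column newval with a value, e.g. 1.

Should I use a loop, or is there a smarter solution?

The task description in pseudocode: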

FOR i in GROUP [PARTITION BY main ORDER BY sub_id]:
    DO until diff > 10 OR diff <-10
        SET newval = 1 AND LEAD(newval) = 1

EXISTS on an aggregating subquery:


UPDATE test u
SET value_t = NULL
WHERE EXISTS (
        SELECT * FROM (
                SELECT main,sub_id
                , value_t , ABS(value_t - lag(value_t)
                       OVER (PARTITION BY main ORDER BY sub_id) ) AS absdiff
                FROM test
                ) x
        WHERE x.main = u.main
        AND x.sub_id <= u.sub_id
        AND x.absdiff >= 10
        )
        ;

SELECT * FROM test
ORDER BY main, sub_id;

Result:


UPDATE 3
 main | sub_id | value_t 
------+--------+---------
    1 |      1 |       8
    1 |      2 |       7
    1 |      3 |       3
    1 |      4 |        
    1 |      5 |        
    2 |      1 |       3
    2 |      2 |       1
    2 |      3 |       1
    2 |      4 |       8
    2 |      5 |        
(10 rows)

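Your question is hard to understand: the "value_t" column is irrelevant to the question, and you forgot to define the "diff" column in your SQL.

Anyway, here is your solution: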

WITH data AS (
  SELECT main, sub_id, value_t
       , abs(value_t
             - lead(value_t) OVER (PARTITION BY main ORDER BY sub_id)) > 10 is_evil
  FROM test
)
SELECT main, sub_id, value_t
     , CASE max(is_evil::int)
            OVER (PARTITION BY main ORDER BY sub_id
                  ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
       WHEN 1 THEN NULL ELSE 1 END newval
FROM data;

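I am using a CTE to prepare the data (computing whether a row is "evil"), and then the "max" window function to check, per partition, whether there were any "evil" rows before the current one.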
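To see what the CTE feeds into the outer query, it can help to run that first step on its own. The following is a minimal sketch that inlines the question's sample rows in a VALUES list so it runs without the test table; the names sample and gap are introduced here for illustration only.

```sql
-- Sample rows from the question, inlined so the query runs on its own.
WITH sample(main, sub_id, value_t) AS (
   VALUES (1,1,8), (1,2,7), (1,3,3), (1,4,85), (1,5,40)
        , (2,1,3), (2,2,1), (2,3,1), (2,4,8), (2,5,41)
)
SELECT main, sub_id, value_t
     , abs(value_t - lead(value_t) OVER w)      AS gap      -- distance to the next row
     , abs(value_t - lead(value_t) OVER w) > 10 AS is_evil  -- NULL on the last row of each partition
FROM   sample
WINDOW w AS (PARTITION BY main ORDER BY sub_id)
ORDER  BY main, sub_id;
```

For main = 1 the first row with is_evil true is sub_id 3 (|3 - 85| = 82), so the outer max(...) over ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING only turns to 1 from sub_id 4 onward.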

Basic SELECT

As fast as possible:

SELECT *, bool_and(diff BETWEEN -10 AND 10) OVER (PARTITION BY main ORDER BY sub_id) AS flag
FROM (
   SELECT *, value_t - lag(value_t, 1, value_t) OVER (PARTITION BY main ORDER BY sub_id) AS diff
   FROM   test
   ) sub;
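Fine points
  • Your mental model revolves around the window function lead(). But its counterpart lag() is a bit more efficient for this purpose, since there is no off-by-one error when including the row before the big gap. Alternatively, use lead() with inverted sort order (ORDER BY sub_id DESC).

  • To avoid NULL for the first row in the partition, provide value_t as the default in the third parameter, which makes the diff 0 instead of NULL. Both lead() and lag() have that capability.

  • diff BETWEEN -10 AND 10 is slightly faster than @diff < 11 (and clearer and more flexible, too). (@ being the "absolute value" operator, equivalent to the abs() function.)

  • bool_or() or bool_and() in the outer window function is probably the cheapest way to mark all rows up to the big gap.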
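The two alternatives from the first bullet can be compared side by side. A minimal sketch, inlining the question's sample rows in a VALUES list (the names sample, diff_lag and diff_lead are made up for this illustration, and sub_id is assumed to be unique within each main): both expressions subtract the value of the previous row by sub_id, so they should agree on every row.

```sql
WITH sample(main, sub_id, value_t) AS (
   VALUES (1,1,8), (1,2,7), (1,3,3), (1,4,85), (1,5,40)
        , (2,1,3), (2,2,1), (2,3,1), (2,4,8), (2,5,41)
)
SELECT main, sub_id, value_t
       -- previous row in ascending order; the default value_t makes the first diff 0
     , value_t - lag(value_t, 1, value_t)  OVER (PARTITION BY main ORDER BY sub_id)      AS diff_lag
       -- the "next" row in descending order is that same previous row
     , value_t - lead(value_t, 1, value_t) OVER (PARTITION BY main ORDER BY sub_id DESC) AS diff_lead
FROM   sample
ORDER  BY main, sub_id;
```

Either column can then be fed to bool_and(diff BETWEEN -10 AND 10) in an outer window function that still sorts by sub_id ascending.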

Your UPDATE

Until the threshold is reached I would like to flag every passed row AND the one row where the condition is FALSE by filling column newval with a value e.g. 1.

Again, as fast as possible.

UPDATE test AS t
SET    newval = 1
FROM  (
   SELECT main, sub_id
        , bool_and(diff BETWEEN -10 AND 10) OVER (PARTITION BY main ORDER BY sub_id) AS flag
   FROM (
      SELECT main, sub_id
           , value_t - lag(value_t, 1, value_t) OVER (PARTITION BY main ORDER BY sub_id) AS diff
      FROM   test
      ) sub
   ) u
WHERE (t.main, t.sub_id) = (u.main, u.sub_id)
AND    u.flag;
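Fine points
  • Computing all values in a single query is typically substantially faster than a correlated subquery.

  • The added WHERE condition AND u.flag makes sure we only update rows that actually need an update.
    If some of the rows may already have the right value in newval, add another clause to avoid those empty updates, too: AND t.newval IS DISTINCT FROM 1 See:

    • How do I (or can I) SELECT DISTINCT on multiple columns?
  • SET newval = 1 assigns a constant (even though we could use the actually calculated value in this case), which is a bit cheaper.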
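Putting the last two bullets together, the UPDATE with the extra guard might look like the sketch below. It assumes a newval column exists; the test setup in the question does not create one, so the sketch works on a temporary copy named test_demo (a name made up here).

```sql
-- Temporary copy of the question's data, with the newval column added.
CREATE TEMP TABLE test_demo (main int, sub_id int, value_t int, newval int);
INSERT INTO test_demo (main, sub_id, value_t) VALUES
   (1,1,8), (1,2,7), (1,3,3), (1,4,85), (1,5,40)
 , (2,1,3), (2,2,1), (2,3,1), (2,4,8), (2,5,41);

UPDATE test_demo AS t
SET    newval = 1
FROM  (
   SELECT main, sub_id
        , bool_and(diff BETWEEN -10 AND 10) OVER (PARTITION BY main ORDER BY sub_id) AS flag
   FROM  (
      SELECT main, sub_id
           , value_t - lag(value_t, 1, value_t) OVER (PARTITION BY main ORDER BY sub_id) AS diff
      FROM   test_demo
      ) sub
   ) u
WHERE (t.main, t.sub_id) = (u.main, u.sub_id)
AND    u.flag
AND    t.newval IS DISTINCT FROM 1;   -- skip rows that already hold the target value
```

Run against the sample data, the first execution should touch 7 rows (sub_id 1 to 3 for main = 1, sub_id 1 to 4 for main = 2); repeating the same statement should then report UPDATE 0, because the guard filters out rows that are already set.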

db<>fiddle here