Take first, second, third ... last value and select rows (window function with FILTER and LAG)

I want to run a window function with a FILTER clause, for example:

LAG("date", 1) FILTER (WHERE type='A') OVER (PARTITION BY id ORDER BY id ASC) AS "A_lag_1"

However, Postgres does not support this, and I can't work out how else to do it. Details below.

Challenge

Input tab_A:

+----+------+------+
| id | type | date |
+----+------+------+
|  1 | A    |   30 |
|  1 | A    |   25 |
|  1 | A    |   20 |
|  1 | B    |   29 |
|  1 | B    |   28 |
|  1 | B    |   21 |
|  1 | C    |   24 |
|  1 | C    |   22 |
+----+------+------+
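
To reproduce the example, the input can be loaded with a script along these lines. The column types are assumptions (integer id and "date", text type); the question shows only the values.

```sql
-- Assumed schema: the column types are not given in the question.
CREATE TABLE tab_A (
    id     integer,
    type   text,
    "date" integer
);

-- The eight sample rows from the input table above.
INSERT INTO tab_A (id, type, "date") VALUES
    (1, 'A', 30),
    (1, 'A', 25),
    (1, 'A', 20),
    (1, 'B', 29),
    (1, 'B', 28),
    (1, 'B', 21),
    (1, 'C', 24),
    (1, 'C', 22);
```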

Desired output:

+----+------+------+---------+---------+---------+---------+---------+---------+
| id | type | date | A_lag_1 | A_lag_2 | B_lag_1 | B_lag_2 | C_lag_1 | C_lag_2 |
+----+------+------+---------+---------+---------+---------+---------+---------+
|  1 | A    |   30 |      25 |      20 |      29 |      28 |      24 |      22 |
|  1 | A    |   25 |      20 |         |      21 |         |      24 |      22 |
|  1 | A    |   20 |         |         |         |         |         |         |
|  1 | B    |   29 |      25 |      20 |      28 |      21 |      24 |      22 |
|  1 | B    |   28 |      25 |      20 |      21 |         |      24 |      22 |
|  1 | B    |   21 |      20 |         |         |         |         |         |
|  1 | C    |   24 |      20 |         |      21 |         |      22 |         |
|  1 | C    |   22 |      20 |         |      21 |         |         |         |
+----+------+------+---------+---------+---------+---------+---------+---------+

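In other words: for each row, look at the rows of the same id with an earlier date and, for every type, put the most recent of those dates in X_lag_1 and the second most recent in X_lag_2.

The example above is heavily simplified; my real use case has many more id values, more types, and more lag columns (A_lag_X).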

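Possible solution

This challenge seems like a very good fit for a window function, because I want to keep the same number of rows as tab_A and append information that relates to each row but lies in the past.

So I built it with window functions (sqlfiddle):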

SELECT
  id, type, "date",
  LAG("date", 1) FILTER (WHERE type='A') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC) AS "A_lag_1",
  LAG("date", 2) FILTER (WHERE type='A') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC) AS "A_lag_2",
  LAG("date", 1) FILTER (WHERE type='B') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC) AS "B_lag_1",
  LAG("date", 2) FILTER (WHERE type='B') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC) AS "B_lag_2",
  LAG("date", 1) FILTER (WHERE type='C') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC) AS "C_lag_1",
  LAG("date", 2) FILTER (WHERE type='C') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC) AS "C_lag_2"
FROM tab_A

However, I get the following error:

ERROR: FILTER is not implemented for non-aggregate window functions
Position: 30

Although this error is mentioned in the documentation, I can't work out another way to do it.

Any help would be much appreciated.


Answers:

You can try the following approach.

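-- Note: this answer names the date column dateVAL; the question calls it "date".
SELECT
  dt.*,
  -- Second lag: the largest value of each type strictly below that type's first lag
  (SELECT MAX(b.dateVAL) FROM tab_A b WHERE b.id = dt.id AND b.type = 'A' AND dt.A_lag_1 > b.dateVAL) AS "A_lag_2",
  (SELECT MAX(b.dateVAL) FROM tab_A b WHERE b.id = dt.id AND b.type = 'B' AND dt.B_lag_1 > b.dateVAL) AS "B_lag_2",
  (SELECT MAX(b.dateVAL) FROM tab_A b WHERE b.id = dt.id AND b.type = 'C' AND dt.C_lag_1 > b.dateVAL) AS "C_lag_2"
FROM
(
  SELECT
    a.id, a.type, a.dateVAL,
    -- First lag: the largest value of each type strictly below the current row's value
    (SELECT MAX(b.dateVAL) FROM tab_A b WHERE b.id = a.id AND b.type = 'A' AND a.dateVAL > b.dateVAL) AS A_lag_1,
    (SELECT MAX(b.dateVAL) FROM tab_A b WHERE b.id = a.id AND b.type = 'B' AND a.dateVAL > b.dateVAL) AS B_lag_1,
    (SELECT MAX(b.dateVAL) FROM tab_A b WHERE b.id = a.id AND b.type = 'C' AND a.dateVAL > b.dateVAL) AS C_lag_1
  FROM tab_A a
) dt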

Here is the Fiddle link. This may not be the most efficient approach.
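
The nested MAX pattern needs one more level of subquery for every further lag. A variant that may generalise more easily picks the N-th earlier row directly with ORDER BY ... OFFSET. This is a sketch, not a tested replacement: it assumes the question's column name "date" and that dates are distinct within an id and type.

```sql
SELECT
    a.id, a.type, a."date",
    -- OFFSET 0 is the most recent earlier row of the type, OFFSET 1 the one before it.
    -- A scalar subquery that finds no row evaluates to NULL.
    (SELECT b."date" FROM tab_A b
      WHERE b.id = a.id AND b.type = 'A' AND b."date" < a."date"
      ORDER BY b."date" DESC OFFSET 0 LIMIT 1) AS "A_lag_1",
    (SELECT b."date" FROM tab_A b
      WHERE b.id = a.id AND b.type = 'A' AND b."date" < a."date"
      ORDER BY b."date" DESC OFFSET 1 LIMIT 1) AS "A_lag_2",
    (SELECT b."date" FROM tab_A b
      WHERE b.id = a.id AND b.type = 'B' AND b."date" < a."date"
      ORDER BY b."date" DESC OFFSET 0 LIMIT 1) AS "B_lag_1",
    (SELECT b."date" FROM tab_A b
      WHERE b.id = a.id AND b.type = 'B' AND b."date" < a."date"
      ORDER BY b."date" DESC OFFSET 1 LIMIT 1) AS "B_lag_2"
    -- type 'C' and further lags follow the same pattern
FROM tab_A a
ORDER BY a.id, a.type, a."date" DESC;
```

An index on (id, type, "date") would likely let each subquery be answered with a short index scan, although that has not been measured here.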

Another possible solution is to use a lateral join (fiddle):

SELECT
    a.id,
    a.type,
    a."date",
    c.nn_row,
    c.type,
    c."date" as "date_joined"
FROM tab_A AS a
LEFT JOIN LATERAL (
    SELECT
        type,
        "date",
        row_number() OVER (PARTITION BY id, type ORDER BY id ASC, "date" DESC) as nn_row
    FROM tab_A AS b
    WHERE a."date" > b."date"
) AS c on true
WHERE c.nn_row <= 5

This produces a long table such as:

+----+------+------+--------+------+-------------+
| id | type | date | nn_row | type | date_joined |
+----+------+------+--------+------+-------------+
|  1 | A    |   30 |      1 | A    |          25 |
|  1 | A    |   30 |      2 | A    |          20 |
|  1 | A    |   30 |      1 | B    |          29 |
|  1 | A    |   30 |      2 | B    |          28 |
|  1 | A    |   30 |      3 | B    |          21 |
|  1 | A    |   30 |      1 | C    |          24 |
|  1 | A    |   30 |      2 | C    |          22 |
|  1 | A    |   25 |      1 | A    |          20 |
|  1 | A    |   25 |      1 | B    |          21 |
|  1 | A    |   25 |      1 | C    |          24 |
|  1 | A    |   25 |      2 | C    |          22 |
|  1 | B    |   29 |      1 | A    |          25 |
|  1 | B    |   29 |      2 | A    |          20 |
|  1 | B    |   29 |      1 | B    |          28 |
|  1 | B    |   29 |      2 | B    |          21 |
|  1 | B    |   29 |      1 | C    |          24 |
|  1 | B    |   29 |      2 | C    |          22 |
|  1 | B    |   28 |      1 | A    |          25 |
|  1 | B    |   28 |      2 | A    |          20 |
|  1 | B    |   28 |      1 | B    |          21 |
|  1 | B    |   28 |      1 | C    |          24 |
|  1 | B    |   28 |      2 | C    |          22 |
|  1 | B    |   21 |      1 | A    |          20 |
|  1 | C    |   24 |      1 | A    |          20 |
|  1 | C    |   24 |      1 | B    |          21 |
|  1 | C    |   24 |      1 | C    |          22 |
|  1 | C    |   22 |      1 | A    |          20 |
|  1 | C    |   22 |      1 | B    |          21 |
+----+------+------+--------+------+-------------+

After that you can pivot to the desired output.
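
One way to do that pivot is conditional aggregation with FILTER, which Postgres does support on ordinary aggregates. The sketch below is not the original fiddle: it adds an id match inside the lateral subquery, moves the row-number limit into the join condition so that rows without any earlier match are kept, and assumes (id, type, "date") identifies a row. The names lagged_long and joined_type are made up for the illustration.

```sql
WITH lagged_long AS (
    SELECT a.id, a.type, a."date",
           c.nn_row, c.type AS joined_type, c."date" AS date_joined
    FROM tab_A AS a
    LEFT JOIN LATERAL (
        -- earlier rows of the same id, numbered per type from most recent to oldest
        SELECT b.type, b."date",
               row_number() OVER (PARTITION BY b.type ORDER BY b."date" DESC) AS nn_row
        FROM tab_A AS b
        WHERE b.id = a.id AND b."date" < a."date"
    ) AS c ON c.nn_row <= 2   -- in the ON clause, so unmatched rows survive the LEFT JOIN
)
SELECT id, type, "date",
       -- each FILTER selects at most one long row, so max() just extracts its value
       max(date_joined) FILTER (WHERE joined_type = 'A' AND nn_row = 1) AS "A_lag_1",
       max(date_joined) FILTER (WHERE joined_type = 'A' AND nn_row = 2) AS "A_lag_2",
       max(date_joined) FILTER (WHERE joined_type = 'B' AND nn_row = 1) AS "B_lag_1",
       max(date_joined) FILTER (WHERE joined_type = 'B' AND nn_row = 2) AS "B_lag_2",
       max(date_joined) FILTER (WHERE joined_type = 'C' AND nn_row = 1) AS "C_lag_1",
       max(date_joined) FILTER (WHERE joined_type = 'C' AND nn_row = 2) AS "C_lag_2"
FROM lagged_long
GROUP BY id, type, "date"
ORDER BY id, type, "date" DESC;
```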

However, while this worked for me on a small sample, on the full table Postgres ran out of disk space (even though I had 50 GB available):

ERROR: could not write to hash-join temporary file: No space left on device

I have posted this solution here because it may work for others with smaller tables.

Since the FILTER clause does work for aggregate functions, I decided to write my own.

----- N = 1
-- State transition function
-- agg_state: the current state, el: new element
create or replace function lag_agg_sfunc_1(agg_state point, el float)
    returns point
    immutable
    language plpgsql
    as $$
declare
    i integer;
    stored_value float;
begin
    i := agg_state[0];
    stored_value := agg_state[1];

    i := i + 1; -- First row i=1
    if i = 1 then
        stored_value := el;
    end if;
    return point(i, stored_value);
end;
$$;

-- Final function
--DROP FUNCTION lag_agg_ffunc_1(point) CASCADE;
create or replace function lag_agg_ffunc_1(agg_state point)
    returns float
    immutable
    strict
    language plpgsql
    as $$
begin
  return agg_state[1];
end;
$$;

-- Aggregate function
drop aggregate if exists lag_agg_1(double precision);
create aggregate lag_agg_1 (float) (
    sfunc = lag_agg_sfunc_1,
    stype = point,
    finalfunc = lag_agg_ffunc_1,
    initcond = '(0,0)'
);


----- N = 2
-- State transition function
-- agg_state: the current state, el: new element
create or replace function lag_agg_sfunc_2(agg_state point, el float)
    returns point
    immutable
    language plpgsql
    as $$
declare
    i integer;
    stored_value float;
begin
    i := agg_state[0];
    stored_value := agg_state[1];

    i := i + 1; -- First row i=1
    if i = 2 then
        stored_value := el;
    end if;
    return point(i, stored_value);
end;
$$;

-- Final function
--DROP FUNCTION lag_agg_ffunc_2(point) CASCADE;
create or replace function lag_agg_ffunc_2(agg_state point)
    returns float
    immutable
    strict
    language plpgsql
    as $$
begin
  return agg_state[1];
end;
$$;

-- Aggregate function
drop aggregate if exists lag_agg_2(double precision);
create aggregate lag_agg_2 (float) (
    sfunc = lag_agg_sfunc_2,
    stype = point,
    finalfunc = lag_agg_ffunc_2,
    initcond = '(0,0)'
);
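
A quick way to see what these aggregates compute is to call them as plain ordered aggregates on a few literal values. This check is an illustration added here, assuming the definitions above have been created; the aliases nth_1 and nth_2 are arbitrary.

```sql
-- Fed 30, 25, 20 in that order: lag_agg_1 keeps the 1st input, lag_agg_2 the 2nd.
SELECT lag_agg_1(x ORDER BY x DESC) AS nth_1,   -- 30
       lag_agg_2(x ORDER BY x DESC) AS nth_2    -- 25
FROM (VALUES (30::float), (25), (20)) AS t(x);

-- With a single input there is no 2nd value, so the 0 from initcond '(0,0)' comes back.
SELECT lag_agg_2(x) AS nth_2                    -- 0
FROM (VALUES (30::float)) AS t(x);
```

The second query shows why a missing value surfaces as 0 rather than NULL.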

You can use the aggregate functions lag_agg_1 and lag_agg_2 defined above with the window expression from the original question:

SELECT
  id, type, "date",
  NULLIF(lag_agg_1("date") FILTER (WHERE type='A') OVER w, 0) AS "A_lag_1",
  NULLIF(lag_agg_2("date") FILTER (WHERE type='A') OVER w, 0) AS "A_lag_2",
  NULLIF(lag_agg_1("date") FILTER (WHERE type='B') OVER w, 0) AS "B_lag_1",
  NULLIF(lag_agg_2("date") FILTER (WHERE type='B') OVER w, 0) AS "B_lag_2",
  NULLIF(lag_agg_1("date") FILTER (WHERE type='C') OVER w, 0) AS "C_lag_1",
  NULLIF(lag_agg_2("date") FILTER (WHERE type='C') OVER w, 0) AS "C_lag_2"
FROM tab_A
-- One shared window: for each row, the frame is every later row in descending date order
WINDOW w AS (PARTITION BY id ORDER BY id ASC, "date" DESC ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING)
ORDER BY id ASC, type, "date" DESC

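This runs quite fast compared with the other options. Some things that could be improved:

  • I couldn't work out how to handle NULLs properly, so in the end I fake the result by converting every 0 to NULL. This causes problems in some cases, for example when 0 is a legitimate value.
  • I simply copied and pasted the functions for each lag_X, because I couldn't work out how to parameterise them.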
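
Both points might be sidestepped with the built-in array_agg, which also accepts FILTER when used as a window aggregate: collect the earlier dates of one type into an array and pick the N-th element by subscript. An out-of-range subscript, or a frame with no matching rows, yields NULL rather than 0. This is a sketch under assumptions: dates are distinct within an id, the array elements arrive in window order, and it has not been benchmarked against the custom aggregates.

```sql
SELECT
  id, type, "date",
  -- element [N] is the N-th most recent earlier date of that type, or NULL if absent
  (array_agg("date") FILTER (WHERE type = 'A') OVER w)[1] AS "A_lag_1",
  (array_agg("date") FILTER (WHERE type = 'A') OVER w)[2] AS "A_lag_2",
  (array_agg("date") FILTER (WHERE type = 'B') OVER w)[1] AS "B_lag_1",
  (array_agg("date") FILTER (WHERE type = 'B') OVER w)[2] AS "B_lag_2",
  (array_agg("date") FILTER (WHERE type = 'C') OVER w)[1] AS "C_lag_1",
  (array_agg("date") FILTER (WHERE type = 'C') OVER w)[2] AS "C_lag_2"
FROM tab_A
-- same frame as the custom-aggregate query: all later rows in descending date order
WINDOW w AS (PARTITION BY id ORDER BY "date" DESC
             ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING)
ORDER BY id, type, "date" DESC;
```

Adding a further lag is then a matter of changing the subscript, with no new function to define.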

Any help with the above would be much appreciated.