Take the first, second, third ... last value and select rows (window function with FILTER and LAG)
I would like to run a window function with a FILTER clause, for example:
LAG("date", 1) FILTER (WHERE type='A') OVER (PARTITION BY id ORDER BY id ASC) AS "A_lag_1"
However, Postgres does not support this, and I cannot work out how else to do it. Details below.
Challenge
Input tab_A:
+----+------+------+
| id | type | date |
+----+------+------+
| 1 | A | 30 |
| 1 | A | 25 |
| 1 | A | 20 |
| 1 | B | 29 |
| 1 | B | 28 |
| 1 | B | 21 |
| 1 | C | 24 |
| 1 | C | 22 |
+----+------+------+
Desired output:
+----+------+------+---------+---------+---------+---------+---------+---------+
| id | type | date | A_lag_1 | A_lag_2 | B_lag_1 | B_lag_2 | C_lag_1 | C_lag_2 |
+----+------+------+---------+---------+---------+---------+---------+---------+
| 1 | A | 30 | 25 | 20 | 29 | 28 | 24 | 22 |
| 1 | A | 25 | 20 | | 21 | | 24 | 22 |
| 1 | A | 20 | | | | | | |
| 1 | B | 29 | 25 | 20 | 28 | 21 | 24 | 22 |
| 1 | B | 28 | 25 | 20 | 21 | | 24 | 22 |
| 1 | B | 21 | 20 | | | | | |
| 1 | C | 24 | 20 | | 21 | | 22 | |
| 1 | C | 22 | 20 | | 21 | | | |
+----+------+------+---------+---------+---------+---------+---------+---------+
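In other words:
- For each row, select all rows that occurred before it (see the date column)
- Then, for each type ('A', 'B', 'C'), put the most recent date in A_lag_1 and the second most recent (by date) value of type 'A' in A_lag_2, and likewise B_lag_1 and B_lag_2 for type 'B', and so on
The example above is heavily simplified; my real use case has more id values, more lag columns (A_lag_X), and more types.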
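Possible solution
This challenge seems well suited to a window function, since I want to keep the same number of rows as tab_A and append information that is related to each row but lies in its past.
So I built it with window functions (sqlfiddle):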
SELECT
  id, type, "date",
  LAG("date", 1) FILTER (WHERE type='A') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC) AS "A_lag_1",
  LAG("date", 2) FILTER (WHERE type='A') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC) AS "A_lag_2",
  LAG("date", 1) FILTER (WHERE type='B') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC) AS "B_lag_1",
  LAG("date", 2) FILTER (WHERE type='B') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC) AS "B_lag_2",
  LAG("date", 1) FILTER (WHERE type='C') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC) AS "C_lag_1",
  LAG("date", 2) FILTER (WHERE type='C') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC) AS "C_lag_2"
FROM tab_A
However, I get the following error:
ERROR: FILTER is not implemented for non-aggregate window functions
Position: 30
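Although this error is mentioned in the documentation, I cannot work out another way to do it.
Any help would be much appreciated.
Other questions:
- 1. This answer relies on an aggregate function such as max. However, that does not work when trying to retrieve the second-to-last row, the third-to-last row, and so on.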
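To make that limitation concrete, here is a minimal sketch of the aggregate-based approach. It assumes the sample data above, inlined as a CTE named tab_A with an integer "date" column, and a frame that covers only the rows after the current one when sorting by "date" descending. Because max is an aggregate, FILTER is accepted and the lag-1 columns come out as desired, but there is no equivalent built-in call for the second most recent value.

```sql
-- Sample data from the question, inlined so the query runs on its own.
WITH tab_A (id, type, "date") AS (
  VALUES (1, 'A', 30), (1, 'A', 25), (1, 'A', 20),
         (1, 'B', 29), (1, 'B', 28), (1, 'B', 21),
         (1, 'C', 24), (1, 'C', 22)
)
SELECT
  id, type, "date",
  -- max() is an aggregate, so FILTER is allowed here (unlike with LAG).
  max("date") FILTER (WHERE type = 'A') OVER w AS "A_lag_1",
  max("date") FILTER (WHERE type = 'B') OVER w AS "B_lag_1",
  max("date") FILTER (WHERE type = 'C') OVER w AS "C_lag_1"
FROM tab_A
-- The frame starts one row after the current row, i.e. at strictly later
-- positions in the descending sort, which are the earlier dates.
WINDOW w AS (PARTITION BY id ORDER BY "date" DESC
             ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING)
ORDER BY id, type, "date" DESC;
```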
You can try the following approach.
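-- The date column is called dateVAL in this query. The b.id = ... conditions
-- keep every lookup within the same id.
SELECT
  dt.*,
  -- second most recent value: the largest value strictly below the lag-1 value
  (SELECT MAX(b.dateVAL) FROM tab_A b WHERE b.id = dt.id AND b.type = 'A' AND dt.A_lag_1 > b.dateVAL) AS "A_lag_2",
  (SELECT MAX(b.dateVAL) FROM tab_A b WHERE b.id = dt.id AND b.type = 'B' AND dt.B_lag_1 > b.dateVAL) AS "B_lag_2",
  (SELECT MAX(b.dateVAL) FROM tab_A b WHERE b.id = dt.id AND b.type = 'C' AND dt.C_lag_1 > b.dateVAL) AS "C_lag_2"
FROM
(
  SELECT
    a.id, a.type, a.dateVAL,
    -- most recent value per type that is strictly earlier than the current row
    (SELECT MAX(b.dateVAL) FROM tab_A b WHERE b.id = a.id AND b.type = 'A' AND a.dateVAL > b.dateVAL) AS A_lag_1,
    (SELECT MAX(b.dateVAL) FROM tab_A b WHERE b.id = a.id AND b.type = 'B' AND a.dateVAL > b.dateVAL) AS B_lag_1,
    (SELECT MAX(b.dateVAL) FROM tab_A b WHERE b.id = a.id AND b.type = 'C' AND a.dateVAL > b.dateVAL) AS C_lag_1
  FROM tab_A a
) dt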
Here is the fiddle link. This may not be the most efficient approach.
Another possible solution is to use a lateral join (fiddle):
SELECT
  a.id,
  a.type,
  a."date",
  c.nn_row,
  c.type,
  c."date" as "date_joined"
FROM tab_A AS a
LEFT JOIN LATERAL (
  SELECT
    type,
    "date",
    row_number() OVER (PARTITION BY id, type ORDER BY id ASC, "date" DESC) as nn_row
  FROM tab_A AS b
  WHERE a."date" > b."date"
) AS c on true
WHERE c.nn_row <= 5
This creates a very long table, for example:
+----+------+------+--------+------+-------------+
| id | type | date | nn_row | type | date_joined |
+----+------+------+--------+------+-------------+
| 1 | A | 30 | 1 | A | 25 |
| 1 | A | 30 | 2 | A | 20 |
| 1 | A | 30 | 1 | B | 29 |
| 1 | A | 30 | 2 | B | 28 |
| 1 | A | 30 | 3 | B | 21 |
| 1 | A | 30 | 1 | C | 24 |
| 1 | A | 30 | 2 | C | 22 |
| 1 | A | 25 | 1 | A | 20 |
| 1 | A | 25 | 1 | B | 21 |
| 1 | A | 25 | 1 | C | 24 |
| 1 | A | 25 | 2 | C | 22 |
| 1 | B | 29 | 1 | A | 25 |
| 1 | B | 29 | 2 | A | 20 |
| 1 | B | 29 | 1 | B | 28 |
| 1 | B | 29 | 2 | B | 21 |
| 1 | B | 29 | 1 | C | 24 |
| 1 | B | 29 | 2 | C | 22 |
| 1 | B | 28 | 1 | A | 25 |
| 1 | B | 28 | 2 | A | 20 |
| 1 | B | 28 | 1 | B | 21 |
| 1 | B | 28 | 1 | C | 24 |
| 1 | B | 28 | 2 | C | 22 |
| 1 | B | 21 | 1 | A | 20 |
| 1 | C | 24 | 1 | A | 20 |
| 1 | C | 24 | 1 | B | 21 |
| 1 | C | 24 | 1 | C | 22 |
| 1 | C | 22 | 1 | A | 20 |
| 1 | C | 22 | 1 | B | 21 |
+----+------+------+--------+------+-------------+
After that you can pivot to the desired output.
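The pivot step could look roughly like the sketch below. It is not part of the original answer and makes a few assumptions: the sample data is inlined as a CTE named tab_A, the combination (id, type, "date") is unique, and a b.id = a.id condition is added so that rows are only matched within the same id. max(...) FILTER is used purely to pick the single value that falls into each (type, nn_row) cell.

```sql
WITH tab_A (id, type, "date") AS (
  VALUES (1, 'A', 30), (1, 'A', 25), (1, 'A', 20),
         (1, 'B', 29), (1, 'B', 28), (1, 'B', 21),
         (1, 'C', 24), (1, 'C', 22)
)
SELECT
  a.id, a.type, a."date",
  max(c."date") FILTER (WHERE c.type = 'A' AND c.nn_row = 1) AS "A_lag_1",
  max(c."date") FILTER (WHERE c.type = 'A' AND c.nn_row = 2) AS "A_lag_2",
  max(c."date") FILTER (WHERE c.type = 'B' AND c.nn_row = 1) AS "B_lag_1",
  max(c."date") FILTER (WHERE c.type = 'B' AND c.nn_row = 2) AS "B_lag_2",
  max(c."date") FILTER (WHERE c.type = 'C' AND c.nn_row = 1) AS "C_lag_1",
  max(c."date") FILTER (WHERE c.type = 'C' AND c.nn_row = 2) AS "C_lag_2"
FROM tab_A AS a
LEFT JOIN LATERAL (
  SELECT
    b.type,
    b."date",
    -- rank the earlier rows of each type, most recent first
    row_number() OVER (PARTITION BY b.type ORDER BY b."date" DESC) AS nn_row
  FROM tab_A AS b
  WHERE b.id = a.id             -- assumption: lags stay within one id
    AND b."date" < a."date"     -- only rows that occurred before the current row
) AS c ON c.nn_row <= 2         -- filtering in ON keeps rows of a with no match
GROUP BY a.id, a.type, a."date"
ORDER BY a.id, a.type, a."date" DESC;
```

Placing the nn_row condition in the ON clause instead of WHERE means a row with no earlier rows, such as (1, A, 20), still appears in the result with NULL in every lag column.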
This worked for me on a small sample, but on the full table Postgres ran out of disk space (even though I had 50GB free):
ERROR: could not write to hash-join temporary file: No space left on device
I have posted this solution here because it may work for others with smaller tables.
Since the FILTER clause does work with aggregate functions, I decided to write my own.
----- N = 1
-- State transition function
-- agg_state: the current state, el: new element
create or replace function lag_agg_sfunc_1(agg_state point, el float)
returns point
immutable
language plpgsql
as $$
declare
  i integer;
  stored_value float;
begin
  i := agg_state[0];
  stored_value := agg_state[1];
  i := i + 1; -- First row i=1
  if i = 1 then
    stored_value := el;
  end if;
  return point(i, stored_value);
end;
$$;
-- Final function
--DROP FUNCTION lag_agg_ffunc_1(point) CASCADE;
create or replace function lag_agg_ffunc_1(agg_state point)
returns float
immutable
strict
language plpgsql
as $$
begin
  return agg_state[1];
end;
$$;
-- Aggregate function
drop aggregate if exists lag_agg_1(double precision);
create aggregate lag_agg_1 (float) (
  sfunc = lag_agg_sfunc_1,
  stype = point,
  finalfunc = lag_agg_ffunc_1,
  initcond = '(0,0)'
);
----- N = 2
-- State transition function
-- agg_state: the current state, el: new element
create or replace function lag_agg_sfunc_2(agg_state point, el float)
returns point
immutable
language plpgsql
as $$
declare
  i integer;
  stored_value float;
begin
  i := agg_state[0];
  stored_value := agg_state[1];
  i := i + 1; -- First row i=1
  if i = 2 then
    stored_value := el;
  end if;
  return point(i, stored_value);
end;
$$;
-- Final function
--DROP FUNCTION lag_agg_ffunc_2(point) CASCADE;
create or replace function lag_agg_ffunc_2(agg_state point)
returns float
immutable
strict
language plpgsql
as $$
begin
  return agg_state[1];
end;
$$;
-- Aggregate function
drop aggregate if exists lag_agg_2(double precision);
create aggregate lag_agg_2 (float) (
  sfunc = lag_agg_sfunc_2,
  stype = point,
  finalfunc = lag_agg_ffunc_2,
  initcond = '(0,0)'
);
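The two definitions differ only in the constant that i is compared against. One possible way to avoid the duplication, sketched here under assumptions and not taken from the original answer, is to pass the position as a second aggregate argument and to keep the state in a float8 array, which can hold a NULL where point cannot. The names lag_agg, lag_agg_sfunc and lag_agg_ffunc are hypothetical, and the sketch assumes the value column is numeric (castable to float8).

```sql
-- State: a two-element array {rows seen so far, n-th value or NULL}.
create or replace function lag_agg_sfunc(agg_state float8[], el float8, n int)
returns float8[]
immutable
language plpgsql
as $$
declare
  i float8 := agg_state[1] + 1;   -- position of the current row, starting at 1
begin
  if i = n then
    return array[i, el];          -- remember the n-th value
  end if;
  return array[i, agg_state[2]];  -- keep whatever was stored before
end;
$$;

-- Final function: hand back the stored value (NULL if fewer than n rows were seen).
create or replace function lag_agg_ffunc(agg_state float8[])
returns float8
immutable
language sql
as $$ select agg_state[2] $$;

drop aggregate if exists lag_agg(float8, int);
create aggregate lag_agg (float8, int) (
  sfunc = lag_agg_sfunc,
  stype = float8[],
  finalfunc = lag_agg_ffunc,
  initcond = '{0,NULL}'           -- no rows seen yet, nothing stored
);

-- Usage with the sample data inlined as a CTE named tab_A.
WITH tab_A (id, type, "date") AS (
  VALUES (1, 'A', 30), (1, 'A', 25), (1, 'A', 20),
         (1, 'B', 29), (1, 'B', 28), (1, 'B', 21),
         (1, 'C', 24), (1, 'C', 22)
)
SELECT
  id, type, "date",
  lag_agg("date", 1) FILTER (WHERE type = 'A') OVER w AS "A_lag_1",
  lag_agg("date", 2) FILTER (WHERE type = 'A') OVER w AS "A_lag_2"
FROM tab_A
WINDOW w AS (PARTITION BY id ORDER BY "date" DESC
             ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING)
ORDER BY id, type, "date" DESC;
```

Since the initial state stores NULL instead of 0, a missing lag comes back as NULL directly and a legitimate value of 0 is preserved.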
You can use the aggregate functions lag_agg_1 and lag_agg_2 defined above with the window expression from the original question:
SELECT
  id, type, "date",
  NULLIF(lag_agg_1("date") FILTER (WHERE type='A') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING), 0) AS "A_lag_1",
  NULLIF(lag_agg_2("date") FILTER (WHERE type='A') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING), 0) AS "A_lag_2",
  NULLIF(lag_agg_1("date") FILTER (WHERE type='B') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING), 0) AS "B_lag_1",
  NULLIF(lag_agg_2("date") FILTER (WHERE type='B') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING), 0) AS "B_lag_2",
  NULLIF(lag_agg_1("date") FILTER (WHERE type='C') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING), 0) AS "C_lag_1",
  NULLIF(lag_agg_2("date") FILTER (WHERE type='C') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING), 0) AS "C_lag_2"
FROM tab_A
ORDER BY id ASC, type, "date" DESC
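Compared with the other options, this runs quite fast. A few things that could be improved:
- I could not work out how to handle NULL values properly, so in the end I faked the result by converting every 0 to NULL. This causes problems in some cases, for example when 0 is a legitimate value
- I simply copied and pasted the function for each lag_X, because I could not work out how to parameterize it
Any help with the above would be much appreciated.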
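A built-in alternative that may cover both points, offered as a sketch and not as part of the original answer: array_agg is an ordinary aggregate, so it accepts FILTER when used as a window function, and subscripting its result picks the n-th most recent value. An out-of-range subscript yields NULL, so no 0-to-NULL conversion is needed, and the lag position is just the subscript. The query assumes dates are unique within an id (ties would be ordered arbitrarily) and inlines the sample data as a CTE named tab_A.

```sql
WITH tab_A (id, type, "date") AS (
  VALUES (1, 'A', 30), (1, 'A', 25), (1, 'A', 20),
         (1, 'B', 29), (1, 'B', 28), (1, 'B', 21),
         (1, 'C', 24), (1, 'C', 22)
)
SELECT
  id, type, "date",
  -- The array holds the earlier dates of the given type, most recent first;
  -- [1] is the most recent, [2] the second most recent, and so on.
  (array_agg("date") FILTER (WHERE type = 'A') OVER w)[1] AS "A_lag_1",
  (array_agg("date") FILTER (WHERE type = 'A') OVER w)[2] AS "A_lag_2",
  (array_agg("date") FILTER (WHERE type = 'B') OVER w)[1] AS "B_lag_1",
  (array_agg("date") FILTER (WHERE type = 'B') OVER w)[2] AS "B_lag_2",
  (array_agg("date") FILTER (WHERE type = 'C') OVER w)[1] AS "C_lag_1",
  (array_agg("date") FILTER (WHERE type = 'C') OVER w)[2] AS "C_lag_2"
FROM tab_A
WINDOW w AS (PARTITION BY id ORDER BY "date" DESC
             ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING)
ORDER BY id, type, "date" DESC;
```

For the sample data, the row (1, A, 25) gets A_lag_1 = 20, B_lag_1 = 21, C_lag_1 = 24 and C_lag_2 = 22, with the remaining lag columns NULL. Because the frame start moves with every row, Postgres most likely has to rebuild the array for each row, so this may still be slow on very large partitions.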