Lag over multiple offsets in one query
I'm trying to determine the maximum change in some time series data over several periods. Here is a sample dataset:
drop table if exists query_table ;
create temp table query_table (groupcol TEXT, parcol TEXT, daycol Integer, val Integer);
insert into query_table values
('g1', 'p1', 1, 1),
('g1', 'p1', 2, 2),
('g1', 'p1', 3, 3),
('g1', 'p1', 4, 4),
('g1', 'p2', 1, 2),
('g1', 'p2', 2, 4),
('g1', 'p2', 3, 6),
('g1', 'p2', 4, 8),
('g2', 'p1', 1, 10),
('g2', 'p1', 2, 20),
('g2', 'p1', 3, 30),
('g2', 'p1', 4, 40),
('g2', 'p2', 1, 20),
('g2', 'p2', 2, 40),
('g2', 'p2', 3, 60),
('g2', 'p2', 4, 80);
The basic query I'm running looks like this (lag of 1 day):
with
change_over_time as (
select groupcol, parcol, daycol,
(val - lag(val, 1) over (partition by groupcol, parcol order by daycol) ) as change
from query_table
),
max_change as (
select groupcol, max(abs(change)) as maxchange
from change_over_time
group by groupcol
)
select * from max_change;
The result is:
groupcol | maxchange
----------+-----------
g1 | 2
g2 | 20
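What I'm doing now is issuing this query from Python in a loop over the desired lag offsets, but the queries take a while and I'd like to do this in pure SQL. The query will run on Snowflake, and I can use Snowflake-specific extensions.
The only solution I can think of is to use Python to generate a query like this: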
with
change_over_time as (
select groupcol, parcol, daycol, 1 as days,
(val - lag(val, 1) over (partition by groupcol, parcol order by daycol) ) as change
from query_table
union all
select groupcol, parcol, daycol, 2 as days,
(val - lag(val, 2) over (partition by groupcol, parcol order by daycol) ) as change
from query_table
),
max_change as (
select groupcol, days, max(abs(change)) as maxchange
from change_over_time
group by groupcol, days
)
select * from max_change;
So I get results like this:
groupcol | days | maxchange
----------+------+-----------
g1 | 1 | 2
g2 | 1 | 20
g1 | 2 | 4
g2 | 2 | 40
Ideally, though, I'd like to run many different lags (hundreds of them, perhaps 1 to 730 days) using SQL alone, and be able to specify the lags in a clean way.
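For context, the Python-side generation described above could look roughly like the sketch below. The helper name `build_union_query` is hypothetical; it only assembles the SQL text, with one `union all` branch per lag, and leaves execution to whatever client is in use.

```python
def build_union_query(lags):
    """Return one SQL statement with a UNION ALL branch per lag offset."""
    offsets = [int(n) for n in lags]  # int() keeps arbitrary text out of the SQL
    if not offsets:
        raise ValueError("at least one lag offset is required")
    branches = [
        f"select groupcol, parcol, daycol, {n} as days,\n"
        f"(val - lag(val, {n}) over (partition by groupcol, parcol order by daycol)) as change\n"
        "from query_table"
        for n in offsets
    ]
    return (
        "with\nchange_over_time as (\n"
        + "\nunion all\n".join(branches)
        + "\n),\nmax_change as (\n"
        "select groupcol, days, max(abs(change)) as maxchange\n"
        "from change_over_time\n"
        "group by groupcol, days\n"
        ")\n"
        "select * from max_change order by days, groupcol;"
    )
```

Calling `build_union_query(range(1, 731))` would produce 730 branches, which is exactly the kind of unwieldy statement I'd like to avoid.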
You need to create a table of days that the change_over_time query can be based on. For a variable number of days (say, the number of days in the table), this can be done with a recursive CTE (https://docs.snowflake.com/en/user-guide/queries-cte.html#recursive-ctes-and-hierarchical-data). For a fixed number of days, a values clause suffices (https://docs.snowflake.com/en/sql-reference/constructs/values.html).
Here is the query with the values clause added:
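with
day_table(days) as (
select * from (values (1), (2), (3), (4))
),
change_over_time as (
-- d.days is part of the partition so that each offset lags over its own copy of the series
-- caveat: some engines require the lag offset to be a constant; if yours rejects d.days,
-- a self-join on the day difference avoids lag altogether
select t.groupcol, t.parcol, t.daycol, d.days,
(t.val - lag(t.val, d.days) over (partition by t.groupcol, t.parcol, d.days order by t.daycol) ) as change
from query_table t
cross join day_table d
),
max_change as (
select groupcol, days, max(abs(change)) as maxchange
from change_over_time
group by groupcol, days
)
select * from max_change;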
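The recursive-CTE variant mentioned above is not spelled out, so here is a minimal sketch of how the day table could be generated. It runs against SQLite through Python's `sqlite3` module purely so it can be tried locally; the function name `day_offsets` and the `max_days` parameter are illustrative assumptions, and the Snowflake syntax should be checked against the linked documentation before reusing the CTE there.

```python
import sqlite3

def day_offsets(max_days):
    """Generate the offsets 1..max_days with a recursive CTE."""
    conn = sqlite3.connect(":memory:")
    rows = conn.execute(
        """
        with recursive day_table(days) as (
            select 1                      -- anchor row: the smallest offset
            union all
            select days + 1               -- recursive step: next offset
            from day_table
            where days < ?                -- stop once max_days is reached
        )
        select days from day_table order by days
        """,
        (max_days,),
    ).fetchall()
    conn.close()
    return [days for (days,) in rows]
```

The same `day_table` CTE could then take the place of the values clause in the query above, so that the range of offsets is controlled by a single number instead of a hand-written list.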
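I'm not entirely sure I've understood your intent.
That said, I think you can get the answer without using lag at all.
Check whether the following does what you need.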
WITH
day_table(days) AS (
SELECT *
FROM (VALUES (1), (2)) AS x
)
SELECT
qt1.groupcol,
qt2.daycol - qt1.daycol AS days,
MAX(ABS(qt2.val - qt1.val)) AS maxchange
FROM
query_table qt1
JOIN query_table qt2
ON qt1.groupcol = qt2.groupcol
AND qt1.parcol = qt2.parcol
AND qt2.daycol > qt1.daycol
JOIN day_table dt
ON qt2.daycol - qt1.daycol = dt.days
GROUP BY
qt1.groupcol,
qt2.daycol - qt1.daycol
ORDER BY
groupcol,
days
Updated to add abs and to allow restricting the result to a specific range of offsets.
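As a quick check of the self-join idea, the sketch below loads the sample data from the question into an in-memory SQLite database and keeps the wanted offsets in a small `day_table`, so arbitrary (even non-contiguous) lags can be specified as a plain list. SQLite stands in for Snowflake here only to make the example runnable, and the function name `max_change` is an assumption for illustration.

```python
import sqlite3

ROWS = [
    ("g1", "p1", 1, 1), ("g1", "p1", 2, 2), ("g1", "p1", 3, 3), ("g1", "p1", 4, 4),
    ("g1", "p2", 1, 2), ("g1", "p2", 2, 4), ("g1", "p2", 3, 6), ("g1", "p2", 4, 8),
    ("g2", "p1", 1, 10), ("g2", "p1", 2, 20), ("g2", "p1", 3, 30), ("g2", "p1", 4, 40),
    ("g2", "p2", 1, 20), ("g2", "p2", 2, 40), ("g2", "p2", 3, 60), ("g2", "p2", 4, 80),
]

QUERY = """
select qt1.groupcol, dt.days, max(abs(qt2.val - qt1.val)) as maxchange
from query_table qt1
join query_table qt2
  on qt1.groupcol = qt2.groupcol
 and qt1.parcol = qt2.parcol
join day_table dt
  on qt2.daycol - qt1.daycol = dt.days
group by qt1.groupcol, dt.days
order by dt.days, qt1.groupcol
"""

def max_change(offsets):
    """Return (groupcol, days, maxchange) rows for the requested day offsets."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table query_table (groupcol text, parcol text, daycol integer, val integer)"
    )
    conn.executemany("insert into query_table values (?, ?, ?, ?)", ROWS)
    # The offsets live in their own table, so the SQL text never changes.
    conn.execute("create table day_table (days integer)")
    conn.executemany("insert into day_table values (?)", [(int(d),) for d in offsets])
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result
```

Note that the join matches rows by the difference in daycol rather than by row position, so it agrees with lag only when every series has exactly one row per day. Offsets with no matching pair of rows simply produce no output row.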