Lag over multiple offsets in one query

I am trying to determine the maximum change in some time-series data over several periods. Here is a sample dataset:

drop table if exists query_table ;
create temp table query_table (groupcol TEXT, parcol TEXT, daycol Integer, val Integer);

insert into query_table values 
    ('g1', 'p1', 1, 1),
    ('g1', 'p1', 2, 2),
    ('g1', 'p1', 3, 3),
    ('g1', 'p1', 4, 4),
    ('g1', 'p2', 1, 2),
    ('g1', 'p2', 2, 4),
    ('g1', 'p2', 3, 6),
    ('g1', 'p2', 4, 8),
    ('g2', 'p1', 1, 10),
    ('g2', 'p1', 2, 20),
    ('g2', 'p1', 3, 30),
    ('g2', 'p1', 4, 40),
    ('g2', 'p2', 1, 20),
    ('g2', 'p2', 2, 40),
    ('g2', 'p2', 3, 60),
    ('g2', 'p2', 4, 80);

The basic query I am running looks like this (with a lag of 1 day):

with
  change_over_time as (
    select groupcol, parcol, daycol,
      (val - lag(val, 1) over (partition by groupcol, parcol order by daycol) ) as change
      from query_table
  ),
  max_change as (
    select groupcol, max(abs(change)) as maxchange
    from change_over_time
    group by groupcol
  )
select * from max_change;

The result is:

 groupcol | maxchange
----------+-----------
 g1       |         2
 g2       |        20

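What I am doing right now is issuing this query once per lag offset, looping over the desired offsets in Python. These queries take some time, though, and I would like to do this in pure SQL. The query will run in Snowflake, so I can use Snowflake-specific extensions.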

The only solution I can think of is to use Python to generate a query like this:

with
  change_over_time as (
    select groupcol, parcol, daycol, 1 as days,
      (val - lag(val, 1) over (partition by groupcol, parcol order by daycol) ) as change
      from query_table
    union all
    select groupcol, parcol, daycol, 2 as days,
      (val - lag(val, 2) over (partition by groupcol, parcol order by daycol) ) as change
      from query_table
  ),
  max_change as (
    select groupcol, days, max(abs(change)) as maxchange
    from change_over_time
    group by groupcol, days
  )
select * from max_change;

So I get a result like this:

 groupcol | days | maxchange
----------+------+-----------
 g1       |    1 |         2
 g2       |    1 |        20
 g1       |    2 |         4
 g2       |    2 |        40

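Ideally, though, I would like to run many different lags (hundreds of them, perhaps 1 to 730 days) using SQL only, and to be able to specify the lags in a clean way.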

You need to create a table of days that you can base the change_over_time query on. For a variable number of days (say, the number of days in the table), this can be done with a recursive CTE (https://docs.snowflake.com/en/user-guide/queries-cte.html#recursive-ctes-and-hierarchical-data). For a fixed number of days, a values clause suffices (https://docs.snowflake.com/en/sql-reference/constructs/values.html).
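As a rough illustration of the recursive variant, the day table could be generated along these lines. This is only a sketch: the name day_table matches the query in this answer, and the upper bound of 730 is an assumption taken from the range mentioned in the question.

```sql
-- Sketch: generate one row per lag offset, 1 through 730.
-- The bound 730 is an assumption taken from the question; adjust as needed.
with recursive day_table(days) as (
    select 1                 -- anchor row: the smallest lag
    union all
    select days + 1          -- recursive step: the next lag
    from day_table
    where days < 730         -- stop once the largest lag has been produced
)
select days
from day_table
order by days;
```

The same day_table definition can be used as the first CTE of a larger query in place of a hard-coded list of values.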

Here is the query with the values clause added:

with
  day_table(days) as (
    select * from (values (1), (2), (3), (4))
  ),
  change_over_time as (
    -- days is part of the partition so that each lag is computed over its own copy of the series
    select t.groupcol, t.parcol, t.daycol, d.days,
      (t.val - lag(t.val, d.days) over (partition by t.groupcol, t.parcol, d.days order by t.daycol) ) as change
      from query_table t
      cross join day_table d
  ),
  max_change as (
    select groupcol, days, max(abs(change)) as maxchange
    from change_over_time
    group by groupcol, days
  )
select * from max_change;

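I am not quite sure I have fully understood your intent.

That said, I think you can get the answer without using lag at all.

Check whether the following does what you need.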

WITH
    day_table(days) AS (
        SELECT *
        FROM (VALUES (1), (2)) AS x
    )
SELECT
    qt1.groupcol,
    qt2.daycol - qt1.daycol     AS days,
    MAX(ABS(qt2.val - qt1.val)) AS maxchange
FROM
    query_table qt1
        JOIN query_table qt2
             ON qt1.groupcol = qt2.groupcol
                 AND qt1.parcol = qt2.parcol
                 AND qt2.daycol > qt1.daycol
        JOIN day_table dt
             ON qt2.daycol - qt1.daycol = dt.days
GROUP BY
    qt1.groupcol,
    qt2.daycol - qt1.daycol
ORDER BY
    groupcol,
    days

Updated to add abs and to allow restricting the result to a specific set of day offsets.
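If the lags of interest form a contiguous range, the day table may not be needed at all, since the range can go directly into the join condition. The following is a sketch under that assumption, with the bounds 1 and 730 taken from the question. Note that it compares daycol values rather than row positions, so it agrees with lag only when every (groupcol, parcol) series has exactly one row per day with no gaps.

```sql
-- Sketch: max absolute change per group for every day difference in a range.
-- Assumes the wanted lags are contiguous (1..730, bounds taken from the question).
select
    qt1.groupcol,
    qt2.daycol - qt1.daycol     as days,
    max(abs(qt2.val - qt1.val)) as maxchange
from
    query_table qt1
        join query_table qt2
             on qt1.groupcol = qt2.groupcol
                 and qt1.parcol = qt2.parcol
                 -- keep only pairs whose day difference is a wanted lag
                 and qt2.daycol - qt1.daycol between 1 and 730
group by
    qt1.groupcol,
    qt2.daycol - qt1.daycol
order by
    groupcol,
    days;
```

On the sample data this returns rows for days 1 to 3 only, because no two rows of a series are more than 3 days apart.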