SQL 在非 O(N^2) 时间内使用滚动构建特征 Window

SQL Building Features Using Rolling Window in Non O(N^2) Time

我在事实 table(比如说发票历史记录)的基础上构建功能,它将简单地附加到右侧。基本发票历史记录 table 可能如下所示:

|   date     |   customer   | product  | amount  | feature c-p (past 5 days) |  ...
-----------------------------------------------------------------------------------
| 2020/01/01 |      CA      |   P1     | 10      |    NA                     |
| 2020/01/02 |      CA      |   P1     | 5       |    10   = 10              |
| 2020/01/05 |      CA      |   P1     | 20      |    15   = 5 + 10          |
| 2020/01/07 |      CA      |   P1     | 20      |    25   = 20 + 5          |
                                                  (01/01 out of range above) |
| 2020/01/15 |      CA      |   P1     | 100     |    25   = 10 + 5 + 20     |
| 2020/01/17 |      CA      |   P1     | 200     |    100  = 100             |
| 2020/01/31 |      CA      |   P1     | 20      |    0    = 0               |

起初,我们编写了使用自连接的逻辑,类似于:

select 
    c.date, 
    c.customer, 
    c.product, 
    c.amount, 
    sum(c.amount2)
from
    (select 
        i1.*,
        i2.date as date2, 
        i2.amount as amount2
    from invoice i1
    inner join invoice i2
    on i1.customer = i2.customer 
    and i1.product = i2.product 
    and i1.date < i2.date and i1.date >= i2.date - 5    -- where we customize the window
    ) c   
group by 
    c.date, 
    c.customer, 
    c.product, 
    c.amount

如果我没记错的话,这个自连接本身就是O(N^2),但是逻辑很简单,大家可以脑补一下。但直到最近,当我们开始处理一个大的 table 时,这种方法才开始流行起来。

我之前考虑过 window 函数,但我不确定是否有更高效(计算效率和存储效率)的方法?

我的想法是使用 window 函数,但看起来我的逻辑是在 运行ge 上定制的,而不是固定的 N 行回溯,相反它应该回溯 5 天? Hive/Impala可以吗,不行的话我估计得补缺的天数然后用windows 函数。接受任何建议?

(今天我们使用的是Hive/Impala,但如果其他数据库中确实有更高效的方法,我当然愿意接受)。


更新

只是运行使用2000万行真实数据的基准测试,节省的时间是可观的:

如果你有幸 运行 一个支持 range 子句的数据库 window 具有间隔的功能(例如 Postgres,从版本 11 开始),你可以做:

select
    t.*,
    sum(amount) over(
        partition by customer, product
        order by date
        range between interval '5 day' preceding and interval '1 day' preceding
    ) feature_cp
from mytable t

Demo on DB Fiddle:

date       | customer | product | amount | feature_cp
:--------- | :------- | :------ | -----: | ---------:
2020-01-01 | CA       | P1      |     10 |       null
2020-01-02 | CA       | P1      |      5 |         10
2020-01-05 | CA       | P1      |     20 |         15
2020-01-07 | CA       | P1      |     20 |         25
2020-01-15 | CA       | P1      |    100 |       null
2020-01-17 | CA       | P1      |    200 |        100
2020-01-31 | CA       | P1      |     20 |       null

否则,我建议使用相关子查询。这比您的连接查询更有效,因为它避免了外部聚合的需要:

select
    t.*,
    (
        select sum(amount) 
        from mytable t1 
        where 
            t1.customer = t.customer 
            and t1.product = t.product
            and t1.date < t.date
            and t1.date >= t.date - interval '5 day'
    ) feature_cp
from mytable t

Hive 支持 range,但我认为只支持数字。幸运的是,您可以将日期转换为数字并继续使用它:

select t.*,
       sum(amount) over (partition by customer, product
                         order by days
                         range between 5 preceding and 1 preceding
                        )
from (select t.*,
             datediff(date, '2000-01-01') as days
      from t
     ) t;

一个问题是很难区分 2020-01-01 和 2020-01-31。这两个 return NULL。如果你真的想区分它们,那么你可以使用 lag()case:

select t.*,
       (case when datediff(date, prev_date) > 5 then 0
             when prev_date is null then null
             else sum(amount) over (partition by customer, product
                                    order by days
                                    range between 5 preceding and 1 preceding
                                   )
        end)
from (select t.*,
             datediff(date, '2000-01-01') as days,
             lag(date) over (partition by customer, product order by date) as prev_date
      from t
     ) t;