SQL 在非 O(N^2) 时间内使用滚动构建特征 Window

Question

我在事实 table（比如说发票历史记录）的基础上构建功能，它将简单地附加到右侧。基本发票历史记录 table 可能如下所示：

|   date     |   customer   | product  | amount  | feature c-p (past 5 days) |  ...
-----------------------------------------------------------------------------------
| 2020/01/01 |      CA      |   P1     | 10      |    NA                     |
| 2020/01/02 |      CA      |   P1     | 5       |    10   = 10              |
| 2020/01/05 |      CA      |   P1     | 20      |    15   = 5 + 10          |
| 2020/01/07 |      CA      |   P1     | 20      |    25   = 20 + 5          |
                                                  (01/01 out of range above) |
| 2020/01/15 |      CA      |   P1     | 100     |    25   = 10 + 5 + 20     |
| 2020/01/17 |      CA      |   P1     | 200     |    100  = 100             |
| 2020/01/31 |      CA      |   P1     | 20      |    0    = 0               |

起初，我们编写了使用自连接的逻辑，类似于：

select 
    c.date, 
    c.customer, 
    c.product, 
    c.amount, 
    sum(c.amount2)
from
    (select 
        i1.*,
        i2.date as date2, 
        i2.amount as amount2
    from invoice i1
    inner join invoice i2
    on i1.customer = i2.customer 
    and i1.product = i2.product 
    and i1.date < i2.date and i1.date >= i2.date - 5    -- where we customize the window
    ) c   
group by 
    c.date, 
    c.customer, 
    c.product, 
    c.amount

如果我没记错的话，这个自连接本身就是O(N^2)，但是逻辑很简单，大家可以脑补一下。但直到最近，当我们开始处理一个大的 table 时，这种方法才开始流行起来。

我之前考虑过 window 函数，但我不确定是否有更高效（计算效率和存储效率）的方法？

我的想法是使用 window 函数，但看起来我的逻辑是在运行ge 上定制的，而不是固定的 N 行回溯，相反它应该回溯 5 天？ Hive/Impala可以吗，不行的话我估计得补缺的天数然后用windows 函数。接受任何建议？

（今天我们使用的是Hive/Impala，但如果其他数据库中确实有更高效的方法，我当然愿意接受）。

更新

只是运行使用2000万行真实数据的基准测试，节省的时间是可观的：

自己加入过滤：128 分钟
使用包括日期转换的 window 功能：15 分钟（Gordon 的回答），最重要的是，这种方法保证运行不会引入重复，因为同一客户和同一产品可能会被多次购买同一天
Hive 不支持内联相关子查询，但 GBM 的解决方案应该有效避免完全笛卡尔连接

Answer 1

如果你有幸运行一个支持 range 子句的数据库 window 具有间隔的功能（例如 Postgres，从版本 11 开始），你可以做：

select
    t.*,
    sum(amount) over(
        partition by customer, product
        order by date
        range between interval '5 day' preceding and interval '1 day' preceding
    ) feature_cp
from mytable t

Demo on DB Fiddle:

date       | customer | product | amount | feature_cp
:--------- | :------- | :------ | -----: | ---------:
2020-01-01 | CA       | P1      |     10 |       null
2020-01-02 | CA       | P1      |      5 |         10
2020-01-05 | CA       | P1      |     20 |         15
2020-01-07 | CA       | P1      |     20 |         25
2020-01-15 | CA       | P1      |    100 |       null
2020-01-17 | CA       | P1      |    200 |        100
2020-01-31 | CA       | P1      |     20 |       null

否则，我建议使用相关子查询。这比您的连接查询更有效，因为它避免了外部聚合的需要：

select
    t.*,
    (
        select sum(amount) 
        from mytable t1 
        where 
            t1.customer = t.customer 
            and t1.product = t.product
            and t1.date < t.date
            and t1.date >= t.date - interval '5 day'
    ) feature_cp
from mytable t

Answer 2

Hive 支持 range，但我认为只支持数字。幸运的是，您可以将日期转换为数字并继续使用它：

select t.*,
       sum(amount) over (partition by customer, product
                         order by days
                         range between 5 preceding and 1 preceding
                        )
from (select t.*,
             datediff(date, '2000-01-01') as days
      from t
     ) t;

一个问题是很难区分 2020-01-01 和 2020-01-31。这两个 return NULL。如果你真的想区分它们，那么你可以使用 lag() 和 case:

select t.*,
       (case when datediff(date, prev_date) > 5 then 0
             when prev_date is null then null
             else sum(amount) over (partition by customer, product
                                    order by days
                                    range between 5 preceding and 1 preceding
                                   )
        end)
from (select t.*,
             datediff(date, '2000-01-01') as days,
             lag(date) over (partition by customer, product order by date) as prev_date
      from t
     ) t;

SQL 在非 O(N^2) 时间内使用滚动构建特征 Window

SQL Building Features Using Rolling Window in Non O(N^2) Time

sql

hive

impala