Gap filling by group (unequal start dates)

I have a dataset in which each sku (grouped by store) has a different start date:

      date       sku     store  Units   balance
0  2019-10-01  103993.0    001    0.0     10.0
1  2019-10-02  103993.0    001    1.0      9.0
2  2019-10-04  103993.0    001    1.0      8.0


3  2019-10-02  103994.0    002    1.0     11.0
4  2019-10-04  103994.0    002    1.0     10.0
5  2019-10-05  103994.0    002    0.0     10.0

6  2019-09-30  103991.0    012    0.0     14.0
7  2019-10-02  103991.0    012    1.0     13.0
8  2019-10-04  103991.0    012    1.0     12.0
9  2019-10-05  103991.0    012    0.0     10.0

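I need to fill the date gaps from each group's own start date (these differ between groups) up to the end date, which should be the same for all products: the maximum date across all products.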

My expected output for this example is:

      date       sku     store  Units   balance
0  2019-10-01  103993.0    001    0.0     10.0
1  2019-10-02  103993.0    001    1.0      9.0
2  2019-10-03  103993.0    001    0.0      9.0
3  2019-10-04  103993.0    001    1.0      8.0
4  2019-10-05  103993.0    001    0.0      8.0

5  2019-10-02  103994.0    002    1.0     11.0
6  2019-10-03  103994.0    002    0.0     11.0
7  2019-10-04  103994.0    002    1.0     10.0
8  2019-10-05  103994.0    002    0.0     10.0

9   2019-09-30  103991.0    012    0.0     14.0
10  2019-10-01  103991.0    012    0.0     14.0
11  2019-10-02  103991.0    012    1.0     13.0
12  2019-10-03  103991.0    012    0.0     13.0
13  2019-10-04  103991.0    012    1.0     12.0
14  2019-10-05  103991.0    012    0.0     10.0

I noticed that Postgres works with TimescaleDB, which provides functions such as:

locf and time_bucket_gapfill

I tried this query, which was suggested on GitHub:

 SELECT * 
    FROM (SELECT 
        time_bucket_gapfill('1 day', date, '2019-09-30', '2019-10-05') as day, 
        sku, 
        store, 
        units,
        COALESCE(units, 0) as units_filled, 
        locf(last(balance, date)) as balance 
        FROM train
        WHERE date >= '2019-09-30' 
        GROUP BY sku, store, units, day ) f 
    WHERE balance IS NOT NULL

However, it is a bit tricky for me to get it working properly.

I would recommend:

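select gs.dte, tt.store, tt.sku, coalesce(t.units, 0) as units,
       -- on filled-in days, reuse the smallest balance seen so far;
       -- for an always-decreasing balance that is the most recent one
       coalesce(t.balance,
                min(t.balance) over (partition by tt.store, tt.sku order by gs.dte)
               ) as balance
-- tt: one row per (store, sku) with its own first date and the overall last date
from (select store, sku, min(date) as min_date,
             max(max(date)) over () as max_date
      from train
      group by store, sku
     ) tt cross join lateral
     -- one row per day between the group's start and the common end
     generate_series(tt.min_date, tt.max_date, interval '1 day') gs(dte) left join
     train t
      on tt.store = t.store and
         tt.sku = t.sku and
         t.date = gs.dte;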

This particular version assumes that balance is always decreasing (as in your sample data). If that is not the case, the logic can be adjusted.
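
One possible adjustment, sketched here under the assumption that the table is train(date, sku, store, units, balance) as in the question and that every real row has a non-NULL balance: number the rows so that each real row starts a new block, then take the single balance present in each block. The alias names x and grp are only illustrative.

```sql
-- Sketch: carry the last known balance forward without assuming it only decreases.
select dte, store, sku, units,
       -- inside one (store, sku, grp) block only the first row has a balance,
       -- so max() just picks that value for the filled-in days after it
       max(balance) over (partition by store, sku, grp) as balance
from (select gs.dte, tt.store, tt.sku,
             coalesce(t.units, 0) as units,
             t.balance,
             -- count() skips NULLs, so the counter only advances on real rows
             count(t.balance) over (partition by tt.store, tt.sku
                                    order by gs.dte) as grp
      from (select store, sku, min(date) as min_date,
                   max(max(date)) over () as max_date
            from train
            group by store, sku
           ) tt cross join lateral
           generate_series(tt.min_date, tt.max_date, interval '1 day') gs(dte) left join
           train t
            on tt.store = t.store and
               tt.sku = t.sku and
               t.date = gs.dte
     ) x
order by store, sku, dte;
```

Because each group's series starts at its own min_date, the first row of every group is a real row, so no filled-in day is left without a balance to inherit.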