Hive rolling average when some days are missing

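I'm using Hive to process a huge dataset and am trying to compute a rolling average over the past week. If one day's data is missing, the average should cover only the 6 days that are present. A self join takes too long, so I tried a window function.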

For example:

Select date,avg(volume) over (order by date ROWS between 6 preceding AND current row) as Moving_AVG
From job_history;

Is there any way to do this with Hive window functions?

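Use RANGE BETWEEN 6 PRECEDING AND CURRENT ROW instead of ROWS. A ROWS frame counts physical rows, so it reaches further back in time whenever days are missing. A RANGE frame is bounded by the value of the ORDER BY column (here, 6 days before the current date), so missing days simply contribute nothing.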


select      date
           ,volume
           ,avg (volume) over 
            (   
                order by    date
                range       between 6 preceding and current row
            ) as moving_avg 
            
from        job_history
;

Demo

create table job_history (date date,volume int);

insert into job_history values 
    ('2017-01-01', 1),('2017-01-02', 2),('2017-01-05', 3),('2017-01-06', 4),('2017-01-08', 5)
   ,('2017-01-09', 6),('2017-01-10', 7),('2017-01-10', 8),('2017-01-10', 9),('2017-01-11',10)
   ,('2017-01-11',11),('2017-01-12',12),('2017-01-13',13),('2017-01-14',14),('2017-01-17',15)
;   

select * from job_history
;

+------------------+--------------------+
| job_history.date | job_history.volume |
+------------------+--------------------+
| 2017-01-01       | 1                  |
+------------------+--------------------+
| 2017-01-02       | 2                  |
+------------------+--------------------+
| 2017-01-05       | 3                  |
+------------------+--------------------+
| 2017-01-06       | 4                  |
+------------------+--------------------+
| 2017-01-08       | 5                  |
+------------------+--------------------+
| 2017-01-09       | 6                  |
+------------------+--------------------+
| 2017-01-10       | 7                  |
+------------------+--------------------+
| 2017-01-10       | 8                  |
+------------------+--------------------+
| 2017-01-10       | 9                  |
+------------------+--------------------+
| 2017-01-11       | 10                 |
+------------------+--------------------+
| 2017-01-11       | 11                 |
+------------------+--------------------+
| 2017-01-12       | 12                 |
+------------------+--------------------+
| 2017-01-13       | 13                 |
+------------------+--------------------+
| 2017-01-14       | 14                 |
+------------------+--------------------+
| 2017-01-17       | 15                 |
+------------------+--------------------+

select      date
           ,volume
           ,avg (volume) over 
            (   
                order by    date
                range       between 6 preceding and current row
            ) as moving_avg 
            
from        job_history
;

+------------+--------+------------+
| date       | volume | moving_avg |
+------------+--------+------------+
| 2017-01-01 | 1      | 1.0        |
+------------+--------+------------+
| 2017-01-02 | 2      | 1.5        |
+------------+--------+------------+
| 2017-01-05 | 3      | 2.0        |
+------------+--------+------------+
| 2017-01-06 | 4      | 2.5        |
+------------+--------+------------+
| 2017-01-08 | 5      | 3.5        |
+------------+--------+------------+
| 2017-01-09 | 6      | 4.5        |
+------------+--------+------------+
| 2017-01-10 | 8      | 6.0        |
+------------+--------+------------+
| 2017-01-10 | 9      | 6.0        |
+------------+--------+------------+
| 2017-01-10 | 7      | 6.0        |
+------------+--------+------------+
| 2017-01-11 | 10     | 7.0        |
+------------+--------+------------+
| 2017-01-11 | 11     | 7.0        |
+------------+--------+------------+
| 2017-01-12 | 12     | 8.0        |
+------------+--------+------------+
| 2017-01-13 | 13     | 9.0        |
+------------+--------+------------+
| 2017-01-14 | 14     | 9.5        |
+------------+--------+------------+
| 2017-01-17 | 15     | 12.5       |
+------------+--------+------------+
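To see why RANGE and ROWS differ on this data, here is a minimal Python sketch (not Hive) that mimics the two frame types on the demo rows above. The helper names `rows_avg` and `range_avg` are made up for illustration, and the sketch assumes that a RANGE offset on a date column is counted in days, which is what the output above suggests.

```python
from datetime import date, timedelta

# Demo rows from the job_history table above: (date, volume)
job_history = [
    (date(2017, 1, 1), 1), (date(2017, 1, 2), 2), (date(2017, 1, 5), 3),
    (date(2017, 1, 6), 4), (date(2017, 1, 8), 5), (date(2017, 1, 9), 6),
    (date(2017, 1, 10), 7), (date(2017, 1, 10), 8), (date(2017, 1, 10), 9),
    (date(2017, 1, 11), 10), (date(2017, 1, 11), 11), (date(2017, 1, 12), 12),
    (date(2017, 1, 13), 13), (date(2017, 1, 14), 14), (date(2017, 1, 17), 15),
]

def rows_avg(data, preceding=6):
    """Mimic ROWS BETWEEN n PRECEDING AND CURRENT ROW: the frame is the
    current row plus up to n physical rows before it, whatever their dates."""
    ordered = sorted(data)
    result = []
    for i, (d, _) in enumerate(ordered):
        frame = [v for _, v in ordered[max(0, i - preceding): i + 1]]
        result.append((d, sum(frame) / len(frame)))
    return result

def range_avg(data, preceding_days=6):
    """Mimic RANGE BETWEEN n PRECEDING AND CURRENT ROW: the frame is every
    row whose date lies within n days before the current row's date."""
    ordered = sorted(data)
    result = []
    for d, _ in ordered:
        lower = d - timedelta(days=preceding_days)
        frame = [v for other, v in ordered if lower <= other <= d]
        result.append((d, sum(frame) / len(frame)))
    return result

# Last row (2017-01-17): ROWS reaches back 6 rows to 2017-01-10,
# RANGE only reaches back 6 days to 2017-01-11.
print(rows_avg(job_history)[-1][1])   # 12.0
print(range_avg(job_history)[-1][1])  # 12.5
```

The RANGE variant reproduces the moving_avg column shown above, while the ROWS variant pulls in a row from 2017-01-10 that is more than 6 days old.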

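As David mentioned above, you can use RANGE BETWEEN. However, there is no INTERVAL function for the frame bounds. Instead, we can convert the date to a Unix timestamp first and express the bounds in seconds.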

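    -- 604800 s = 7 days, 86400 s = 1 day: the frame covers the 7 days
    -- before the current day and excludes the current day itself
    AVG(metric) 
    OVER (PARTITION BY author_id
    ORDER BY unix_timestamp(to_date(date))
    RANGE BETWEEN 604800 PRECEDING AND 86400 PRECEDING
         ) AS metric_rolling_7_day_average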
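The bounds in that frame are plain seconds, so it may help to check which days they actually select. The following Python sketch (again not Hive) is only an illustration: `to_ts` and `trailing_avg` are hypothetical helpers, the sample values are invented, and timestamps are taken at midnight UTC, whereas Hive's `unix_timestamp` works in the session time zone.

```python
import calendar
from datetime import date

DAY = 86400  # seconds in one day

def to_ts(d):
    """Midnight-UTC Unix timestamp for a date (assumed stand-in for
    unix_timestamp(to_date(date)) in the query above)."""
    return calendar.timegm(d.timetuple())

def trailing_avg(data, current, lo=7 * DAY, hi=1 * DAY):
    """Mimic RANGE BETWEEN lo PRECEDING AND hi PRECEDING on a timestamp key.
    Returns None for an empty frame, where SQL AVG would yield NULL."""
    cur = to_ts(current)
    frame = [v for d, v in data if cur - lo <= to_ts(d) <= cur - hi]
    return sum(frame) / len(frame) if frame else None

# Hypothetical sample values for a single author_id partition.
metrics = [
    (date(2017, 1, 8), 5), (date(2017, 1, 9), 6),
    (date(2017, 1, 10), 7), (date(2017, 1, 17), 15),
]

# For 2017-01-17 the frame spans 2017-01-10 through 2017-01-16,
# so only the value 7 falls inside it; the current day's 15 is excluded.
print(trailing_avg(metrics, date(2017, 1, 17)))  # 7.0
# Nothing precedes 2017-01-08 in the sample, so the frame is empty.
print(trailing_avg(metrics, date(2017, 1, 8)))   # None
```

If the current day should be included, as in the first answer, the upper bound would be CURRENT ROW rather than 86400 PRECEDING.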