Hive rolling average when some days are missing
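I am using Hive to process a very large data set and am trying to compute a rolling average over the past week. If the data for one day is missing, the average should be taken over the remaining 6 days.
A self-join takes a long time, so I tried a window function.
For example: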
Select date,avg(volume) over (order by date ROWS between 6 preceding AND current row) as Moving_AVG
From job_history;
Is there any way to do this with Hive window functions?
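Use range between 6 preceding and current row instead of rows. A ROWS frame counts physical rows, so a missing day pulls in an older row; a RANGE frame is defined on the values of the ORDER BY column, so it covers the current date and the 6 calendar days before it, and days with no data simply contribute nothing.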
select date
,volume
,avg (volume) over
(
order by date
range between 6 preceding and current row
) as moving_avg
from job_history
;
Demo
create table job_history (date date,volume int);
insert into job_history values
('2017-01-01', 1),('2017-01-02', 2),('2017-01-05', 3),('2017-01-06', 4),('2017-01-08', 5)
,('2017-01-09', 6),('2017-01-10', 7),('2017-01-10', 8),('2017-01-10', 9),('2017-01-11',10)
,('2017-01-11',11),('2017-01-12',12),('2017-01-13',13),('2017-01-14',14),('2017-01-17',15)
;
select * from job_history
;
+------------------+--------------------+
| job_history.date | job_history.volume |
+------------------+--------------------+
| 2017-01-01       | 1                  |
| 2017-01-02       | 2                  |
| 2017-01-05       | 3                  |
| 2017-01-06       | 4                  |
| 2017-01-08       | 5                  |
| 2017-01-09       | 6                  |
| 2017-01-10       | 7                  |
| 2017-01-10       | 8                  |
| 2017-01-10       | 9                  |
| 2017-01-11       | 10                 |
| 2017-01-11       | 11                 |
| 2017-01-12       | 12                 |
| 2017-01-13       | 13                 |
| 2017-01-14       | 14                 |
| 2017-01-17       | 15                 |
+------------------+--------------------+
select date
,volume
,avg (volume) over
(
order by date
range between 6 preceding and current row
) as moving_avg
from job_history
;
+------------+--------+------------+
| date       | volume | moving_avg |
+------------+--------+------------+
| 2017-01-01 | 1      | 1.0        |
| 2017-01-02 | 2      | 1.5        |
| 2017-01-05 | 3      | 2.0        |
| 2017-01-06 | 4      | 2.5        |
| 2017-01-08 | 5      | 3.5        |
| 2017-01-09 | 6      | 4.5        |
| 2017-01-10 | 8      | 6.0        |
| 2017-01-10 | 9      | 6.0        |
| 2017-01-10 | 7      | 6.0        |
| 2017-01-11 | 10     | 7.0        |
| 2017-01-11 | 11     | 7.0        |
| 2017-01-12 | 12     | 8.0        |
| 2017-01-13 | 13     | 9.0        |
| 2017-01-14 | 14     | 9.5        |
| 2017-01-17 | 15     | 12.5       |
+------------+--------+------------+
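To see why RANGE and ROWS differ on this data, the frame logic can be re-implemented outside Hive. The following is only a Python sketch of the frame semantics, not how Hive evaluates the query; the helper names range_avg and rows_avg are made up for this illustration, and the rows are copied from the demo table.

```python
from datetime import date, timedelta

# Demo rows copied from the job_history table: (date, volume).
rows = [
    (date(2017, 1, 1), 1), (date(2017, 1, 2), 2), (date(2017, 1, 5), 3),
    (date(2017, 1, 6), 4), (date(2017, 1, 8), 5), (date(2017, 1, 9), 6),
    (date(2017, 1, 10), 7), (date(2017, 1, 10), 8), (date(2017, 1, 10), 9),
    (date(2017, 1, 11), 10), (date(2017, 1, 11), 11), (date(2017, 1, 12), 12),
    (date(2017, 1, 13), 13), (date(2017, 1, 14), 14), (date(2017, 1, 17), 15),
]


def range_avg(rows, days_back=6):
    """RANGE-style frame: every row whose date lies in [d - days_back, d]."""
    out = []
    for d, _ in rows:
        lo = d - timedelta(days=days_back)
        window = [v for x, v in rows if lo <= x <= d]
        out.append(sum(window) / len(window))
    return out


def rows_avg(rows, n_back=6):
    """ROWS-style frame: the current row plus up to n_back rows before it."""
    ordered = sorted(rows)
    out = []
    for i in range(len(ordered)):
        window = [v for _, v in ordered[max(0, i - n_back): i + 1]]
        out.append(sum(window) / len(window))
    return out


range_result = range_avg(rows)
rows_result = rows_avg(rows)

for (d, v), r_range, r_rows in zip(rows, range_result, rows_result):
    print(d, v, r_range, r_rows)
```

The RANGE column reproduces the moving_avg values shown above. The ROWS column diverges wherever days are missing: for 2017-01-17 it averages the last 7 rows (12.0) rather than the rows dated 2017-01-11 through 2017-01-17 (12.5).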
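As David mentioned above, you can use range between. However, there is no INTERVAL function for the frame bounds. Instead, we can convert the date to a Unix timestamp (in seconds) first and express the bounds in seconds: 604800 is 7 days and 86400 is 1 day, so the frame below covers the 7 days before the current date and excludes the current date itself.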
AVG(metric)
OVER (PARTITION BY author_id
ORDER BY unix_timestamp(to_date(date))
RANGE BETWEEN 604800 PRECEDING AND 86400 PRECEDING
) AS metric_rolling_7_day_average
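The arithmetic behind those two bounds can be checked with a small Python sketch. It is an illustration only: it reuses the job_history demo rows rather than the metric and author_id columns (so there is no partitioning), it assumes midnight UTC when converting dates to seconds, whereas Hive's unix_timestamp follows the session time zone, and the helper names to_epoch and prior_week_avg are hypothetical.

```python
from datetime import date, datetime, timezone

DAY = 86400  # seconds in one day; 7 * DAY == 604800

# Demo rows reused from the job_history table: (date, volume).
rows = [
    (date(2017, 1, 1), 1), (date(2017, 1, 2), 2), (date(2017, 1, 5), 3),
    (date(2017, 1, 6), 4), (date(2017, 1, 8), 5), (date(2017, 1, 9), 6),
    (date(2017, 1, 10), 7), (date(2017, 1, 10), 8), (date(2017, 1, 10), 9),
    (date(2017, 1, 11), 10), (date(2017, 1, 11), 11), (date(2017, 1, 12), 12),
    (date(2017, 1, 13), 13), (date(2017, 1, 14), 14), (date(2017, 1, 17), 15),
]


def to_epoch(d):
    """Seconds since 1970-01-01 for midnight UTC of the given date (assumption)."""
    return int(datetime(d.year, d.month, d.day, tzinfo=timezone.utc).timestamp())


def prior_week_avg(rows):
    """Average over rows with timestamp in [t - 7*DAY, t - DAY].

    Returns None for an empty frame, mirroring AVG yielding NULL.
    """
    stamped = [(to_epoch(d), v) for d, v in rows]
    out = []
    for t, _ in stamped:
        window = [v for x, v in stamped if t - 7 * DAY <= x <= t - DAY]
        out.append(sum(window) / len(window) if window else None)
    return out


result = prior_week_avg(rows)
for (d, v), avg in zip(rows, result):
    print(d, v, avg)
```

Because the upper bound is one day before the current row, the first date has an empty frame, and every later row is averaged only against earlier dates. To include the current day as in the first answer, the upper bound would be CURRENT ROW instead of 86400 PRECEDING.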