Combining aggregate functions with resampling in Impala

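I have a table in Hadoop that holds data for different sensor units, with a sampling time ts of 1 ms. I can resample the data for a single unit using a combination of aggregate functions with the following Impala query (say I want to resample the data every 5 minutes using LAST_VALUE() as the aggregate function):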

SELECT DISTINCT *
FROM (
    SELECT ts_resample, unit,
           last_value(Val1) OVER (PARTITION BY ts_resample ORDER BY ts
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS Val1,
           last_value(Val2) OVER (PARTITION BY ts_resample ORDER BY ts
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS Val2
    FROM (
        -- floor each timestamp to the start of its 300-second (5-minute) bucket
        SELECT cast(cast(unix_timestamp(cast(ts/1000 AS TIMESTAMP))/300 AS bigint)*300 AS TIMESTAMP) AS ts_resample,
               ts AS ts, unit AS unit, Val1 AS Val1, Val2 AS Val2
        FROM Sensor_Data.Table1
        WHERE unit='Unit1'
    ) AS t
) AS tt
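
The nested casts that build ts_resample floor each timestamp to the start of its 5-minute bucket. Here is a minimal Python sketch of the same arithmetic. It assumes ts holds non-negative Unix epoch milliseconds, which the ts/1000 in the query suggests but the post does not state outright. bucket_start is a hypothetical helper name for illustration, not an Impala function.

```python
from datetime import datetime, timezone

BUCKET_SECONDS = 300  # 5-minute resampling interval, as in the query


def bucket_start(ts_ms: int) -> int:
    """Floor an epoch-millisecond timestamp to its bucket start, in epoch seconds.

    Assumes ts_ms is non-negative, so integer division matches the
    truncating cast to bigint in the SQL.
    """
    seconds = ts_ms // 1000  # milliseconds -> whole seconds
    return (seconds // BUCKET_SECONDS) * BUCKET_SECONDS


if __name__ == "__main__":
    base_ms = 1_606_780_800_000  # 2020-12-01 00:00:00 UTC
    for offset_ms in (0, 299_999, 300_000):
        start = bucket_start(base_ms + offset_ms)
        print(offset_ms, datetime.fromtimestamp(start, tz=timezone.utc))
```

Every reading from 00:00:00.000 through 00:04:59.999 lands in the bucket that starts at 00:00:00, and the reading exactly 300 seconds later starts the next bucket.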

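When I run this query for a single unit, the answer I get is correct and there are no problems.

But when I want to resample the data for every unit using some aggregate function, e.g. LAST_VALUE(), I get the wrong answer: the resampled result is the same for every unit, even though each unit's data is different. The query I am running is given below; I do not specify any unit in the WHERE clause: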

SELECT DISTINCT *
FROM (
    SELECT ts_resample, unit,
           last_value(Val1) OVER (PARTITION BY ts_resample ORDER BY ts
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS Val1,
           last_value(Val2) OVER (PARTITION BY ts_resample ORDER BY ts
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS Val2
    FROM (
        -- floor each timestamp to the start of its 300-second (5-minute) bucket
        SELECT cast(cast(unix_timestamp(cast(ts/1000 AS TIMESTAMP))/300 AS bigint)*300 AS TIMESTAMP) AS ts_resample,
               ts AS ts, unit AS unit, Val1 AS Val1, Val2 AS Val2
        FROM Sensor_Data.Table1
    ) AS t
) AS tt

The results for the first three units in the data returned by the above query are as follows:

ts_resample             unit    Val1    Val2
2020-12-01 00:00:00     unit1   0.8974  10.485
2020-12-01 00:00:00     unit2   0.8974  10.485
2020-12-01 00:00:00     unit3   0.8974  10.485
2020-12-01 00:05:00     unit1   0.9041  11.854
2020-12-01 00:05:00     unit2   0.9041  11.854
2020-12-01 00:05:00     unit3   0.9041  11.854

What I actually want is the last value for each unit, which is different for every unit, as shown below:

ts_resample             unit    Val1    Val2
2020-12-01 00:00:00     unit1   0.8974  10.485
2020-12-01 00:00:00     unit2   0.9014  11.954
2020-12-01 00:00:00     unit3   0.7854  10.821
2020-12-01 00:05:00     unit1   0.9841  11.125
2020-12-01 00:05:00     unit2   0.8742  10.963
2020-12-01 00:05:00     unit3   0.9632  11.784

Can anyone tell me what is wrong with my query?

Thanks

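I solved the problem by adding unit to the PARTITION BY clause alongside ts_resample. With PARTITION BY ts_resample alone, all units in a bucket share a single window, so last_value returns the same row's value for every unit. The final solution is given below: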

SELECT DISTINCT *
FROM (
    SELECT ts_resample, unit,
           last_value(Val1) OVER (PARTITION BY ts_resample, unit ORDER BY ts
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS Val1,
           last_value(Val2) OVER (PARTITION BY ts_resample, unit ORDER BY ts
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS Val2
    FROM (
        -- floor each timestamp to the start of its 300-second (5-minute) bucket
        SELECT cast(cast(unix_timestamp(cast(ts/1000 AS TIMESTAMP))/300 AS bigint)*300 AS TIMESTAMP) AS ts_resample,
               ts AS ts, unit AS unit, Val1 AS Val1, Val2 AS Val2
        FROM Sensor_Data.Table1
    ) AS t
) AS tt

After this change I got the results I wanted, as shown in my question.
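
The difference between the two PARTITION BY clauses can be reproduced on a toy table. The sketch below uses Python's built-in sqlite3 module as a stand-in for Impala, purely so it runs anywhere. It assumes an SQLite build with window-function support (3.25 or later). The readings table, its four rows, and the helper names make_demo_db and resample are all made up for illustration. The bucket expression is also simplified to integer arithmetic rather than the TIMESTAMP casts of the real query.

```python
import sqlite3

# Same window shape as the Impala query; only the PARTITION BY columns vary.
QUERY = """
SELECT DISTINCT ts_resample, unit,
       last_value(val1) OVER (
           PARTITION BY {partition_cols}
           ORDER BY ts
           ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
       ) AS val1
FROM (SELECT (ts / 1000 / 300) * 300 AS ts_resample, ts, unit, val1
      FROM readings)
ORDER BY unit
"""


def make_demo_db() -> sqlite3.Connection:
    """Build an in-memory toy table; all four rows fall in the same 5-minute bucket."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (ts INTEGER, unit TEXT, val1 REAL)")
    conn.executemany(
        "INSERT INTO readings VALUES (?, ?, ?)",
        [
            (1000, "unit1", 0.10),
            (2000, "unit2", 0.20),
            (3000, "unit1", 0.11),  # latest row for unit1
            (4000, "unit2", 0.21),  # latest row for unit2, and for the whole bucket
        ],
    )
    return conn


def resample(conn: sqlite3.Connection, partition_cols: str) -> list:
    """Run the resampling query with the given PARTITION BY column list."""
    return conn.execute(QUERY.format(partition_cols=partition_cols)).fetchall()


if __name__ == "__main__":
    conn = make_demo_db()
    print("bucket only:    ", resample(conn, "ts_resample"))
    print("bucket and unit:", resample(conn, "ts_resample, unit"))
```

Partitioned by ts_resample alone, both units report 0.21, which is unit2's latest reading. With unit added to the partition, unit1 reports 0.11 and unit2 reports 0.21, which mirrors the fix above.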