Impala last_value() not giving the expected result

I have a table in Impala that holds a Unix timestamp in milliseconds (one reading per millisecond) and three measured variables, as shown below:

ts          Val1        Val2        Val3        
1.60669E+12 7541.76     0.55964607  267.1613        
1.60669E+12 7543.04     0.5607262   267.27805       
1.60669E+12 7543.04     0.5607241   267.22308       
1.60669E+12 7543.6797   0.56109643  267.25974       
1.60669E+12 7543.6797   0.56107396  267.30624       
1.60669E+12 7543.6797   0.56170875  267.2643    

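I want to resample the data and take the last value in each new time window. For example, if I resample at a 10-second frequency, the output should contain the last value of each 10-second window, like this: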

ts                      val1_Last       Val2_Last       Val3_Last   
2020-11-29 22:30:00     7541.76         0.55964607      267.1613
2020-11-29 22:30:10     7542.3994       0.5613486       267.31238
2020-11-29 22:30:20     7542.3994       0.5601791       267.22842
2020-11-29 22:30:30     7544.32         0.56069416      267.20248

To get this result, I am running the query below:

select distinct *
from (
select ts,
last_value(Val1) over (partition by ts order by ts rows between unbounded preceding and unbounded following) as Val1, 
last_value(Val2) over (partition by ts order by ts rows between unbounded preceding and unbounded following) as Val2,
last_value(Val3) over (partition by ts order by ts rows between unbounded preceding and unbounded following) as Val3 
from (SELECT cast(cast(unix_timestamp(cast(ts/1000 as TIMESTAMP))/10 as bigint)*10 as TIMESTAMP) as ts , 
Val1 as Val1, 
Val2 as Val2, 
Val3 as Val3
FROM Sensor_Data.Table where unit='Unit1'  
and cast(ts/1000 as TIMESTAMP) BETWEEN '2020-11-29 22:30:00' and '2020-12-01 01:51:00') as ttt) as tttt 
order by ts

I read on some forums that LAST_VALUE() sometimes causes problems, so I tried the same thing using FIRST_VALUE() with ORDER BY ... DESC. The query is below:

select distinct *
from (
select ts,
first_value(Val1) over (partition by ts order by ts desc rows between unbounded preceding and unbounded following) as Val1, 
first_value(Val2) over (partition by ts order by ts desc rows between unbounded preceding and unbounded following) as Val2,
first_value(Val3) over (partition by ts order by ts desc rows between unbounded preceding and unbounded following) as Val3
from (SELECT cast(cast(unix_timestamp(cast(ts/1000 as TIMESTAMP))/10 as bigint)*10 as TIMESTAMP) as ts , 
Val1 as Val1, 
val2 as Val2, 
Val3 as Val3
FROM product_sofcdtw_ops.as_operated_full_backup where unit='FCS05-09'  
and cast(ts/1000 as TIMESTAMP) BETWEEN '2020-11-29 22:30:00' and '2020-12-01 01:51:00') as ttt) as tttt 
order by ts

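In both cases, however, I do not get the expected result. The resampled time ts comes out as expected (10-second windows), but for Val1, Val2 and Val3 I get seemingly random values from somewhere inside each window (0-9 s, 10-19 s, ...).

Logically the query looks fine to me and I cannot find anything wrong with it. Can anyone explain why it does not return the correct answer?

Thanks!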

The problem is this line:

last_value(Val1) over (partition by ts order by ts rows between unbounded preceding and unbounded following) as Val1, 

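You are partitioning and ordering by the same column, ts, so there is effectively no ordering (or, more precisely, ordering by a value that is constant throughout the partition produces an arbitrary order). To make this work, you need to keep the original ts and use it for the ordering: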

-- ts_10 is the 10-second bucket; the original ts is kept for ordering within it
select distinct ts_10,
       last_value(Val1) over (partition by ts_10 order by ts rows between unbounded preceding and unbounded following) as Val1,
       last_value(Val2) over (partition by ts_10 order by ts rows between unbounded preceding and unbounded following) as Val2,
       last_value(Val3) over (partition by ts_10 order by ts rows between unbounded preceding and unbounded following) as Val3
from (SELECT cast(cast(unix_timestamp(cast(ts/1000 as TIMESTAMP))/10 as bigint)*10 as TIMESTAMP) as ts_10,
             t.*
      FROM Sensor_Data.Table t
      WHERE unit = 'Unit1' AND
            cast(ts/1000 as TIMESTAMP) BETWEEN '2020-11-29 22:30:00' and '2020-12-01 01:51:00'
     ) t
order by ts_10
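
As an alternative to last_value(), the same "last row per window" result can usually be obtained by numbering the rows in each bucket and keeping only the newest one. This is a different technique from the one above, shown only as a sketch: the table and column names come from the question, while the aliases ts_10, rn, b and r are illustrative.

```sql
-- Sketch only: keep the newest row of every 10-second bucket.
-- Table/column names are from the question; ts_10, rn, b and r are illustrative aliases.
SELECT ts_10, Val1, Val2, Val3
FROM (
  SELECT ts_10, Val1, Val2, Val3,
         -- rn = 1 marks the row with the largest original ts in its bucket
         row_number() OVER (PARTITION BY ts_10 ORDER BY ts DESC) AS rn
  FROM (
    SELECT cast(cast(unix_timestamp(cast(ts/1000 AS TIMESTAMP))/10 AS BIGINT)*10 AS TIMESTAMP) AS ts_10,
           ts, Val1, Val2, Val3
    FROM Sensor_Data.Table
    WHERE unit = 'Unit1'
      AND cast(ts/1000 AS TIMESTAMP) BETWEEN '2020-11-29 22:30:00' AND '2020-12-01 01:51:00'
  ) b
) r
WHERE rn = 1
ORDER BY ts_10;
```

This returns one row per bucket without needing DISTINCT. If two rows in a bucket share the same ts, which of them gets rn = 1 is arbitrary, so a tie-breaking column would have to be added to the ORDER BY in that case.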

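Incidentally, the usual problem with last_value() is that it behaves unexpectedly when you leave out the window frame (the rows or range part of the window function specification).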

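The issue is that the default frame is range between unbounded preceding and current row, which means last_value() just picks up the value from the current row (strictly, from the last row that ties with the current row in the ordering).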
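
To make the effect of the default frame concrete, here is a small sketch on made-up data. The demo CTE and its three rows are hypothetical and stand in for a single 10-second bucket.

```sql
-- Hypothetical sample: three rows in one bucket (ts_10 = 1).
WITH demo AS (
  SELECT 1 AS ts_10, 1 AS ts, 10 AS val UNION ALL
  SELECT 1, 2, 20 UNION ALL
  SELECT 1, 3, 30
)
SELECT ts,
       val,
       -- no frame given: RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW applies
       last_value(val) OVER (PARTITION BY ts_10 ORDER BY ts) AS last_default,
       -- explicit frame covering the whole partition
       last_value(val) OVER (PARTITION BY ts_10 ORDER BY ts
                             ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_full
FROM demo
ORDER BY ts;
```

With the default frame, last_default should simply repeat val (10, 20, 30), while last_full should be 30 on every row.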

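first_value(), on the other hand, works fine with the default frame, because the frame always starts at the first row of the partition. If you include an explicit frame covering the whole partition, the two are equivalent: last_value() over an ascending order returns the same row as first_value() over a descending order.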
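
A minimal sketch of that equivalence, again on hypothetical data (the demo CTE is made up for illustration):

```sql
-- Hypothetical sample: three rows in one bucket (ts_10 = 1).
WITH demo AS (
  SELECT 1 AS ts_10, 1 AS ts, 10 AS val UNION ALL
  SELECT 1, 2, 20 UNION ALL
  SELECT 1, 3, 30
)
SELECT ts,
       val,
       -- default frame is enough here: it always begins at the newest row
       first_value(val) OVER (PARTITION BY ts_10 ORDER BY ts DESC) AS first_desc,
       -- last_value needs the explicit full-partition frame to match
       last_value(val) OVER (PARTITION BY ts_10 ORDER BY ts
                             ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_asc
FROM demo
ORDER BY ts;
```

Both columns should show 30 on every row, since each one picks the value belonging to the largest ts in the bucket.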