Impala LAST_VALUE() not giving the expected result
I have a table in Impala that stores time as Unix time (at a 1 ms frequency) together with three variables, as shown below:
ts Val1 Val2 Val3
1.60669E+12 7541.76 0.55964607 267.1613
1.60669E+12 7543.04 0.5607262 267.27805
1.60669E+12 7543.04 0.5607241 267.22308
1.60669E+12 7543.6797 0.56109643 267.25974
1.60669E+12 7543.6797 0.56107396 267.30624
1.60669E+12 7543.6797 0.56170875 267.2643
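I want to resample the data and get the last value in each new time window. For example, if I resample at a 10-second frequency, the output should be the last value of each 10-second window, like this: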
ts val1_Last Val2_Last Val3_Last
2020-11-29 22:30:00 7541.76 0.55964607 267.1613
2020-11-29 22:30:10 7542.3994 0.5613486 267.31238
2020-11-29 22:30:20 7542.3994 0.5601791 267.22842
2020-11-29 22:30:30 7544.32 0.56069416 267.20248
To get this result, I am running the query below:
select distinct *
from (
select ts,
last_value(Val1) over (partition by ts order by ts rows between unbounded preceding and unbounded following) as Val1,
last_value(Val2) over (partition by ts order by ts rows between unbounded preceding and unbounded following) as Val2,
last_value(Val3) over (partition by ts order by ts rows between unbounded preceding and unbounded following) as Val3
from (SELECT cast(cast(unix_timestamp(cast(ts/1000 as TIMESTAMP))/10 as bigint)*10 as TIMESTAMP) as ts ,
Val1 as Val1,
Val2 as Val2,
Val3 as Val3
FROM Sensor_Data.Table where unit='Unit1'
and cast(ts/1000 as TIMESTAMP) BETWEEN '2020-11-29 22:30:00' and '2020-12-01 01:51:00') as ttt) as tttt
order by ts
I read on some forums that LAST_VALUE() sometimes causes problems, so I tried FIRST_VALUE with ORDER BY DESC to achieve the same thing. The query is below:
select distinct *
from (
select ts,
first_value(Val1) over (partition by ts order by ts desc rows between unbounded preceding and unbounded following) as Val1,
first_value(Val2) over (partition by ts order by ts desc rows between unbounded preceding and unbounded following) as Val2,
first_value(Val3) over (partition by ts order by ts desc rows between unbounded preceding and unbounded following) as Val3
from (SELECT cast(cast(unix_timestamp(cast(ts/1000 as TIMESTAMP))/10 as bigint)*10 as TIMESTAMP) as ts ,
Val1 as Val1,
val2 as Val2,
Val3 as Val3
FROM product_sofcdtw_ops.as_operated_full_backup where unit='FCS05-09'
and cast(ts/1000 as TIMESTAMP) BETWEEN '2020-11-29 22:30:00' and '2020-12-01 01:51:00') as ttt) as tttt
order by ts
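But in both cases I am not getting the expected result. The resampled time ts comes out as expected (in 10-second windows), but for Val1, Val2 and Val3 I get what look like random values from within each 0-9 s, 10-19 s, ... window rather than the last one.
Logically the query looks fine to me and I cannot find any problem with it. Could anyone explain why this query does not give the right answer?
Thanks!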
The problem is this line:
last_value(Val1) over (partition by ts order by ts rows between unbounded preceding and unbounded following) as Val1,
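You are partitioning and ordering by the same column, ts, so there is effectively no ordering (more specifically, ordering by a value that is constant throughout the partition results in an arbitrary order). To make this work, you need to keep the original ts and use it for the ordering: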
select distinct ts_10, Val1, Val2, Val3
from (
select ts_10,
last_value(Val1) over (partition by ts_10 order by ts rows between unbounded preceding and unbounded following) as Val1,
last_value(Val2) over (partition by ts_10 order by ts rows between unbounded preceding and unbounded following) as Val2,
last_value(Val3) over (partition by ts_10 order by ts rows between unbounded preceding and unbounded following) as Val3
from (SELECT cast(cast(unix_timestamp(cast(ts/1000 as TIMESTAMP))/10 as bigint)*10 as TIMESTAMP) as ts_10,
t.*
FROM Sensor_Data.Table t
WHERE unit = 'Unit1' AND
cast(ts/1000 as TIMESTAMP) BETWEEN '2020-11-29 22:30:00' and '2020-12-01 01:51:00'
) t
) tt
order by ts_10
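The idea of partitioning by the 10-second bucket while ordering by the original timestamp can be checked on a tiny data set. The following is only a sketch: it uses Python's built-in sqlite3 module as a stand-in for Impala, and the table name sensor, the column val1 and the sample rows are made up for illustration. It also replaces the TIMESTAMP casts with plain integer arithmetic, (ts / 10000) * 10, to form the bucket (in seconds).

```python
import sqlite3

# Hypothetical sample (ts in epoch milliseconds, val1), deliberately
# inserted out of chronological order. Not data from the question.
rows = [
    (9000, 3.0), (1000, 1.0), (4000, 2.0),     # 0-9 s bucket
    (19999, 6.0), (10000, 4.0), (15000, 5.0),  # 10-19 s bucket
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sensor (ts INTEGER, val1 REAL)")
conn.executemany("INSERT INTO sensor VALUES (?, ?)", rows)

# Partition by the bucket (ts_10) but order by the ORIGINAL ts, so that
# "last" means "latest timestamp inside the bucket".
query = """
SELECT DISTINCT ts_10, val1_last
FROM (
    SELECT ts_10,
           last_value(val1) OVER (
               PARTITION BY ts_10
               ORDER BY ts
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
           ) AS val1_last
    FROM (SELECT (ts / 10000) * 10 AS ts_10, ts, val1 FROM sensor) AS b
) AS r
ORDER BY ts_10
"""
result = conn.execute(query).fetchall()
print(result)  # one row per 10-second bucket
```

Each bucket yields a single row holding the value with the largest original ts, regardless of the order in which the rows were inserted.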
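As an aside, the issue with last_value() is that it behaves unexpectedly when you leave out the window frame (the rows or range part of the window function specification).
The issue is that the default specification is range between unbounded preceding and current row, which means that last_value() just picks up the value in the current row (or in its last peer, if several rows tie on the ordering key).
first_value(), on the other hand, works fine with the default frame. However, the two are equivalent if you include an explicit frame.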
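The effect of the default frame can be seen on three rows. Again this is only a sketch, using Python's sqlite3 module as a stand-in for Impala (SQLite applies the same default frame when ORDER BY is present and no frame is given). The table t and its columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (ts INTEGER, v TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")])

# No explicit frame: the frame ends at the current row, so last_value()
# simply echoes the current row, while first_value() still sees row 1.
default_frame = conn.execute("""
    SELECT ts,
           first_value(v) OVER (ORDER BY ts) AS first_v,
           last_value(v)  OVER (ORDER BY ts) AS last_v
    FROM t ORDER BY ts
""").fetchall()

# Explicit full frame: last_value() ascending and first_value() descending
# both return the value of the latest row.
explicit_frame = conn.execute("""
    SELECT ts,
           last_value(v) OVER (ORDER BY ts
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_v,
           first_value(v) OVER (ORDER BY ts DESC
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS first_v_desc
    FROM t ORDER BY ts
""").fetchall()

print(default_frame)
print(explicit_frame)
```

With the default frame, last_v tracks the current row; with the explicit frame, both columns hold the value from the row with the highest ts.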