Sum values of a column in a time window defined by start/stop events
I'm struggling to write a SQL window function in Snowflake that sums consecutive values in a column.
Data in the `stg_events` table:
robot_id | timestamp | msg_type | obj_count |
---|---|---|---|
1 | 2020-12-14 09:30:00.000 | route_start | NULL |
1 | 2020-12-14 09:30:00.100 | object_detected | 2 |
1 | 2020-12-14 09:30:00.300 | object_detected | 1 |
1 | 2020-12-14 09:30:05.000 | object_detected | 2 |
1 | 2020-12-14 09:30:40.000 | route_stop | NULL |
Desired output of the SQL statement I'm trying to write:
robot_id | route_id | route_start | route_stop | sum_obj |
---|---|---|---|---|
1 | 1 | 2020-12-14 09:30:00.000 | 2020-12-14 09:30:40.000 | 5 |
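I've only given an example of one robot with one route, but there will be more robots pushing data to the table, and more routes.
Any ideas are much appreciated!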
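You can use `GROUP BY` and aggregate functions as follows:
-- Note: this assumes stg_events already has a route_id column,
-- which the sample data in the question does not show.
select robot_id,
       route_id,
       min(timestamp) as route_start,
       max(timestamp) as route_end,
       sum(obj_count) as obj_count
from stg_events t
group by robot_id, route_id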
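Assuming that the same robot may have multiple routes, aggregation alone cannot solve this. This is a gaps-and-islands problem, where an island starts with a 'route_start' message and ends with a 'route_stop' message.
If starts and stops are properly interleaved, here is an approach using window functions: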
select robot_id,
       min(timestamp) as route_start,
       max(timestamp) as route_end,
       sum(obj_count) as obj_count
from (
    select t.*,
           -- starts seen so far, including the current row
           sum(case when msg_type = 'route_start' then 1 else 0 end)
               over(partition by robot_id order by timestamp) as cnt_start,
           -- stops seen so far, excluding the current row
           sum(case when msg_type = 'route_stop' then 1 else 0 end)
               over(partition by robot_id order by timestamp
                    rows between unbounded preceding and 1 preceding) as cnt_end
    from mytable t  -- mytable stands in for stg_events
) t
where cnt_start = coalesce(cnt_end, 0) + 1
group by robot_id, cnt_start
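The idea is to count the starts (including the current row) and the stops (up to the previous row), then compare the two values to identify the islands. A row belongs to a route exactly when one more start than stop has been seen. The rest is just aggregation.
Here is a demo with a small set of sample data: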
robot_id | timestamp | msg_type | obj_count
-------: | :-------------------- | :-------------- | --------:
1 | 2020-12-14 09:30:00 | route_start | null
1 | 2020-12-14 09:30:00.1 | object_detected | 2
1 | 2020-12-14 09:30:00.3 | object_detected | 1
1 | 2020-12-14 09:30:05 | object_detected | 2
1 | 2020-12-14 09:30:40 | route_stop | null
1 | 2020-12-15 00:30:00 | route_start | null
1 | 2020-12-15 00:30:05 | object_detected | 2
1 | 2020-12-15 00:30:40 | route_stop | null
Result:
robot_id | route_start | route_end | obj_count
-------: | :------------------ | :------------------ | --------:
1 | 2020-12-14 09:30:00 | 2020-12-14 09:30:40 | 5
1 | 2020-12-15 00:30:00 | 2020-12-15 00:30:40 | 2
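To make the filter `cnt_start = coalesce(cnt_end, 0) + 1` easier to follow, here is a minimal sketch in plain Python that traces the two running counts row by row for one robot. It is only an illustration of the counting logic, not the Snowflake query itself. The function name `trace_islands` and the stray detection at 09:31:00 between the two routes are hypothetical additions, included to show a row that the filter drops.

```python
# Rows for one robot, already sorted by timestamp: (timestamp, msg_type, obj_count).
events = [
    ("2020-12-14 09:30:00.000", "route_start", None),
    ("2020-12-14 09:30:00.100", "object_detected", 2),
    ("2020-12-14 09:30:00.300", "object_detected", 1),
    ("2020-12-14 09:30:05.000", "object_detected", 2),
    ("2020-12-14 09:30:40.000", "route_stop", None),
    ("2020-12-14 09:31:00.000", "object_detected", 7),  # hypothetical stray row
    ("2020-12-15 00:30:00.000", "route_start", None),
    ("2020-12-15 00:30:05.000", "object_detected", 2),
    ("2020-12-15 00:30:40.000", "route_stop", None),
]


def trace_islands(rows):
    """Return (timestamp, msg_type, cnt_start, cnt_end, kept) for each row."""
    traced = []
    cnt_start = 0   # starts seen so far, including the current row
    stops_seen = 0  # stops seen strictly before the current row
    for ts, msg_type, _ in rows:
        if msg_type == "route_start":
            cnt_start += 1
        cnt_end = stops_seen
        kept = cnt_start == cnt_end + 1  # same test as the WHERE clause
        traced.append((ts, msg_type, cnt_start, cnt_end, kept))
        if msg_type == "route_stop":
            stops_seen += 1  # only visible to the rows that follow
    return traced


traced = trace_islands(events)

# Aggregate obj_count per island, skipping NULLs as SQL's sum() does.
totals = {}
for (_, _, cnt_start, _, kept), (_, _, obj_count) in zip(traced, events):
    if kept:
        totals[cnt_start] = totals.get(cnt_start, 0) + (obj_count or 0)

for row in traced:
    print(row)
print(totals)
```

Because the stop count lags one row behind, the `route_stop` row itself still satisfies the test and stays inside its island, while anything after it fails the test until the next `route_start` arrives.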
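If a robot has multiple starts, you want to assign a group identifier to each route, which you call `route_id`. The simplest way is a cumulative sum of the "starts". Then aggregate: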
select robot_id, route_id,
       min(timestamp) as route_start,
       max(timestamp) as route_end,
       sum(obj_count) as obj_count
from (select e.*,
             sum(case when msg_type = 'route_start' then 1 else 0 end)
                 over (partition by robot_id order by timestamp) as route_id
      from stg_events e
     ) e
group by robot_id, route_id;
Note: this does not take the 'route_stop' message into account. It is unclear what you want when that is not the last row before the next start.
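To illustrate that caveat, here is a rough sketch that runs the cumulative-sum query through Python's built-in `sqlite3` module. SQLite is only a stand-in for Snowflake here (an assumption; it needs SQLite 3.25 or later for window functions), and the detection at 09:31:00, after the first stop and before the next start, is a hypothetical row added for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table stg_events ("
    "robot_id integer, timestamp text, msg_type text, obj_count integer)"
)
conn.executemany(
    "insert into stg_events values (?, ?, ?, ?)",
    [
        (1, "2020-12-14 09:30:00.000", "route_start", None),
        (1, "2020-12-14 09:30:00.100", "object_detected", 2),
        (1, "2020-12-14 09:30:00.300", "object_detected", 1),
        (1, "2020-12-14 09:30:05.000", "object_detected", 2),
        (1, "2020-12-14 09:30:40.000", "route_stop", None),
        (1, "2020-12-14 09:31:00.000", "object_detected", 7),  # hypothetical stray row
        (1, "2020-12-15 00:30:00.000", "route_start", None),
        (1, "2020-12-15 00:30:05.000", "object_detected", 2),
        (1, "2020-12-15 00:30:40.000", "route_stop", None),
    ],
)

# Same query as above: route_id is the running count of route_start rows.
query = """
select robot_id, route_id,
       min(timestamp) as route_start,
       max(timestamp) as route_end,
       sum(obj_count) as obj_count
from (select e.*,
             sum(case when msg_type = 'route_start' then 1 else 0 end)
                 over (partition by robot_id order by timestamp) as route_id
      from stg_events e
     ) e
group by robot_id, route_id
order by robot_id, route_id
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

In this run the stray detection is attributed to route 1, so its `route_end` becomes 09:31:00.000 and its total becomes 12 instead of 5. If rows after a stop should be ignored, the start/stop counting approach shown earlier is the safer choice.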