Sum values of a column within a time window defined by start/stop events

I'm struggling to write a SQL window function in Snowflake that sums consecutive values in a column.

Data in the stg_events table:

robot_id | timestamp               | msg_type        | obj_count
-------: | :---------------------- | :-------------- | --------:
       1 | 2020-12-14 09:30:00.000 | route_start     |      NULL
       1 | 2020-12-14 09:30:00.100 | object_detected |         2
       1 | 2020-12-14 09:30:00.300 | object_detected |         1
       1 | 2020-12-14 09:30:05.000 | object_detected |         2
       1 | 2020-12-14 09:30:40.000 | route_stop      |      NULL

Desired output of the SQL statement I'm trying to write:

robot_id | route_id | route_start             | route_stop              | sum_obj
-------: | -------: | :---------------------- | :---------------------- | ------:
       1 |        1 | 2020-12-14 09:30:00.000 | 2020-12-14 09:30:40.000 |       5

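I have only shown an example with one robot and one route, but there will be more robots pushing data to the table, and more routes.

Any ideas are much appreciated!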

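You can use GROUP BY with aggregate functions, as follows (this assumes the table already has a route_id column, which the sample data does not show):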

select robot_id, 
       route_id,
       min(timestamp) as route_start,
       max(timestamp) as route_end,
       sum(obj_count) as obj_count
  from stg_events t
group by robot_id, route_id

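Assuming the same robot can have multiple routes, aggregation alone won't solve this. It is a gaps-and-islands problem, where each island starts with a 'route_start' message and ends with a 'route_stop' message.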

If starts and stops are properly interleaved, here is an approach using window functions:

select robot_id,
       min(timestamp) as route_start,
       max(timestamp) as route_end,
       sum(obj_count) as obj_count
from (
    select t.*,
        -- number of starts seen so far, including the current row
        sum(case when msg_type = 'route_start' then 1 else 0 end) over(partition by robot_id order by timestamp) as cnt_start,
        -- number of stops seen strictly before the current row (null on the first row)
        sum(case when msg_type = 'route_stop'  then 1 else 0 end) over(partition by robot_id order by timestamp rows between unbounded preceding and 1 preceding) as cnt_end
    from stg_events t
) t
-- keep only rows that fall inside an open route
where cnt_start = coalesce(cnt_end, 0) + 1
group by robot_id, cnt_start

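The idea is to count the starts (up to and including the current row) and the stops (up to the previous row), then compare the two counts to identify the islands: a row belongs to a route exactly when there is one more start than there are earlier stops. The rest is just aggregation.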
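To make the two counters concrete, here is a minimal sketch that prints them row by row. It is only an illustration under stated assumptions: it runs against an in-memory SQLite database (3.25 or later, for window functions) as a stand-in for Snowflake, and the 09:31:00 row is a hypothetical stray detection added to show what the filter excludes.

```python
import sqlite3

# Assumption: SQLite stands in for Snowflake; the window syntax used here is the same.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table stg_events (robot_id int, timestamp text, msg_type text, obj_count int)"
)
conn.executemany("insert into stg_events values (?, ?, ?, ?)", [
    (1, "2020-12-14 09:30:00.000", "route_start", None),
    (1, "2020-12-14 09:30:00.100", "object_detected", 2),
    (1, "2020-12-14 09:30:40.000", "route_stop", None),
    (1, "2020-12-14 09:31:00.000", "object_detected", 7),  # hypothetical row outside any route
    (1, "2020-12-15 00:30:00.000", "route_start", None),
    (1, "2020-12-15 00:30:40.000", "route_stop", None),
])

counters = conn.execute("""
    select msg_type,
           -- starts so far, including the current row
           sum(case when msg_type = 'route_start' then 1 else 0 end)
               over (partition by robot_id order by timestamp) as cnt_start,
           -- stops strictly before the current row; the empty frame on the first row gives null
           coalesce(sum(case when msg_type = 'route_stop' then 1 else 0 end)
               over (partition by robot_id order by timestamp
                     rows between unbounded preceding and 1 preceding), 0) as cnt_end
    from stg_events
    order by robot_id, timestamp
""").fetchall()

for msg_type, cnt_start, cnt_end in counters:
    in_route = cnt_start == cnt_end + 1
    print(f"{msg_type:16} starts={cnt_start} stops_before={cnt_end} in_route={in_route}")
```

Only the stray row has as many earlier stops as starts, so it is the only one the where clause drops. The route_stop row itself is kept, because its own stop is not counted until the following row.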

Here is a demo with a small sample of data:

robot_id | timestamp             | msg_type        | obj_count
-------: | :-------------------- | :-------------- | --------:
       1 | 2020-12-14 09:30:00   | route_start     |      null
       1 | 2020-12-14 09:30:00.1 | object_detected |         2
       1 | 2020-12-14 09:30:00.3 | object_detected |         1
       1 | 2020-12-14 09:30:05   | object_detected |         2
       1 | 2020-12-14 09:30:40   | route_stop      |      null
       1 | 2020-12-15 00:30:00   | route_start     |      null
       1 | 2020-12-15 00:30:05   | object_detected |         2
       1 | 2020-12-15 00:30:40   | route_stop      |      null

Result:

robot_id | route_start         | route_end           | obj_count
-------: | :------------------ | :------------------ | --------:
       1 | 2020-12-14 09:30:00 | 2020-12-14 09:30:40 |         5
       1 | 2020-12-15 00:30:00 | 2020-12-15 00:30:40 |         2

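If a robot has multiple starts, you want to assign a group identifier to each route (what you call route_id). The simplest way is a cumulative sum of the 'route_start' rows. Then aggregate: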

select robot_id, route_id,
       min(timestamp) as route_start,
       max(timestamp) as route_end,
       sum(obj_count) as obj_count
from (select e.*,
             sum(case when msg_type = 'route_start' then 1 else 0 end) over (partition by robot_id order by timestamp) as route_id
      from stg_events e
     ) e
group by robot_id, route_id;

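Note: this does not take 'route_stop' into account. If the stop is not the last row before the next start, it is unclear what you want: any rows after the stop are still counted toward the preceding route.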
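The difference shows up as soon as a row lands between a stop and the next start. The sketch below is an illustration only, under these assumptions: SQLite (3.25+) stands in for Snowflake, the 09:45:00 detection is a hypothetical stray row, and the names stops_before, starts_only and with_stops are made up for this example. It runs the cumulative-sum grouping once as is and once with the extra start/stop filter.

```python
import sqlite3

# Assumption: in-memory SQLite as a stand-in for Snowflake.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table stg_events (robot_id int, timestamp text, msg_type text, obj_count int)"
)
conn.executemany("insert into stg_events values (?, ?, ?, ?)", [
    (1, "2020-12-14 09:30:00", "route_start", None),
    (1, "2020-12-14 09:30:05", "object_detected", 2),
    (1, "2020-12-14 09:30:40", "route_stop", None),
    (1, "2020-12-14 09:45:00", "object_detected", 7),  # hypothetical: arrives after the stop
    (1, "2020-12-15 00:30:00", "route_start", None),
    (1, "2020-12-15 00:30:05", "object_detected", 2),
    (1, "2020-12-15 00:30:40", "route_stop", None),
])

# Tag every row with its route_id (running count of starts) and the
# number of stops seen strictly before it.
flagged = """
    select e.*,
           sum(case when msg_type = 'route_start' then 1 else 0 end)
               over (partition by robot_id order by timestamp) as route_id,
           coalesce(sum(case when msg_type = 'route_stop' then 1 else 0 end)
               over (partition by robot_id order by timestamp
                     rows between unbounded preceding and 1 preceding), 0) as stops_before
    from stg_events e
"""
summary = """
    select robot_id, route_id, min(timestamp), max(timestamp), sum(obj_count)
    from ({flagged}) e
    {where}
    group by robot_id, route_id
    order by robot_id, route_id
"""

# Cumulative sum only: the stray row is folded into route 1.
starts_only = conn.execute(summary.format(flagged=flagged, where="")).fetchall()
# With the filter: only rows between a start and its stop are aggregated.
with_stops = conn.execute(
    summary.format(flagged=flagged, where="where route_id = stops_before + 1")
).fetchall()

print(starts_only)
print(with_stops)
```

Without the filter, route 1 ends at 09:45:00 with 9 objects; with it, route 1 ends at 09:30:40 with 2. Which result is right depends on whether detections outside a route should count, which the question does not say.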