Sum values of a column in a time window defined by start/stop events

Question

I'm struggling to write a SQL window function in Snowflake that sums consecutive values in a column.

Data in the stg_events table:

robot_id	timestamp	msg_type	obj_count
1	2020-12-14 09:30:00.000	route_start	NULL
1	2020-12-14 09:30:00.100	object_detected	2
1	2020-12-14 09:30:00.300	object_detected	1
1	2020-12-14 09:30:05.000	object_detected	2
1	2020-12-14 09:30:40.000	route_stop	NULL

Desired output of the SQL statement I'm trying to write:

robot_id	route_id	route_start	route_stop	sum_obj
1	1	2020-12-14 09:30:00.000	2020-12-14 09:30:40.000	5

I've shown only one robot with one route, but many more robots will push data to the table, with many more routes.

Any ideas are greatly appreciated!

Answer 1

You can use GROUP BY with aggregate functions, as follows:

-- note: this relies on stg_events already having a route_id column
select robot_id,
       route_id,
       min(timestamp) as route_start,
       max(timestamp) as route_end,
       sum(obj_count) as obj_count
  from stg_events t
group by robot_id, route_id

Answer 2

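Assuming the same robot can have multiple routes, aggregation alone won't solve this. This is a gaps-and-islands problem, where each island starts with a 'route_start' message and ends with a 'route_stop' message.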

If starts and stops are properly interleaved, here is an approach using window functions:

select robot_id, min(timestamp) as route_start, max(timestamp) as route_end, sum(obj_count) as obj_count
from (
    select t.*,
        sum(case when msg_type = 'route_start' then 1 else 0 end) over(partition by robot_id order by timestamp) as cnt_start,
        sum(case when msg_type = 'route_stop'  then 1 else 0 end) over(partition by robot_id order by timestamp rows between unbounded preceding and 1 preceding) as cnt_end
    from mytable t
) t
where cnt_start = coalesce(cnt_end, 0) + 1
group by robot_id, cnt_start

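The idea is to count the starts (including the current row) and the stops (up to the previous row), then compare the two values to identify the islands: a row belongs to a route when exactly one more start than stop has been seen. The rest is just aggregation.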

Here is a demo on a small sample data set:

robot_id | timestamp             | msg_type        | obj_count
-------: | :-------------------- | :-------------- | --------:
       1 | 2020-12-14 09:30:00   | route_start     |      null
       1 | 2020-12-14 09:30:00.1 | object_detected |         2
       1 | 2020-12-14 09:30:00.3 | object_detected |         1
       1 | 2020-12-14 09:30:05   | object_detected |         2
       1 | 2020-12-14 09:30:40   | route_stop      |      null
       1 | 2020-12-15 00:30:00   | route_start     |      null
       1 | 2020-12-15 00:30:05   | object_detected |         2
       1 | 2020-12-15 00:30:40   | route_stop      |      null

Result:

robot_id | route_start         | route_end           | obj_count
-------: | :------------------ | :------------------ | --------:
       1 | 2020-12-14 09:30:00 | 2020-12-14 09:30:40 |         5
       1 | 2020-12-15 00:30:00 | 2020-12-15 00:30:40 |         2
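
To see why the filter picks out the islands, it can help to look at the two counters row by row. The following is a minimal sketch, not the demo itself: it runs the same query on SQLite through Python's sqlite3 module instead of Snowflake (an assumption made only so the example runs locally; window functions need SQLite 3.25 or later). The helper name run and the constants are hypothetical.

```python
import sqlite3

# Same rows as the demo table above; timestamps stored as ISO text,
# which sorts chronologically in SQLite.
ROWS = [
    (1, "2020-12-14 09:30:00.000", "route_start", None),
    (1, "2020-12-14 09:30:00.100", "object_detected", 2),
    (1, "2020-12-14 09:30:00.300", "object_detected", 1),
    (1, "2020-12-14 09:30:05.000", "object_detected", 2),
    (1, "2020-12-14 09:30:40.000", "route_stop", None),
    (1, "2020-12-15 00:30:00.000", "route_start", None),
    (1, "2020-12-15 00:30:05.000", "object_detected", 2),
    (1, "2020-12-15 00:30:40.000", "route_stop", None),
]

# Inner query: starts counted up to and including the current row,
# stops counted only up to the previous row.
COUNTERS_SQL = """
select robot_id, timestamp, msg_type, obj_count,
    sum(case when msg_type = 'route_start' then 1 else 0 end)
        over (partition by robot_id order by timestamp) as cnt_start,
    sum(case when msg_type = 'route_stop' then 1 else 0 end)
        over (partition by robot_id order by timestamp
              rows between unbounded preceding and 1 preceding) as cnt_end
from mytable
"""

# Outer query: keep rows where one more start than stop has been seen,
# then aggregate per robot and per start counter.
ROUTES_SQL = f"""
select robot_id, min(timestamp) as route_start, max(timestamp) as route_end,
       sum(obj_count) as obj_count
from ({COUNTERS_SQL}) t
where cnt_start = coalesce(cnt_end, 0) + 1
group by robot_id, cnt_start
order by robot_id, cnt_start
"""


def run(sql, rows=ROWS):
    """Load rows into an in-memory table called mytable and run sql on it."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "create table mytable "
        "(robot_id integer, timestamp text, msg_type text, obj_count integer)"
    )
    con.executemany("insert into mytable values (?, ?, ?, ?)", rows)
    result = con.execute(sql).fetchall()
    con.close()
    return result


counters = run(COUNTERS_SQL + " order by robot_id, timestamp")
routes = run(ROUTES_SQL)

for row in counters:
    print(row)   # last two fields are cnt_start and cnt_end
for row in routes:
    print(row)
```

On the very first row the stop frame is empty, so cnt_end is NULL there, which is why the filter wraps it in coalesce. Because stops are counted only up to the previous row, the route_stop row itself still satisfies the condition and ends up as the route_end of its island.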

Answer 3

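If a robot has multiple starts, you want to assign each group an identifier, which you call route_id. The simplest approach is a cumulative sum of the 'start' rows. Then aggregate: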

select robot_id, route_id,
       min(timestamp) as route_start,
       max(timestamp) as route_end,
       sum(obj_count) as obj_count
from (select e.*,
             -- running count of route_start rows = route number per robot
             sum(case when msg_type = 'route_start' then 1 else 0 end) over (partition by robot_id order by timestamp) as route_id
      from stg_events e
     ) e
group by robot_id, route_id;

Note: this does not take 'route_stop' into account. It is unclear what you want when that is not the last row before the next start.
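
To make that caveat concrete, here is a small sketch of what the cumulative-sum grouping does when a row arrives after 'route_stop' but before the next 'route_start'. It is an illustration under assumptions: the query runs on SQLite through Python's sqlite3 module rather than on Snowflake (window functions need SQLite 3.25 or later), and the stray 09:45 detection is made-up test data, not part of the question.

```python
import sqlite3

# Sample rows with one hypothetical stray detection after the first route_stop.
ROWS = [
    (1, "2020-12-14 09:30:00.000", "route_start", None),
    (1, "2020-12-14 09:30:00.100", "object_detected", 2),
    (1, "2020-12-14 09:30:40.000", "route_stop", None),
    (1, "2020-12-14 09:45:00.000", "object_detected", 4),  # after the stop
    (1, "2020-12-15 00:30:00.000", "route_start", None),
    (1, "2020-12-15 00:30:05.000", "object_detected", 2),
    (1, "2020-12-15 00:30:40.000", "route_stop", None),
]

# Cumulative sum of route_start rows as route_id, then aggregate.
SQL = """
select robot_id, route_id,
       min(timestamp) as route_start,
       max(timestamp) as route_end,
       sum(obj_count) as obj_count
from (select e.*,
             sum(case when msg_type = 'route_start' then 1 else 0 end)
                 over (partition by robot_id order by timestamp) as route_id
      from stg_events e
     ) e
group by robot_id, route_id
order by robot_id, route_id
"""

con = sqlite3.connect(":memory:")
con.execute(
    "create table stg_events "
    "(robot_id integer, timestamp text, msg_type text, obj_count integer)"
)
con.executemany("insert into stg_events values (?, ?, ?, ?)", ROWS)
routes = con.execute(SQL).fetchall()
con.close()

for row in routes:
    print(row)
```

With this data the stray row is assigned to route 1, so that route's route_end becomes 09:45:00.000 and its obj_count includes the extra 4. If rows after a stop should be ignored, the grouping also has to look at 'route_stop', for example by counting stops as well as starts.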

sql

datetime

aggregate-functions

gaps-and-islands

snowflake-cloud-data-platform