SQL aggregation based on time and column change
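I have a situation where I am trying to compute the start time, the end time and the time spent in the different zones of a location.
I have data captured by a system, with a timestamp and the zone where the person was seen.
The normal case is that the zone changes, in which case the end time should be the last value seen before the change. The exception is when a person is not seen for 5 minutes or more, in which case endTime should be the time they were last seen (see rows 2 and 3 of the required aggregation output below).
Raw data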
date, zone
8h10m, room1
8h12m, room1
8h15m, hall
8h16m, hall
8h25m, hall
8h29m, hall
8h30m, room2
8h34m, room2
8h38m, room2
8h42m, room2
The aggregation/summary I need looks like this (or similar):
startDate, endDate, time, zone
8h10m, 8h12m, 3m, room1
8h15m, 8h16m, 2m, hall <-- special case time >5m
8h25m, 8h29m, 5m, hall
8h30m, 8h42m, 9m, room2
Can you tell me how to build this kind of "aggregation/summary" in SQL? I am using BigQuery, but I believe standard SQL should do the job.
Thanks,
Rui
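Building on Mikhail Berlyant's solution, the countif idea is used here to simplify the query. This answer identifies every move, including the case where the person re-enters a room within 5 minutes. See the additional rows in the sample table (tbl) below.
Several steps are needed:
- Add a row with zone --- whenever there has been no data for more than 5 minutes: over_5_minutes is set to true when the previous sighting (found with lag) is more than 5 minutes before the current row. unnest([0,1]) as x duplicates every row, and qualify keeps the x = 0 copy only in that case.
- Order all the following window expressions by the columns date, x, i.e. over(order by date, x).
- Use lag to get the previous room and the previous date. Because of the unnest over x, look two rows back.
- Compare the previous room with the current one and set zone_change to true if they differ.
- countif(zone_change), counted from the first row up to the current one, gives a zone_id that corresponds to a single stay in a zone.
- Add to this zone_id the running count of rows where x is 0, which are the rows marking 5 minutes without a position, so that such a gap also starts a new id.
- group by zone_id and compute the minimum and maximum date.
- Remove the --- rows by filtering.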
With tbl as
(
  SELECT TIME "8:10:00" as date, "room1" as zone
  UNION ALL SELECT TIME "8:12:00", "room1"
  UNION ALL SELECT TIME "8:15:00", "hall"
  UNION ALL SELECT TIME "8:16:00", "hall"
  UNION ALL SELECT TIME "8:25:00", "hall"
  UNION ALL SELECT TIME "8:29:00", "hall"
  UNION ALL SELECT TIME "8:30:00", "room2"
  UNION ALL SELECT TIME "8:34:00", "room2"
  UNION ALL SELECT TIME "8:38:00", "room2"
  UNION ALL SELECT TIME "8:42:00", "room2"
  -- additional rows: the person re-enters room2 within 5 minutes
  UNION ALL SELECT TIME "8:43:00", "hall"
  UNION ALL SELECT TIME "8:44:00", "room2"
)
SELECT
  zone_id,
  zone,
  min(date) as startDate,
  max(date) as endDate,
  time_diff(max(date),min(date),minute)+1 as time_minutes
FROM
(
  SELECT *,
    -- running count of gap markers plus running count of zone changes
    countif(x=0) over (ORDER BY date,x RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)+
    countif(zone_change) over (ORDER BY date,x RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as zone_id
  FROM
  (
    SELECT date,x,if(x=1,zone,"---") as zone,
      time_diff(date,lag(date,2) over (order by date),minute)>5 as over_5_minutes,
      zone!=lag(zone,2) over (order by date,x) as zone_change
    FROM tbl, unnest([0,1]) as x
    -- keep every x=1 row; keep the x=0 copy only after a gap of more than 5 minutes
    Qualify over_5_minutes or x=1
  )
)
where zone!="---"
group by 1,2
order by 1
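The rule described in the steps above (a new stay begins when the zone changes or when more than 5 minutes pass without a sighting) can also be checked outside the database. The following is a minimal Python sketch of that rule, not a translation of the query: the function name summarize, the constant MAX_GAP and the output tuple layout are assumptions made for illustration, and the rows are assumed to be sorted by time already.

```python
from datetime import datetime, timedelta

MAX_GAP = timedelta(minutes=5)  # assumed threshold, taken from the question


def summarize(rows, max_gap=MAX_GAP):
    """Collapse (time, zone) sightings, sorted by time, into stays.

    A new stay starts when the zone differs from the previous sighting
    or when the gap since the previous sighting exceeds max_gap.
    """
    stays = []
    for stamp, zone in rows:
        t = datetime.strptime(stamp, "%H:%M:%S")
        last = stays[-1] if stays else None
        if last and last["zone"] == zone and t - last["end"] <= max_gap:
            last["end"] = t  # same stay: push its end time forward
        else:
            stays.append({"zone": zone, "start": t, "end": t})
    # Minutes are counted inclusively (+1), as the queries in this thread do.
    return [
        (
            s["start"].strftime("%H:%M"),
            s["end"].strftime("%H:%M"),
            int((s["end"] - s["start"]).total_seconds() // 60) + 1,
            s["zone"],
        )
        for s in stays
    ]


# Same rows as tbl above, including the two re-entry rows at the end.
rows = [
    ("8:10:00", "room1"), ("8:12:00", "room1"),
    ("8:15:00", "hall"), ("8:16:00", "hall"),
    ("8:25:00", "hall"), ("8:29:00", "hall"),
    ("8:30:00", "room2"), ("8:34:00", "room2"),
    ("8:38:00", "room2"), ("8:42:00", "room2"),
    ("8:43:00", "hall"), ("8:44:00", "room2"),
]

for stay in summarize(rows):
    print(stay)
```

On these rows the sketch yields six stays; the last two (hall at 08:43 and room2 at 08:44) are the re-entry case, each lasting one minute under the inclusive count.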
Consider the approach below
select
  min(date) as startDate, max(date) as endDate,
  time_diff(max(date), min(date), minute) + 1 as time, zone
from (
  select *, countif(new_zone) over (partition by zone order by date) as zone_number
  from (
    select *,
      ifnull(date - lag(date) over (partition by zone order by date) > make_interval(minute => 5)
        or zone != lag(zone) over(order by date), true) as new_zone
    from your_table
  )
)
group by zone, zone_number
If applied to the sample data in your question
with your_table as (
  select time "8:10:00" as date, "room1" as zone union all
  select "8:12:00", "room1" union all
  select "8:15:00", "hall" union all
  select "8:16:00", "hall" union all
  select "8:25:00", "hall" union all
  select "8:29:00", "hall" union all
  select "8:30:00", "room2" union all
  select "8:34:00", "room2" union all
  select "8:38:00", "room2" union all
  select "8:42:00", "room2"
)
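the output has one row per stay: room1 (8:10 to 8:12), hall (8:15 to 8:16), hall (8:25 to 8:29) and room2 (8:30 to 8:42).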
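Since the question mentions that standard SQL should be enough, the flag-and-running-count idea can be tried locally without BigQuery. The sketch below is an assumption-laden variant, not the query above: it runs in SQLite (3.25 or later, for window functions) through Python's sqlite3 module, replaces countif with SUM over a CASE flag, uses one global ordering instead of partition by zone, and stores times as zero-padded 'HH:MM:SS' text. The table name sightings, the column ts and the helper run are hypothetical names chosen for illustration.

```python
import sqlite3

# Sample rows from the question, zero-padded so text order equals time order.
SIGHTINGS = [
    ("08:10:00", "room1"), ("08:12:00", "room1"),
    ("08:15:00", "hall"), ("08:16:00", "hall"),
    ("08:25:00", "hall"), ("08:29:00", "hall"),
    ("08:30:00", "room2"), ("08:34:00", "room2"),
    ("08:38:00", "room2"), ("08:42:00", "room2"),
]

QUERY = """
WITH t AS (
  SELECT ts, zone,
         -- minutes since midnight, parsed from 'HH:MM:SS' text
         CAST(substr(ts, 1, 2) AS INTEGER) * 60
           + CAST(substr(ts, 4, 2) AS INTEGER) AS m
  FROM sightings
),
flagged AS (
  SELECT *,
         CASE
           WHEN LAG(zone) OVER (ORDER BY m) IS NULL THEN 1  -- first row
           WHEN zone <> LAG(zone) OVER (ORDER BY m) THEN 1  -- zone changed
           WHEN m - LAG(m) OVER (ORDER BY m) > 5 THEN 1     -- gap over 5 minutes
           ELSE 0
         END AS new_stay
  FROM t
),
numbered AS (
  -- running count of flags: plays the role of countif(...) over (...)
  SELECT *, SUM(new_stay) OVER (ORDER BY m) AS stay_id
  FROM flagged
)
SELECT MIN(ts) AS startDate, MAX(ts) AS endDate,
       MAX(m) - MIN(m) + 1 AS minutes, zone
FROM numbered
GROUP BY stay_id, zone
ORDER BY startDate
"""


def run(rows=SIGHTINGS):
    """Load the rows into an in-memory SQLite table and return the stays."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE sightings (ts TEXT, zone TEXT)")
    con.executemany("INSERT INTO sightings VALUES (?, ?)", rows)
    result = con.execute(QUERY).fetchall()
    con.close()
    return result


for row in run():
    print(row)
```

With the inclusive minute count used throughout this thread, the room2 stay from 08:30 to 08:42 comes out as 13 minutes.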