How to merge (group) rows belonging to the same session
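A visitor can OPEN the front door of a house and then ENTER several rooms in the house. When he is done, he OPENs the front door again to leave the house. This gives the following sample data: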
13:00 John OPEN
13:00 John ENTER Hall
13:30 John ENTER Kitchen
13:45 John ENTER Living room
14:00 John OPEN
13:30 Steve OPEN
13:30 Steve ENTER Hall
13:40 Steve ENTER Stairs
14:00 Steve ENTER Bed room
16:00 Steve ENTER Stairs
16:10 Steve OPEN
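In other words, we always have one OPEN entry, followed by one or more ENTER entries, and finally another OPEN entry. Several visitors can be in the house at the same time, and a visitor can visit the house more than once; there are no restrictions.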
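Let us define a sequence from OPEN to OPEN as a session. I now want to create one row per session that contains all the events that occurred in it, like this: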
[13:00,14:00) John (13:00,Hall),(13:30,Kitchen),(13:45,Living room)
[13:30,16:10) Steve (13:30,Hall),(13:40,Stairs),(14:00,Bed room),(16:00,Stairs)
How can this be done efficiently?
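I have a working n^2 solution that first finds the first and last ENTER of each session (using the window functions lead and lag and then comparing with the previous row), and then searches for all the ENTER entries in between in an outer loop. This obviously performs poorly.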
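Is there a way to scan the dataset once, tag all ENTER entries belonging to the same session with a unique sequence number, and then group by that number at the end? I have been racking my brain over this.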
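One possible solution is to SUM the occurrences of OPEN up to the current row in the table (partitioned by name and ordered by time), add 1, and divide by 2 to get the current visit number. This relies on / being integer division: every visit contains exactly two OPEN rows, so the first visit yields 1, the second 2, and so on. The visit number can then be used to group the results: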
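WITH CTE AS (
    SELECT *,
           -- Running count of OPEN rows per visitor; each visit has two OPENs,
           -- so (1 + count) / 2 with integer division gives the visit number.
           (1 + SUM(CASE WHEN action = 'OPEN' THEN 1 ELSE 0 END)
                OVER (PARTITION BY name ORDER BY time)) / 2 AS access
    FROM data
)
-- One row per visitor and visit: start time, end time, and all events.
SELECT MIN(time), MAX(time), name, ARRAY_AGG(time || ',' || action) AS actions
FROM CTE
GROUP BY name, access
ORDER BY MIN(time), name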
Output (for my demo, which extends the sample data with a second visit by John):
min max name actions
13:00 14:00 John ["13:00,OPEN","13:00,ENTER Hall","13:30,ENTER Kitchen","13:45,ENTER Living room","14:00,OPEN"]
13:30 16:10 Steve ["13:30,OPEN","13:30,ENTER Hall","13:40,ENTER Stairs","14:00,ENTER Bed room","16:00,ENTER Stairs","16:10,OPEN"]
15:00 16:00 John ["15:00,OPEN","15:00,ENTER Hall","15:30,ENTER Kitchen","15:45,ENTER Living room","16:00,OPEN"]
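To see why adding 1 to the running OPEN count and halving it yields the visit number, the arithmetic can be traced outside the database. The following is only an illustrative sketch in plain Python, not part of the SQL solution: the name group_sessions is made up for this example, the rows list mirrors the sample data plus the second John visit shown in the output, and it assumes that rows with the same name and time keep their input order (OPEN before the first ENTER).

```python
from collections import defaultdict

# Sample events as (time, name, action); the second John visit comes from the demo output.
rows = [
    ("13:00", "John", "OPEN"),
    ("13:00", "John", "ENTER Hall"),
    ("13:30", "John", "ENTER Kitchen"),
    ("13:45", "John", "ENTER Living room"),
    ("14:00", "John", "OPEN"),
    ("13:30", "Steve", "OPEN"),
    ("13:30", "Steve", "ENTER Hall"),
    ("13:40", "Steve", "ENTER Stairs"),
    ("14:00", "Steve", "ENTER Bed room"),
    ("16:00", "Steve", "ENTER Stairs"),
    ("16:10", "Steve", "OPEN"),
    ("15:00", "John", "OPEN"),
    ("15:00", "John", "ENTER Hall"),
    ("15:30", "John", "ENTER Kitchen"),
    ("15:45", "John", "ENTER Living room"),
    ("16:00", "John", "OPEN"),
]


def group_sessions(events):
    """Group events into visits using the (1 + running OPEN count) // 2 rule."""
    open_count = defaultdict(int)  # running number of OPEN rows per visitor
    sessions = defaultdict(list)   # (name, visit number) -> [(time, action), ...]
    # sorted() is stable, so rows with equal (name, time) keep their input order.
    for time, name, action in sorted(events, key=lambda r: (r[1], r[0])):
        if action == "OPEN":
            open_count[name] += 1
        visit = (1 + open_count[name]) // 2  # counts 1,2 -> visit 1; 3,4 -> visit 2
        sessions[(name, visit)].append((time, action))
    # One result row per visit: start, end, name, events.
    result = [
        (evts[0][0], evts[-1][0], name, evts)
        for (name, _visit), evts in sessions.items()
    ]
    # Same ordering as the SQL query: by start time, then by name.
    return sorted(result, key=lambda r: (r[0], r[2]))


for start, end, name, evts in group_sessions(rows):
    print(start, end, name, evts)
```

After the sort, each visitor's rows are walked exactly once, which is the single scan the question asks about. Note that the sketch depends on OPEN preceding ENTER when both carry the same timestamp; if the input order were different, the first ENTER of a visit would be assigned to the wrong group.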