Return first and last timestamp for each ID in sequence with possible repeated and missing values
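I have a list of users, application IDs, and activity timestamps that track user activity throughout the day. The data is structured as shown below, with one event row per activity on each application ID: user A enters application 123 for 6 activities, then switches to application 456 and logs 4 activities, then returns to application 123 for additional activity, and so on.
I've tried using the lead() and lag() functions, but I run into problems with the structure of the data, especially when a given activity has only a single row. Below is a sample of my data.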
|User| APPL_ID | ACTIVITY_TIME
A 123 11/20/2020 08:11:45 AM
A 123 11/20/2020 08:11:45 AM
A 123 11/20/2020 08:11:45 AM
A 123 11/20/2020 08:17:13 AM
A 123 11/20/2020 08:17:13 AM
A 123 11/20/2020 08:30:00 AM
A 456 11/20/2020 09:45:02 AM
A 456 11/20/2020 09:45:02 AM
A 456 11/20/2020 09:55:15 AM
A 456 11/20/2020 09:59:45 AM
A 123 11/20/2020 10:35:00 AM
A 789 11/20/2020 10:45:15 AM
A 789 11/20/2020 10:50:33 AM
B 951 11/20/2020 08:15:15 AM
B 951 11/20/2020 08:15:15 AM
B 951 11/20/2020 08:33:37 AM
B 012 11/20/2020 09:13:00 AM
C 852 11/20/2020 07:45:25 AM
C 852 11/20/2020 07:47:41 AM
C 741 11/20/2020 08:00:22 AM
C 852 11/20/2020 08:25:23 AM
C 852 11/20/2020 08:25:23 AM
C 852 11/20/2020 08:25:23 AM
C 852 11/20/2020 08:29:46 AM
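Besides the user's first and last activity timestamps and appl_id, I also need to calculate the time the user spent in each application and the idle time between applications. Note the caveat for application 123 at 10:35, where only one activity was recorded, so the IN and OUT times are equal: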
|User| APPL_ID | IN_TIME | OUT_TIME | IN_OUT_MIN | IDLE_MIN
A 123 11/20/2020 08:11 AM 11/20/2020 08:30 AM 19.0 -
A 456 11/20/2020 09:45 AM 11/20/2020 09:59 AM 14.0 75.0
A 123 11/20/2020 10:35 AM 11/20/2020 10:35 AM 0.0 36.0
A 789 11/20/2020 10:45 AM 11/20/2020 10:50 AM 5.0 10.0
B 951 11/20/2020 08:15 AM 11/20/2020 08:33 AM 18.0 -
B 012 11/20/2020 09:13 AM 11/20/2020 09:13 AM 0.0 50.0
C 852 11/20/2020 07:45 AM 11/20/2020 07:47 AM 2.0 -
C 741 11/20/2020 08:00 AM 11/20/2020 08:00 AM 0.0 13.0
C 852 11/20/2020 08:25 AM 11/20/2020 08:29 AM 4.0 25.0
These are the calculations:
in_out_min = out_time - in_time
idle_min = in_time - previous out_time
If the previous OUT time is missing or comes from an earlier date, the idle_min calculation needs to return blank.
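The two formulas and the blank rule can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not part of the original question; the function names `minutes_between` and `idle_minutes` are hypothetical, and "blank" is modeled as `None`.

```python
from datetime import datetime


def minutes_between(start, end):
    """Elapsed minutes from start to end (in_out_min when given in_time, out_time)."""
    return (end - start).total_seconds() / 60


def idle_minutes(prev_out, in_time):
    """Minutes between the previous OUT time and this IN time.

    Returns None (blank) when there is no previous OUT time or when it
    falls on an earlier calendar date than in_time.
    """
    if prev_out is None or prev_out.date() < in_time.date():
        return None
    return minutes_between(prev_out, in_time)


# User A: out of application 456 at 09:59:45, back in application 123 at 10:35:00.
prev_out = datetime(2020, 11, 20, 9, 59, 45)
in_time = datetime(2020, 11, 20, 10, 35, 0)
print(idle_minutes(prev_out, in_time))
print(idle_minutes(None, in_time))
print(idle_minutes(datetime(2020, 11, 19, 17, 0, 0), in_time))
```

The sketch works on full timestamps with seconds, so its results can differ slightly from tables that show times truncated to the minute.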
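This is a gaps-and-islands problem. Here is one approach that uses the difference between row numbers to identify groups of "adjacent" rows (islands). To compute the duration of each gap, we can use window functions again: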
select user_id, appl_id,
       min(activity_time) as in_time,
       max(activity_time) as out_time,
       -- assuming activity_time is an Oracle DATE: subtraction yields days, * 24 * 60 gives minutes
       (max(activity_time) - min(activity_time)) * 24 * 60 as in_out_min,
       -- gap between this island's start and the same user's previous island's end
       (min(activity_time) - lag(max(activity_time)) over(partition by user_id order by min(activity_time))) * 24 * 60 as idle_min
from (
    select t.*,
           row_number() over(partition by user_id order by activity_time) rn1,
           row_number() over(partition by user_id, appl_id order by activity_time) rn2
    from mytable t
) t
-- rn1 - rn2 stays constant within each run of consecutive rows for the same application
group by user_id, appl_id, rn1 - rn2
order by user_id, in_time
Here is a demo on DB Fiddle (I rounded the durations for readability):
USER_ID | APPL_ID | IN_TIME | OUT_TIME | IN_OUT_MIN | IDLE_MIN
:------ | ------: | :------------------ | :------------------ | ---------: | -------:
A | 123 | 11/20/2020 08:11 AM | 11/20/2020 08:30 AM | 19 | null
A | 456 | 11/20/2020 09:45 AM | 11/20/2020 09:59 AM | 14 | 75
A | 123 | 11/20/2020 10:35 AM | 11/20/2020 10:35 AM | 0 | 36
A | 789 | 11/20/2020 10:45 AM | 11/20/2020 10:50 AM | 5 | 10
B | 951 | 11/20/2020 08:15 AM | 11/20/2020 08:33 AM | 18 | null
B | 12 | 11/20/2020 09:13 AM | 11/20/2020 09:13 AM | 0 | 40
C | 852 | 11/20/2020 07:45 AM | 11/20/2020 07:47 AM | 2 | null
C | 741 | 11/20/2020 08:00 AM | 11/20/2020 08:00 AM | 0 | 13
C | 852 | 11/20/2020 08:25 AM | 11/20/2020 08:29 AM | 4 | 25
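This is the final code that helped resolve the duplicate-timestamp problem. Note: credit goes to the user above (@GMB), who provided the answer that made this possible.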
select user_id, appl_id,
       min(activity_date) as in_time,
       max(activity_date) as out_time,
       -- DATE subtraction yields days; * 1440 converts to minutes, truncated to 2 decimals
       trunc((max(activity_date) - min(activity_date)) * 1440, 2) as in_out_min,
       trunc((min(activity_date) - lag(max(activity_date)) over(partition by user_id order by min(activity_date))) * 1440, 2) as idle_min
from (
    select activity_date, user_id, appl_id,
           row_number() over(partition by user_id order by activity_date) rn1,
           row_number() over(partition by user_id, appl_id order by activity_date) rn2
    from (
        -- collapse rows that share the same timestamp, user and application
        select activity_date, user_id, appl_id
        from cf.mytable tt
        where user_id in ('A','B','C','D')
          -- one full day; the upper bound is exclusive so next-day midnight rows are left out
          and activity_date >= trunc(sysdate - 4,'DD')
          and activity_date < trunc(sysdate - 3,'DD')
        group by activity_date, user_id, appl_id
    ) tt
) t
group by user_id, appl_id, rn1 - rn2
order by user_id, in_time