Return first and last timestamp for each ID in sequence with possible repeated and missing values

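I have a list of users, application IDs, and activity timestamps that track user activity throughout the day. The data is structured as shown below, with one event row per activity on each application ID: user A goes into application 123 for 6 activities, then switches to application 456 to log 4 activities, returns to application 123 for one additional activity, and so on.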

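I have tried using the lead() and lag() functions but ran into issues with the structure of the data, especially when only a single activity row is given. Below is a sample of my data.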

|User| APPL_ID | ACTIVITY_TIME
  A    123       11/20/2020 08:11:45 AM
  A    123       11/20/2020 08:11:45 AM
  A    123       11/20/2020 08:11:45 AM
  A    123       11/20/2020 08:17:13 AM
  A    123       11/20/2020 08:17:13 AM
  A    123       11/20/2020 08:30:00 AM
  A    456       11/20/2020 09:45:02 AM
  A    456       11/20/2020 09:45:02 AM
  A    456       11/20/2020 09:55:15 AM
  A    456       11/20/2020 09:59:45 AM
  A    123       11/20/2020 10:35:00 AM
  A    789       11/20/2020 10:45:15 AM
  A    789       11/20/2020 10:50:33 AM
  B    951       11/20/2020 08:15:15 AM
  B    951       11/20/2020 08:15:15 AM
  B    951       11/20/2020 08:33:37 AM
  B    012       11/20/2020 09:13:00 AM
  C    852       11/20/2020 07:45:25 AM
  C    852       11/20/2020 07:47:41 AM
  C    741       11/20/2020 08:00:22 AM
  C    852       11/20/2020 08:25:23 AM
  C    852       11/20/2020 08:25:23 AM
  C    852       11/20/2020 08:25:23 AM
  C    852       11/20/2020 08:29:46 AM

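Besides the first and last activity timestamp and the appl_id for each user, I also need to calculate the time the user spent on each application and the idle time between applications. Note the caveat for application 123 at 10:35, where only one activity was logged, so the IN and OUT times are equal: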

|User| APPL_ID |      IN_TIME        |      OUT_TIME     |   IN_OUT_MIN  |   IDLE_MIN
  A    123       11/20/2020 08:11 AM   11/20/2020 08:30 AM      19.0            -
  A    456       11/20/2020 09:45 AM   11/20/2020 09:59 AM      14.0           75.0
  A    123       11/20/2020 10:35 AM   11/20/2020 10:35 AM       0.0           36.0
  A    789       11/20/2020 10:45 AM   11/20/2020 10:50 AM       5.0           10.0
  B    951       11/20/2020 08:15 AM   11/20/2020 08:33 AM      18.0            -
  B    012       11/20/2020 09:13 AM   11/20/2020 09:13 AM       0.0           50.0
  C    852       11/20/2020 07:45 AM   11/20/2020 07:47 AM       2.0            -
  C    741       11/20/2020 08:00 AM   11/20/2020 08:00 AM       0.0           13.0
  C    852       11/20/2020 08:25 AM   11/20/2020 08:29 AM       4.0           25.0

These are the calculations:

in_out_min = out_time - in_time
idle_min = in_time - previous out_time

If the previous OUT time is missing or comes from an earlier date, the idle_min calculation needs to return blank.

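This is a gaps-and-islands problem. Here is one approach that uses the difference between two row numbers to identify groups of "adjacent" rows (the islands): within a run of consecutive rows for the same application, both row numbers increase together, so their difference stays constant. To compute the duration of each gap, we can use a window function again: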

select user_id, appl_id,
    min(activity_time) as in_time,
    max(activity_time) as out_time,
    (max(activity_time) - min(activity_time)) * 24 * 60 as in_out_min,
    (min(activity_time) - lag(max(activity_time)) over(partition by user_id order by min(activity_time))) * 24 * 60 as idle_min
from (
    select t.*,
        row_number() over(partition by user_id order by activity_time) rn1,
        row_number() over(partition by user_id, appl_id order by activity_time) rn2
    from mytable t
) t
group by user_id, appl_id, rn1 - rn2
order by user_id, in_time
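To see why rn1 - rn2 works as a grouping key, it can help to print both row numbers side by side. The following is a minimal sketch, not part of the original answer: it assumes Python's built-in sqlite3 module (SQLite 3.25 or later for window functions) as a stand-in for the real database, stores the timestamps as ISO text, and uses only the distinct rows of user A from the question.

```python
import sqlite3

# Distinct sample rows for user A from the question: (user_id, appl_id, activity_time).
rows = [
    ("A", 123, "2020-11-20 08:11:45"),
    ("A", 123, "2020-11-20 08:17:13"),
    ("A", 123, "2020-11-20 08:30:00"),
    ("A", 456, "2020-11-20 09:45:02"),
    ("A", 456, "2020-11-20 09:59:45"),
    ("A", 123, "2020-11-20 10:35:00"),
    ("A", 789, "2020-11-20 10:45:15"),
    ("A", 789, "2020-11-20 10:50:33"),
]

# In-memory SQLite database standing in for the real table (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("create table mytable (user_id text, appl_id integer, activity_time text)")
conn.executemany("insert into mytable values (?, ?, ?)", rows)

# Same two row_number() calls as the inner query of the answer.
island_keys = conn.execute("""
    select appl_id, activity_time,
        row_number() over(partition by user_id order by activity_time) as rn1,
        row_number() over(partition by user_id, appl_id order by activity_time) as rn2
    from mytable
    order by user_id, activity_time
""").fetchall()

for appl_id, activity_time, rn1, rn2 in island_keys:
    print(appl_id, activity_time, rn1, rn2, rn1 - rn2)
```

For these rows the difference comes out as 0, 0, 0, 3, 3, 2, 6, 6. The second visit to application 123 gets a different value (2) than the first visit (0), which is why grouping by user_id, appl_id and rn1 - rn2 keeps the two visits apart.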

Here is a demo of the grouping query on DB Fiddle (I rounded the durations for readability):

USER_ID | APPL_ID | IN_TIME             | OUT_TIME            | IN_OUT_MIN | IDLE_MIN
:------ | ------: | :------------------ | :------------------ | ---------: | -------:
A       |     123 | 11/20/2020 08:11 AM | 11/20/2020 08:30 AM |         19 |     null
A       |     456 | 11/20/2020 09:45 AM | 11/20/2020 09:59 AM |         14 |       75
A       |     123 | 11/20/2020 10:35 AM | 11/20/2020 10:35 AM |          0 |       36
A       |     789 | 11/20/2020 10:45 AM | 11/20/2020 10:50 AM |          5 |       10
B       |     951 | 11/20/2020 08:15 AM | 11/20/2020 08:33 AM |         18 |     null
B       |      12 | 11/20/2020 09:13 AM | 11/20/2020 09:13 AM |          0 |       40
C       |     852 | 11/20/2020 07:45 AM | 11/20/2020 07:47 AM |          2 |     null
C       |     741 | 11/20/2020 08:00 AM | 11/20/2020 08:00 AM |          0 |       13
C       |     852 | 11/20/2020 08:25 AM | 11/20/2020 08:29 AM |          4 |       25
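The question also asks for idle_min to be blank when the previous OUT time belongs to an earlier date, and the sample data covers a single day only. One possible way to handle that, sketched here under assumptions, is to add the calendar day to the partition of lag(). The sketch again uses Python's sqlite3 module as a stand-in, with hypothetical two-day rows and SQLite's date() and julianday() functions; in Oracle the corresponding partition expression would presumably be trunc(min(activity_time)), which I have not tested against the fiddle.

```python
import sqlite3

# Hypothetical rows: user A is active on two different days.
rows = [
    ("A", 123, "2020-11-19 17:00:00"),
    ("A", 123, "2020-11-19 17:30:00"),
    ("A", 456, "2020-11-20 09:45:00"),
    ("A", 456, "2020-11-20 09:59:00"),
    ("A", 123, "2020-11-20 10:35:00"),
]

conn = sqlite3.connect(":memory:")
conn.execute("create table mytable (user_id text, appl_id integer, activity_time text)")
conn.executemany("insert into mytable values (?, ?, ?)", rows)

islands = conn.execute("""
    with islands as (
        -- same row-number-difference grouping as in the answer
        select user_id, appl_id,
            min(activity_time) as in_time,
            max(activity_time) as out_time
        from (
            select t.*,
                row_number() over(partition by user_id order by activity_time) as rn1,
                row_number() over(partition by user_id, appl_id order by activity_time) as rn2
            from mytable t
        ) numbered
        group by user_id, appl_id, rn1 - rn2
    )
    select user_id, appl_id, in_time, out_time,
        -- lag() only sees earlier islands of the same user on the same day,
        -- so the first island of each day gets a null idle_min
        round((julianday(in_time) - julianday(
            lag(out_time) over(partition by user_id, date(in_time) order by in_time)
        )) * 1440, 1) as idle_min
    from islands
    order by user_id, in_time
""").fetchall()

for row in islands:
    print(row)
```

With these rows, the first island of each day has no idle_min, and the return to application 123 on the second day shows 36.0 minutes. A visit that spans midnight would need a separate decision about which day it belongs to; this sketch simply uses the day of in_time.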

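This is the final code that helped resolve the duplicate timestamp issue. Note: credit goes to the user above (@GMB), who provided the answer that made this possible.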

select user_id, appl_id,
    min(activity_date) as in_time,
    max(activity_date) as out_time,
    trunc((max(activity_date) - min(activity_date)) * 1440, 2) as in_out_min,
    trunc((min(activity_date) - lag(max(activity_date)) over(partition by user_id order by min(activity_date))) * 1440, 2) as idle_min
from (
    select activity_date, user_id, appl_id,
        row_number() over(partition by user_id order by activity_date) rn1,
        row_number() over(partition by user_id, appl_id order by activity_date) rn2
    from (
        -- collapse rows that repeat the same timestamp for a user and application
        select activity_date, user_id, appl_id, count(*)
        from cf.mytable tt
        where user_id in ('A','B','C','D')
            and activity_date >= trunc(sysdate - 4,'DD')
            and activity_date <= trunc(sysdate - 3,'DD')
        group by activity_date, user_id, appl_id
    ) tt
) t
group by user_id, appl_id, rn1 - rn2
order by user_id, in_time
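To illustrate what the innermost subquery contributes, here is a small sketch of the de-duplication step on its own. It is an illustration under assumptions rather than the production query: it runs on Python's sqlite3 module instead of Oracle, drops the cf schema prefix and the sysdate filter, and uses a subset of user A's rows from the question, including the repeated timestamps.

```python
import sqlite3

# Subset of user A's rows from the question, including repeated timestamps:
# (user_id, appl_id, activity_date).
rows = (
    [("A", 123, "2020-11-20 08:11:45")] * 3
    + [("A", 123, "2020-11-20 08:17:13")] * 2
    + [("A", 123, "2020-11-20 08:30:00")]
    + [("A", 456, "2020-11-20 09:45:02")] * 2
    + [("A", 456, "2020-11-20 09:55:15")]
)

conn = sqlite3.connect(":memory:")
conn.execute("create table mytable (activity_date text, user_id text, appl_id integer)")
conn.executemany(
    "insert into mytable (user_id, appl_id, activity_date) values (?, ?, ?)", rows
)

# Same grouping as the innermost subquery: one row per timestamp, user and application.
deduped = conn.execute("""
    select activity_date, user_id, appl_id, count(*)
    from mytable
    group by activity_date, user_id, appl_id
    order by activity_date
""").fetchall()

print(len(rows), "rows before,", len(deduped), "rows after")
for row in deduped:
    print(row)
```

The nine input rows collapse to five, one per distinct timestamp, so the row_number() calls in the outer query number each timestamp only once per user and application.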