How to group sessions that have timestamps close to each other?

My use case requires me to treat sessions that are less than 60 seconds apart as the same session.

The data looks like this.

Min_Timestamp                Max_Timestamp                Device_ID  Session_ID  Prev_Max_Timestamp           Diff_Sec
2019-12-03 23:05:30.416 UTC  2019-12-03 23:09:13.502 UTC  AAAAA      I90HYTRFJI  null                         null
2019-12-03 23:09:21.517 UTC  2019-12-03 23:09:53.353 UTC  AAAAA      98UHIGSNJR  2019-12-03 23:09:13.502 UTC  8
2019-12-03 00:00:28.933 UTC  2019-12-03 00:09:03.473 UTC  BBBBB      32QE8Y76TG  null                         null
2019-12-03 00:09:19.106 UTC  2019-12-03 00:23:26.554 UTC  BBBBB      R4GUY432AD  2019-12-03 00:09:03.473 UTC  16
2019-12-03 00:23:26.818 UTC  2019-12-03 00:23:26.837 UTC  BBBBB      E32GUYE328  2019-12-03 00:23:26.554 UTC  0
2019-12-03 17:00:32.160 UTC  2019-12-03 17:03:48.758 UTC  BBBBB      GY1EW32876  2019-12-03 00:23:26.837 UTC  59825
2019-12-03 17:03:58.069 UTC  2019-12-03 17:17:12.408 UTC  BBBBB      2876T128Y7  2019-12-03 17:03:48.758 UTC  9
2019-12-03 17:18:24.528 UTC  2019-12-03 17:18:27.516 UTC  BBBBB      098U6598U5  2019-12-03 17:17:12.408 UTC  73
2019-12-03 16:30:29.970 UTC  2019-12-03 18:44:18.972 UTC  CCCCC      UWI4UII2J4  null                         null
2019-12-04 17:32:19.285 UTC  2019-12-04 17:32:24.668 UTC  CCCCC      G3247ROIUH  2019-12-03 18:44:18.972 UTC  82080

Sessions that are less than 60 seconds apart should be grouped together, but still kept separate per device. It would look like this.

Min_Timestamp                Max_Timestamp                Device_ID  Session_ID  Prev_Max_Timestamp           Diff_Sec
2019-12-03 23:05:30.416 UTC  2019-12-03 23:09:13.502 UTC  AAAAA      I90HYTRFJI  null                         null
2019-12-03 23:09:21.517 UTC  2019-12-03 23:09:53.353 UTC  AAAAA      98UHIGSNJR  2019-12-03 23:09:13.502 UTC  8

2019-12-03 00:00:28.933 UTC  2019-12-03 00:09:03.473 UTC  BBBBB      32QE8Y76TG  null                         null
2019-12-03 00:09:19.106 UTC  2019-12-03 00:23:26.554 UTC  BBBBB      R4GUY432AD  2019-12-03 00:09:03.473 UTC  16
2019-12-03 00:23:26.818 UTC  2019-12-03 00:23:26.837 UTC  BBBBB      E32GUYE328  2019-12-03 00:23:26.554 UTC  0

2019-12-03 17:00:32.160 UTC  2019-12-03 17:03:48.758 UTC  BBBBB      GY1EW32876  2019-12-03 00:23:26.837 UTC  59825
2019-12-03 17:03:58.069 UTC  2019-12-03 17:17:12.408 UTC  BBBBB      2876T128Y7  2019-12-03 17:03:48.758 UTC  9
2019-12-03 17:18:24.528 UTC  2019-12-03 17:18:27.516 UTC  BBBBB      098U6598U5  2019-12-03 17:17:12.408 UTC  73

2019-12-03 16:30:29.970 UTC  2019-12-03 18:44:18.972 UTC  CCCCC      UWI4UII2J4  null                         null

2019-12-04 17:32:19.285 UTC  2019-12-04 17:32:24.668 UTC  CCCCC      G3247ROIUH  2019-12-03 18:44:18.972 UTC  82080

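I would like to end up with something like this. The Session_ID does not need to look like A1, B1, C1, etc.; it could simply be the first Session_ID value of the group. Note that the latest Max_Timestamp in each group is now the new Max_Timestamp.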

Min_Timestamp                Max_Timestamp                Device_ID  Session_ID
2019-12-03 23:05:30.416 UTC  2019-12-03 23:09:53.353 UTC  AAAAA      A1          
2019-12-03 00:00:28.933 UTC  2019-12-03 00:23:26.837 UTC  BBBBB      B1
2019-12-03 17:00:32.160 UTC  2019-12-03 17:18:27.516 UTC  BBBBB      B2
2019-12-03 16:30:29.970 UTC  2019-12-03 18:44:18.972 UTC  CCCCC      C1
2019-12-04 17:32:19.285 UTC  2019-12-04 17:32:24.668 UTC  CCCCC      C2

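My idea is to make the Session_ID the same for all rows that belong to the same group, and then group by Device_ID, Session_ID to get min(Min_Timestamp) and max(Max_Timestamp). I tried fiddling with first_value() on Session_ID, but I couldn't figure out how to partition it properly.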

Preferably this would be done in Legacy SQL. If not, Standard SQL is fine too.

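Below is for BigQuery Standard SQL. If you want, you can just "translate" it to Legacy SQL, but migrating to Standard SQL is recommended anyway, so you might as well do it now and use the query below.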

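#standardSQL
SELECT MIN(Min_Timestamp) AS Min_Timestamp, MAX(Max_Timestamp) AS Max_Timestamp, Device_ID, Session_ID
FROM (
  -- The running count of flags per device is the group number; prefix it with Device_ID to build the new Session_ID
  SELECT * EXCEPT(flag, Session_ID), 
    CONCAT(Device_ID, CAST(COUNTIF(flag) OVER(PARTITION BY Device_ID ORDER BY Max_Timestamp) AS STRING)) AS Session_ID
  FROM (
    -- flag is TRUE when a row starts a new group, i.e. the gap to the previous session on the same device exceeds 60 seconds
    -- (use >= 60 if a gap of exactly 60 seconds should also start a new group).
    -- The first row per device has no previous session (LAG returns NULL), so IFNULL substitutes 999 to force flag = TRUE.
    SELECT *, 
      IFNULL(TIMESTAMP_DIFF(Min_Timestamp, LAG(Max_Timestamp) OVER(PARTITION BY Device_ID ORDER BY Max_Timestamp), SECOND), 999) > 60 flag
    FROM `project.dataset.table`
  )
)
GROUP BY Device_ID, Session_ID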

You can test and play with the above using the sample data from your question, as in the example below.

#standardSQL
WITH `project.dataset.table` AS (
  SELECT TIMESTAMP '2019-12-03 23:05:30.416 UTC' Min_Timestamp, TIMESTAMP '2019-12-03 23:09:13.502 UTC' Max_Timestamp, 'AAAAA' Device_ID, 'I90HYTRFJI' Session_ID UNION ALL
  SELECT '2019-12-03 23:09:21.517 UTC', '2019-12-03 23:09:53.353 UTC', 'AAAAA', '98UHIGSNJR' UNION ALL
  SELECT '2019-12-03 00:00:28.933 UTC', '2019-12-03 00:09:03.473 UTC', 'BBBBB', '32QE8Y76TG' UNION ALL
  SELECT '2019-12-03 00:09:19.106 UTC', '2019-12-03 00:23:26.554 UTC', 'BBBBB', 'R4GUY432AD' UNION ALL
  SELECT '2019-12-03 00:23:26.818 UTC', '2019-12-03 00:23:26.837 UTC', 'BBBBB', 'E32GUYE328' UNION ALL
  SELECT '2019-12-03 17:00:32.160 UTC', '2019-12-03 17:03:48.758 UTC', 'BBBBB', 'GY1EW32876' UNION ALL
  SELECT '2019-12-03 17:03:58.069 UTC', '2019-12-03 17:17:12.408 UTC', 'BBBBB', '2876T128Y7' UNION ALL
  SELECT '2019-12-03 17:18:24.528 UTC', '2019-12-03 17:18:27.516 UTC', 'BBBBB', '098U6598U5' UNION ALL
  SELECT '2019-12-03 16:30:29.970 UTC', '2019-12-03 18:44:18.972 UTC', 'CCCCC', 'UWI4UII2J4' UNION ALL
  SELECT '2019-12-04 17:32:19.285 UTC', '2019-12-04 17:32:24.668 UTC', 'CCCCC', 'G3247ROIUH' 
)
SELECT MIN(Min_Timestamp) AS Min_Timestamp, MAX(Max_Timestamp) AS Max_Timestamp, Device_ID, Session_ID
FROM (
  SELECT * EXCEPT(flag, Session_ID), 
    CONCAT(Device_ID, CAST(COUNTIF(flag) OVER(PARTITION BY Device_ID ORDER BY Max_Timestamp) AS STRING)) AS Session_ID
  FROM (
    SELECT *, 
      IFNULL(TIMESTAMP_DIFF(Min_Timestamp, LAG(Max_Timestamp) OVER(PARTITION BY Device_ID ORDER BY Max_Timestamp), SECOND), 999) > 60 flag
    FROM `project.dataset.table`
  )
)
GROUP BY Device_ID, Session_ID
-- ORDER BY Device_ID, Session_ID  

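Output

Note that the last BBBBB session starts more than 60 seconds after the previous one ends (Diff_Sec = 73 in your data), so it forms its own group (BBBBB3) instead of being merged into the second one as in your expected result.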

Row Min_Timestamp               Max_Timestamp               Device_ID   Session_ID   
1   2019-12-03 23:05:30.416 UTC 2019-12-03 23:09:53.353 UTC AAAAA       AAAAA1   
2   2019-12-03 00:00:28.933 UTC 2019-12-03 00:23:26.837 UTC BBBBB       BBBBB1   
3   2019-12-03 17:00:32.160 UTC 2019-12-03 17:17:12.408 UTC BBBBB       BBBBB2   
4   2019-12-03 17:18:24.528 UTC 2019-12-03 17:18:27.516 UTC BBBBB       BBBBB3   
5   2019-12-03 16:30:29.970 UTC 2019-12-03 18:44:18.972 UTC CCCCC       CCCCC1   
6   2019-12-04 17:32:19.285 UTC 2019-12-04 17:32:24.668 UTC CCCCC       CCCCC2
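
If it helps to see why the query groups rows this way, here is a minimal Python sketch of the same flag-and-running-count idea, traced outside BigQuery. It is only an illustration under stated assumptions: the names ROWS, GAP_SECONDS, parse and merge_sessions are made up for this example, and the gap is compared in fractional seconds, whereas TIMESTAMP_DIFF(..., SECOND) returns an integer, so results could differ for gaps right at the threshold.

```python
from datetime import datetime
from itertools import groupby

GAP_SECONDS = 60  # same threshold as the `> 60` comparison in the query

# (Min_Timestamp, Max_Timestamp, Device_ID, Session_ID), copied from the question
ROWS = [
    ("2019-12-03 23:05:30.416", "2019-12-03 23:09:13.502", "AAAAA", "I90HYTRFJI"),
    ("2019-12-03 23:09:21.517", "2019-12-03 23:09:53.353", "AAAAA", "98UHIGSNJR"),
    ("2019-12-03 00:00:28.933", "2019-12-03 00:09:03.473", "BBBBB", "32QE8Y76TG"),
    ("2019-12-03 00:09:19.106", "2019-12-03 00:23:26.554", "BBBBB", "R4GUY432AD"),
    ("2019-12-03 00:23:26.818", "2019-12-03 00:23:26.837", "BBBBB", "E32GUYE328"),
    ("2019-12-03 17:00:32.160", "2019-12-03 17:03:48.758", "BBBBB", "GY1EW32876"),
    ("2019-12-03 17:03:58.069", "2019-12-03 17:17:12.408", "BBBBB", "2876T128Y7"),
    ("2019-12-03 17:18:24.528", "2019-12-03 17:18:27.516", "BBBBB", "098U6598U5"),
    ("2019-12-03 16:30:29.970", "2019-12-03 18:44:18.972", "CCCCC", "UWI4UII2J4"),
    ("2019-12-04 17:32:19.285", "2019-12-04 17:32:24.668", "CCCCC", "G3247ROIUH"),
]


def parse(ts):
    """Parse a 'YYYY-MM-DD HH:MM:SS.mmm' string into a datetime."""
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S.%f")


def merge_sessions(rows, gap_seconds=GAP_SECONDS):
    """Merge sessions per device; returns (min_ts, max_ts, device, label) tuples."""
    merged = []
    # Plays the role of PARTITION BY Device_ID ORDER BY Max_Timestamp
    ordered = sorted(rows, key=lambda r: (r[2], parse(r[1])))
    for device, group in groupby(ordered, key=lambda r: r[2]):
        group_no = 0     # running COUNTIF(flag) within the device
        prev_max = None  # LAG(Max_Timestamp); None for the first row
        for min_ts, max_ts, _, _ in group:
            start, end = parse(min_ts), parse(max_ts)
            # A row starts a new group if it is the first one for the device
            # or the gap to the previous session is larger than the threshold.
            flag = prev_max is None or (start - prev_max).total_seconds() > gap_seconds
            if flag:
                group_no += 1
                merged.append([start, end, device, f"{device}{group_no}"])
            else:
                # Same group: widen it, like MIN(...) / MAX(...) with GROUP BY
                merged[-1][0] = min(merged[-1][0], start)
                merged[-1][1] = max(merged[-1][1], end)
            prev_max = end
    return [tuple(m) for m in merged]


for min_ts, max_ts, device, label in merge_sessions(ROWS):
    print(min_ts, max_ts, device, label)
```

On the sample data this produces the same six groups as the query output above, AAAAA1 through CCCCC2. Passing a different gap_seconds shows how the grouping changes with the threshold.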