How to group sessions that have timestamps close to each other?
My scenario requires me to treat sessions that are less than 60 seconds apart as the same session.
The data looks like this.
Min_Timestamp                Max_Timestamp                Device_ID  Session_ID  Prev_Max_Timestamp           Diff_Sec
2019-12-03 23:05:30.416 UTC  2019-12-03 23:09:13.502 UTC  AAAAA      I90HYTRFJI  null                         null
2019-12-03 23:09:21.517 UTC  2019-12-03 23:09:53.353 UTC  AAAAA      98UHIGSNJR  2019-12-03 23:09:13.502 UTC  8
2019-12-03 00:00:28.933 UTC  2019-12-03 00:09:03.473 UTC  BBBBB      32QE8Y76TG  null                         null
2019-12-03 00:09:19.106 UTC  2019-12-03 00:23:26.554 UTC  BBBBB      R4GUY432AD  2019-12-03 00:09:03.473 UTC  16
2019-12-03 00:23:26.818 UTC  2019-12-03 00:23:26.837 UTC  BBBBB      E32GUYE328  2019-12-03 00:23:26.554 UTC  0
2019-12-03 17:00:32.160 UTC  2019-12-03 17:03:48.758 UTC  BBBBB      GY1EW32876  2019-12-03 00:23:26.837 UTC  59825
2019-12-03 17:03:58.069 UTC  2019-12-03 17:17:12.408 UTC  BBBBB      2876T128Y7  2019-12-03 17:03:48.758 UTC  9
2019-12-03 17:18:24.528 UTC  2019-12-03 17:18:27.516 UTC  BBBBB      098U6598U5  2019-12-03 17:17:12.408 UTC  73
2019-12-03 16:30:29.970 UTC  2019-12-03 18:44:18.972 UTC  CCCCC      UWI4UII2J4  null                         null
2019-12-04 17:32:19.285 UTC  2019-12-04 17:32:24.668 UTC  CCCCC      G3247ROIUH  2019-12-03 18:44:18.972 UTC  82080
Sessions that are less than 60 seconds apart should be grouped together, while still being kept separate per device.
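I would like to end up with something like this. Session_ID does not need to look like A1, B1, C1, etc.; it can simply be the first Session_ID value in the group. Note that the latest Max_Timestamp in a group becomes the new Max_Timestamp.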
Min_Timestamp                Max_Timestamp                Device_ID  Session_ID
2019-12-03 23:05:30.416 UTC  2019-12-03 23:09:53.353 UTC  AAAAA      A1
2019-12-03 00:00:28.933 UTC  2019-12-03 00:23:26.837 UTC  BBBBB      B1
2019-12-03 17:00:32.160 UTC  2019-12-03 17:18:27.516 UTC  BBBBB      B2
2019-12-03 16:30:29.970 UTC  2019-12-03 18:44:18.972 UTC  CCCCC      C1
2019-12-04 17:32:19.285 UTC  2019-12-04 17:32:24.668 UTC  CCCCC      C2
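My idea is to make Session_ID the same for all rows that belong to the same group, and then group by Device_ID and Session_ID to get min(Min_Timestamp) and max(Max_Timestamp).
I tried fiddling with first_value() on Session_ID, but I can't figure out how to partition properly.
Ideally this would be done in Legacy SQL. If not, Standard SQL is fine too.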
Below is for BigQuery Standard SQL (if you want, you can just "translate" it to Legacy SQL, but migrating to Standard SQL is recommended anyway, so you may as well do it now and use the query below)
#standardSQL
SELECT
  MIN(Min_Timestamp) AS Min_Timestamp,
  MAX(Max_Timestamp) AS Max_Timestamp,
  Device_ID,
  Session_ID
FROM (
  -- The running count of flags per device numbers the groups; prefixing Device_ID builds the new Session_ID
  SELECT * EXCEPT(flag, Session_ID),
    CONCAT(Device_ID, CAST(COUNTIF(flag) OVER(PARTITION BY Device_ID ORDER BY Max_Timestamp) AS STRING)) AS Session_ID
  FROM (
    -- flag is TRUE when a row starts a new group: the first row of a device (no previous row, hence 999) or a gap over 60 seconds
    SELECT *,
      IFNULL(TIMESTAMP_DIFF(Min_Timestamp, LAG(Max_Timestamp) OVER(PARTITION BY Device_ID ORDER BY Max_Timestamp), SECOND), 999) > 60 AS flag
    FROM `project.dataset.table`
  )
)
GROUP BY Device_ID, Session_ID
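To see why this groups the rows the way it does, it can help to look at the intermediate values before they are aggregated. The following is only an inspection sketch under the same assumptions as the query above (same `project.dataset.table` and column names); it drops the outer GROUP BY and exposes the flag and its running count, which is given the illustrative name grp.

```sql
#standardSQL
-- Inspection sketch only: same inner steps as the query above, without the final aggregation.
-- The column name grp is an illustrative choice, not part of the original query.
SELECT
  Device_ID,
  Session_ID,
  Min_Timestamp,
  Max_Timestamp,
  flag,
  -- number of group starts seen so far for this device, i.e. the group number of the row
  COUNTIF(flag) OVER(PARTITION BY Device_ID ORDER BY Max_Timestamp) AS grp
FROM (
  SELECT *,
    IFNULL(TIMESTAMP_DIFF(Min_Timestamp, LAG(Max_Timestamp) OVER(PARTITION BY Device_ID ORDER BY Max_Timestamp), SECOND), 999) > 60 AS flag
  FROM `project.dataset.table`
)
ORDER BY Device_ID, Max_Timestamp
```

For the six sample rows of device BBBBB, flag should come out as true, false, false, true, false, true, so grp runs 1, 1, 1, 2, 2, 3. The last row starts its own group because its gap to the previous row is over 60 seconds.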
You can test and play with the full query above using the sample data from your question, as in the example below
#standardSQL
WITH `project.dataset.table` AS (
  SELECT TIMESTAMP '2019-12-03 23:05:30.416 UTC' Min_Timestamp, TIMESTAMP '2019-12-03 23:09:13.502 UTC' Max_Timestamp, 'AAAAA' Device_ID, 'I90HYTRFJI' Session_ID UNION ALL
  SELECT '2019-12-03 23:09:21.517 UTC', '2019-12-03 23:09:53.353 UTC', 'AAAAA', '98UHIGSNJR' UNION ALL
  SELECT '2019-12-03 00:00:28.933 UTC', '2019-12-03 00:09:03.473 UTC', 'BBBBB', '32QE8Y76TG' UNION ALL
  SELECT '2019-12-03 00:09:19.106 UTC', '2019-12-03 00:23:26.554 UTC', 'BBBBB', 'R4GUY432AD' UNION ALL
  SELECT '2019-12-03 00:23:26.818 UTC', '2019-12-03 00:23:26.837 UTC', 'BBBBB', 'E32GUYE328' UNION ALL
  SELECT '2019-12-03 17:00:32.160 UTC', '2019-12-03 17:03:48.758 UTC', 'BBBBB', 'GY1EW32876' UNION ALL
  SELECT '2019-12-03 17:03:58.069 UTC', '2019-12-03 17:17:12.408 UTC', 'BBBBB', '2876T128Y7' UNION ALL
  SELECT '2019-12-03 17:18:24.528 UTC', '2019-12-03 17:18:27.516 UTC', 'BBBBB', '098U6598U5' UNION ALL
  SELECT '2019-12-03 16:30:29.970 UTC', '2019-12-03 18:44:18.972 UTC', 'CCCCC', 'UWI4UII2J4' UNION ALL
  SELECT '2019-12-04 17:32:19.285 UTC', '2019-12-04 17:32:24.668 UTC', 'CCCCC', 'G3247ROIUH'
)
SELECT MIN(Min_Timestamp) AS Min_Timestamp, MAX(Max_Timestamp) AS Max_Timestamp, Device_ID, Session_ID
FROM (
  SELECT * EXCEPT(flag, Session_ID),
    CONCAT(Device_ID, CAST(COUNTIF(flag) OVER(PARTITION BY Device_ID ORDER BY Max_Timestamp) AS STRING)) AS Session_ID
  FROM (
    SELECT *,
      IFNULL(TIMESTAMP_DIFF(Min_Timestamp, LAG(Max_Timestamp) OVER(PARTITION BY Device_ID ORDER BY Max_Timestamp), SECOND), 999) > 60 AS flag
    FROM `project.dataset.table`
  )
)
GROUP BY Device_ID, Session_ID
-- ORDER BY Device_ID, Session_ID
with output
Row  Min_Timestamp                Max_Timestamp                Device_ID  Session_ID
1    2019-12-03 23:05:30.416 UTC  2019-12-03 23:09:53.353 UTC  AAAAA      AAAAA1
2    2019-12-03 00:00:28.933 UTC  2019-12-03 00:23:26.837 UTC  BBBBB      BBBBB1
3    2019-12-03 17:00:32.160 UTC  2019-12-03 17:17:12.408 UTC  BBBBB      BBBBB2
4    2019-12-03 17:18:24.528 UTC  2019-12-03 17:18:27.516 UTC  BBBBB      BBBBB3
5    2019-12-03 16:30:29.970 UTC  2019-12-03 18:44:18.972 UTC  CCCCC      CCCCC1
6    2019-12-04 17:32:19.285 UTC  2019-12-04 17:32:24.668 UTC  CCCCC      CCCCC2
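The question mentions that Session_ID could simply be the first value in the group. A possible variation along those lines, offered as a sketch rather than a tested drop-in: keep the running count only as an internal grouping key (named grp here purely for illustration) and pick the original Session_ID of the earliest row in each group with ARRAY_AGG ... ORDER BY ... LIMIT 1. It assumes the same `project.dataset.table` and columns as above.

```sql
#standardSQL
-- Sketch: label each merged group with the original Session_ID of its earliest row.
-- grp is a hypothetical helper column used only for grouping.
SELECT
  MIN(Min_Timestamp) AS Min_Timestamp,
  MAX(Max_Timestamp) AS Max_Timestamp,
  Device_ID,
  -- one-element array holding the Session_ID of the row with the smallest Min_Timestamp
  ARRAY_AGG(Session_ID ORDER BY Min_Timestamp LIMIT 1)[OFFSET(0)] AS Session_ID
FROM (
  SELECT *,
    COUNTIF(flag) OVER(PARTITION BY Device_ID ORDER BY Max_Timestamp) AS grp
  FROM (
    SELECT *,
      IFNULL(TIMESTAMP_DIFF(Min_Timestamp, LAG(Max_Timestamp) OVER(PARTITION BY Device_ID ORDER BY Max_Timestamp), SECOND), 999) > 60 AS flag
    FROM `project.dataset.table`
  )
)
GROUP BY Device_ID, grp
```

On the sample data this should produce the same six rows and timestamps as the output above, with Session_ID values I90HYTRFJI, 32QE8Y76TG, GY1EW32876, 098U6598U5, UWI4UII2J4 and G3247ROIUH in place of AAAAA1 through CCCCC2.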