SQL session ID generation by two columns
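I am generating session IDs in SQL for a table containing user, group, and event time. A session is defined as a 10-minute window. My current implementation generates session IDs; however, there is one caveat: a user may belong to multiple groups, and this is not reflected in the session ID assignment.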
Example schema:
userid | group | event_time
001 | A | 2020-06-20 02:04:50.000
001 | A | 2020-06-20 02:06:12.000
001 | A | 2020-06-20 02:17:16.000
001 | B | 2020-06-20 02:20:10.000
001 | A | 2020-06-20 02:28:13.000
002 | A | 2020-06-20 04:13:97.000
SQL snippet:
WITH tmp_table AS (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY userid
            ORDER BY event_time
        ) AS user_row,
        LAG(userid) OVER (
            PARTITION BY userid
            ORDER BY event_time
        ) AS prev_user,
        LAG(event_time) OVER (
            PARTITION BY userid
            ORDER BY event_time
        ) AS prev_action
    FROM table
    ORDER BY
        userid,
        event_time
)
SELECT
    *,
    CASE
        WHEN prev_user = user_row AND DATE_DIFF('minute', prev_action, event_time) < 10
        THEN LAG(user_row) OVER (
            PARTITION BY userid
            ORDER BY user_row
        )
        ELSE user_row
    END AS session_id
FROM tmp_table
However, this produces
userid | group | event_time | session_id
001 | A | 2020-06-20 02:04:50.000 | 1
001 | A | 2020-06-20 02:06:12.000 | 1
001 | A | 2020-06-20 02:17:16.000 | 2
001 | B | 2020-06-20 02:20:10.000 | 2
001 | A | 2020-06-20 02:28:13.000 | 2
002 | A | 2020-06-20 04:13:97.000 | 1
when it should be
userid | group | event_time | session_id
001 | A | 2020-06-20 02:04:50.000 | 1
001 | A | 2020-06-20 02:06:12.000 | 1
001 | A | 2020-06-20 02:17:16.000 | 2
001 | B | 2020-06-20 02:20:10.000 | 1
001 | A | 2020-06-20 02:28:13.000 | 3
002 | A | 2020-06-20 04:13:97.000 | 1
because userid 001 belongs to both A and B, and what happens in A is independent of what happens in B.
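You can simplify the session calculation. Just look at the previous event time for each userid/group combination. Then start a new session when the difference is greater than or equal to 10 minutes: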
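-- "group" is a reserved word in most dialects, so it is quoted here; table is a placeholder name.
WITH tmp_table AS (
    SELECT t.*,
           LAG(event_time) OVER (PARTITION BY userid, "group" ORDER BY event_time) AS prev_event_time
    FROM table t
)
SELECT t.*,
       -- A NULL prev_event_time (first event of a partition) falls through to ELSE and starts session 1.
       SUM(CASE WHEN DATE_DIFF('minute', prev_event_time, event_time) < 10
                THEN 0 ELSE 1
           END) OVER (PARTITION BY userid, "group" ORDER BY event_time) AS session_id
FROM tmp_table t;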
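To check this approach against the sample data, here is a minimal sketch that runs the same idea in SQLite through Python's sqlite3 module. Several things are assumptions rather than part of the original post: the table name `events`, the use of SQLite (window functions need SQLite 3.25 or newer), and the whole-minute difference computed from `strftime('%s', ...)` as a stand-in for `DATE_DIFF('minute', ...)`. The 002 row's seconds value (97) is not a valid time, so a placeholder timestamp is substituted. The running sum is partitioned by userid and "group" so that numbering restarts for each group, which is what the expected output shows.

```python
import sqlite3

# Sample rows from the question. The last timestamp is a placeholder
# (assumption): the original seconds value, 97, is not a valid time.
rows = [
    ("001", "A", "2020-06-20 02:04:50"),
    ("001", "A", "2020-06-20 02:06:12"),
    ("001", "A", "2020-06-20 02:17:16"),
    ("001", "B", "2020-06-20 02:20:10"),
    ("001", "A", "2020-06-20 02:28:13"),
    ("002", "A", "2020-06-20 04:13:57"),
]

conn = sqlite3.connect(":memory:")
# "events" is a hypothetical table name; "group" must be quoted (reserved word).
conn.execute('CREATE TABLE events (userid TEXT, "group" TEXT, event_time TEXT)')
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)

query = """
WITH tmp_table AS (
    SELECT e.*,
           LAG(event_time) OVER (
               PARTITION BY userid, "group" ORDER BY event_time
           ) AS prev_event_time
    FROM events e
)
SELECT userid, "group", event_time,
       -- Whole minutes between events; NULL for the first event of a
       -- partition, which falls through to ELSE and opens a session.
       SUM(CASE WHEN (strftime('%s', event_time)
                      - strftime('%s', prev_event_time)) / 60 < 10
                THEN 0 ELSE 1 END)
           OVER (PARTITION BY userid, "group" ORDER BY event_time) AS session_id
FROM tmp_table
ORDER BY userid, event_time
"""

result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

With these rows, the session_id column comes out as 1, 1, 2, 1, 3 for userid 001 and 1 for userid 002, matching the expected table in the question.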
I'm not sure how your code is supposed to work. But if you want to recount for each group, I would expect group to be in the partition by.
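The effect of putting group in the partition of the running count can be seen in a small plain-Python sketch. The function name `session_ids` and its `count_per_group` flag are hypothetical, the gap is truncated to whole minutes as an assumption about how the SQL engine measures it, and the 002 timestamp is a placeholder because the original seconds value (97) is invalid. In both modes the gap is measured against the previous event of the same userid/group pair; only the scope of the running count changes.

```python
from collections import defaultdict
from datetime import datetime

# Sample rows from the question; the last timestamp is a placeholder (assumption).
events = [
    ("001", "A", "2020-06-20 02:04:50"),
    ("001", "A", "2020-06-20 02:06:12"),
    ("001", "A", "2020-06-20 02:17:16"),
    ("001", "B", "2020-06-20 02:20:10"),
    ("001", "A", "2020-06-20 02:28:13"),
    ("002", "A", "2020-06-20 04:13:57"),
]


def session_ids(events, count_per_group):
    """Return one session number per event, ordered by (userid, event_time)."""
    last_seen = {}               # (userid, group) -> previous event time
    counters = defaultdict(int)  # running count of session starts
    out = []
    for userid, group, ts in sorted(events, key=lambda e: (e[0], e[2])):
        t = datetime.fromisoformat(ts)
        prev = last_seen.get((userid, group))
        # A new session starts on the first event or after a gap of 10+ whole minutes.
        starts_new = prev is None or (t - prev).total_seconds() // 60 >= 10
        # Choose the scope of the running count: per user, or per user and group.
        key = (userid, group) if count_per_group else userid
        counters[key] += int(starts_new)
        last_seen[(userid, group)] = t
        out.append(counters[key])
    return out


print(session_ids(events, count_per_group=False))  # [1, 1, 2, 3, 4, 1]
print(session_ids(events, count_per_group=True))   # [1, 1, 2, 1, 3, 1]
```

Counting per user alone keeps incrementing across groups (3 for the B event, 4 for the last A event), while counting per user and group reproduces the expected session IDs.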