Efficient query to get step durations from an event log table in SQL into an accumulating snapshot fact

This example was built in SQL Server 2016, but it should also work in MySQL 8.x.

I have event log data stored in the table fact_user_event_activity, with sample data as follows:

event_date_key  user_key    step_key    session_id  event_timestamp
20140411        123         1           1000        2014-04-11 08:00:00.000
20140411        123         2           1000        2014-04-11 08:10:00.000
20140411        123         3           1000        2014-04-11 08:20:00.000
20140411        123         4           1000        2014-04-11 08:30:00.000
20140411        125         1           1001        2014-04-11 09:10:00.000
20140411        123         5           1000        2014-04-11 08:31:00.000
20140411        125         2           1001        2014-04-11 09:30:00.000
20140411        125         3           1001        2014-04-11 09:50:00.000  <-- 
20140411        125         3           1001        2014-04-11 09:51:00.000  <--
20140411        125         4           1001        2014-04-11 09:52:00.000

Assumptions

Expected

What is the most efficient way to query the following?

user_key  session_id  step_1_duration_mins  step_2_duration_mins  step_3_duration_mins  step_4_duration_mins
123       1000        10                    10                    10                    1
125       1001        20                    20                    2                     NULL

This will be used as the ETL query for an accumulating snapshot.

Setup

DROP TABLE IF EXISTS  [fact_user_event_activity]
;
CREATE TABLE [fact_user_event_activity] (
  [event_date_key] INT DEFAULT NULL,
  [user_key] BIGINT NOT NULL,
  [step_key] BIGINT NOT NULL,
  [session_id] BIGINT NOT NULL,
  [event_timestamp] datetime NOT NULL
)
;
INSERT INTO [fact_user_event_activity]
VALUES (20140411, 123, 1, 1000, N'2014-04-11 08:00:00'),
(20140411, 123, 2, 1000, N'2014-04-11 08:10:00'),
(20140411, 123, 3, 1000, N'2014-04-11 08:20:00'),
(20140411, 123, 4, 1000, N'2014-04-11 08:30:00'),
(20140411, 125, 1, 1001, N'2014-04-11 09:10:00'),
(20140411, 123, 5, 1000, N'2014-04-11 08:31:00'),
(20140411, 125, 2, 1001, N'2014-04-11 09:30:00'),
(20140411, 125, 3, 1001, N'2014-04-11 09:50:00'),
(20140411, 125, 3, 1001, N'2014-04-11 09:51:00'),
(20140411, 125, 4, 1001, N'2014-04-11 09:52:00'),
(20140411, 129, 1, 1005, N'2014-04-11 09:08:00'),
(20140411, 129, 2, 1005, N'2014-04-11 09:10:00'),
(20140411, 129, 3, 1005, N'2014-04-11 09:12:00'),
(20140411, 129, 3, 1005, N'2014-04-11 09:13:00'),
(20140411, 129, 4, 1005, N'2014-04-11 09:14:00'),
(20140411, 129, 5, 1005, N'2014-04-11 09:18:00')
;

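My attempt

To make the code easier to follow, I process it in two steps:

  1. Get the duration of each step from the start (session start)
  2. Calculate the difference between each step's duration_from_start

This returns what I expect, but I suspect I am overcomplicating things. It will run against ~500M records, so I would like to know whether there is a better way, or whether I am missing something.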

-- Step 1
-- to improve performance, use temp table instead of CTE
-- Use TIMESTAMPDIFF in MySQL instead of DATEDIFF
WITH durations_from_start_tmp AS
    (
    SELECT session_id, user_key, FIRST_VALUE(fuea.event_timestamp) OVER(PARTITION BY user_key, fuea.session_id ORDER BY fuea.event_timestamp) first_login,
    DENSE_RANK() OVER(PARTITION BY user_key, step_key, fuea.session_id ORDER BY fuea.event_timestamp) AS rnk,
    CASE WHEN step_key = 2 THEN DATEDIFF(MINUTE, FIRST_VALUE(fuea.event_timestamp) OVER(PARTITION BY user_key, fuea.session_id ORDER BY fuea.event_timestamp), fuea.event_timestamp) END AS step_1_duration_from_start,
    CASE WHEN step_key = 3 THEN DATEDIFF(MINUTE, FIRST_VALUE(fuea.event_timestamp) OVER(PARTITION BY user_key, fuea.session_id ORDER BY fuea.event_timestamp), fuea.event_timestamp) END AS step_2_duration_from_start,
    CASE WHEN step_key = 4 THEN DATEDIFF(MINUTE, FIRST_VALUE(fuea.event_timestamp) OVER(PARTITION BY user_key, fuea.session_id ORDER BY fuea.event_timestamp), fuea.event_timestamp) END AS step_3_duration_from_start,
    CASE WHEN step_key = 5 THEN DATEDIFF(MINUTE, FIRST_VALUE(fuea.event_timestamp) OVER(PARTITION BY user_key, fuea.session_id ORDER BY fuea.event_timestamp), fuea.event_timestamp) END AS step_4_duration_from_start
    FROM [fact_user_event_activity] fuea
    --WHERE event_timestamp > watermark --for incremental load
    )

-- Step 2
SELECT user_key, session_id, SUM(step_1_duration_from_start) AS step_1_duration_mins,
 SUM(step_2_duration_from_start) - SUM(step_1_duration_from_start) AS step_2_duration_mins ,
 SUM(step_3_duration_from_start) - SUM(step_2_duration_from_start) AS step_3_duration_mins ,
 SUM(step_4_duration_from_start) - SUM(step_3_duration_from_start) AS step_4_duration_mins
 FROM durations_from_start_tmp
 -- deals with repeated steps
 WHERE rnk = 1
 GROUP BY  user_key, session_id
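
On the commented-out watermark filter: filtering the events themselves by event_timestamp > watermark would drop the earlier events of any session that started before the watermark, so the durations for such a session could not be computed from the remaining rows. A minimal sketch of one alternative, assuming whole sessions are reprocessed whenever they receive a new event. The variable @watermark, its value, and the CTE name touched_sessions are hypothetical:

```sql
-- Hypothetical watermark; in practice this would come from the ETL control table.
DECLARE @watermark datetime = '2014-04-11 09:00:00';

-- Sessions that received at least one event after the watermark.
WITH touched_sessions AS
    (
    SELECT DISTINCT user_key, session_id
    FROM [fact_user_event_activity]
    WHERE event_timestamp > @watermark
    )
-- Re-read every event of those sessions, including events before the watermark,
-- and keep the first timestamp per step (this also deals with repeated steps).
SELECT fuea.user_key, fuea.session_id, fuea.step_key,
       MIN(fuea.event_timestamp) AS first_step_timestamp
FROM [fact_user_event_activity] fuea
JOIN touched_sessions ts
  ON ts.user_key = fuea.user_key
 AND ts.session_id = fuea.session_id
GROUP BY fuea.user_key, fuea.session_id, fuea.step_key;
```

With the setup data and this watermark value, only sessions 1001 and 1005 qualify, because every event of session 1000 is earlier than 09:00.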

References

This may not be relevant to getting an answer, but just in case you are not familiar with the data modeling concept:

Accumulating Snapshots Definition

So, one approach you could take is to add an index (assuming you are able to add one), such as:

CREATE INDEX [SomeIndexName] ON [fact_user_event_activity] (user_key, session_id, step_key, event_timestamp);

(Or, if you are worried about the index size with 500M rows, you could make step_key and event_timestamp INCLUDE columns instead of key columns.)
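
The INCLUDE variant mentioned above might look like the following sketch. The index name is hypothetical, and INCLUDE is SQL Server syntax that MySQL does not offer:

```sql
-- Key columns match the GROUP BY; step_key and event_timestamp are stored
-- only at the leaf level, so the query is still covered by the index.
CREATE INDEX [IX_fuea_user_session]   -- hypothetical name
    ON [fact_user_event_activity] (user_key, session_id)
    INCLUDE (step_key, event_timestamp);
```

The trade-off is that the included columns are not ordered within each (user_key, session_id) group, which should not matter for the MIN(CASE ...) aggregation used here.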

Then skip the window functions altogether with a query like the following:

SELECT user_key,
       session_id,
       step_1_duration = DATEDIFF(MINUTE, step_1_timestamp, step_2_timestamp),
       step_2_duration = DATEDIFF(MINUTE, step_2_timestamp, step_3_timestamp),
       step_3_duration = DATEDIFF(MINUTE, step_3_timestamp, step_4_timestamp),
       step_4_duration = DATEDIFF(MINUTE, step_4_timestamp, step_5_timestamp)
FROM 
(
    SELECT user_key, session_id,
           step_1_timestamp = MIN(CASE WHEN step_key = 1 THEN event_timestamp END),
           step_2_timestamp = MIN(CASE WHEN step_key = 2 THEN event_timestamp END),
           step_3_timestamp = MIN(CASE WHEN step_key = 3 THEN event_timestamp END),
           step_4_timestamp = MIN(CASE WHEN step_key = 4 THEN event_timestamp END),
           step_5_timestamp = MIN(CASE WHEN step_key = 5 THEN event_timestamp END)
    FROM fact_user_event_activity
    GROUP BY user_key, session_id
) AS T;

(In theory this should need only an index scan, without any sort.)
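
Since the question also mentions MySQL 8.x, here is a sketch of how the same conditional aggregation might be written there, assuming the same table and column names. MySQL does not accept the alias = expression form or bracketed identifiers, and it uses TIMESTAMPDIFF in place of DATEDIFF:

```sql
-- MySQL 8.x sketch of the same pivot-then-subtract query.
SELECT user_key,
       session_id,
       TIMESTAMPDIFF(MINUTE, step_1_timestamp, step_2_timestamp) AS step_1_duration,
       TIMESTAMPDIFF(MINUTE, step_2_timestamp, step_3_timestamp) AS step_2_duration,
       TIMESTAMPDIFF(MINUTE, step_3_timestamp, step_4_timestamp) AS step_3_duration,
       TIMESTAMPDIFF(MINUTE, step_4_timestamp, step_5_timestamp) AS step_4_duration
FROM
(
    -- First timestamp per step; MIN ignores the NULLs produced by CASE.
    SELECT user_key, session_id,
           MIN(CASE WHEN step_key = 1 THEN event_timestamp END) AS step_1_timestamp,
           MIN(CASE WHEN step_key = 2 THEN event_timestamp END) AS step_2_timestamp,
           MIN(CASE WHEN step_key = 3 THEN event_timestamp END) AS step_3_timestamp,
           MIN(CASE WHEN step_key = 4 THEN event_timestamp END) AS step_4_timestamp,
           MIN(CASE WHEN step_key = 5 THEN event_timestamp END) AS step_5_timestamp
    FROM fact_user_event_activity
    GROUP BY user_key, session_id
) AS t;
```

One difference to keep in mind: SQL Server's DATEDIFF counts minute boundaries crossed, while TIMESTAMPDIFF counts whole minutes elapsed, so the two can differ by one when timestamps carry seconds. They agree on the sample data, where every timestamp falls on a whole minute.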