SQL - 如何根据子组内的其他值有效地对子组中的记录进行分类?

SQL - how to efficiently categorize records in sub groups based on other values within the sub group?

我有一个 table 如下所示。记录按 user_id 和 event_time.

排序
Row    User_ID    Event_Time    Event_Type    
1      1          2020-01-01    View
2      1          2020-01-02    Click
3      1          2020-01-03    Purchase
4      2          2020-02-01    View
5      2          2020-02-02    Click
6      2          2020-02-03    View
7      2          2020-02-04    Purchase
8      2          2020-02-11    View
9      2          2020-02-12    Purchase
10     2          2020-02-21    View
11     2          2020-02-22    Click
12     2          2020-02-23    Purchase
13     2          2020-02-27    View
14     2          2020-02-28    Click
15     3          2020-03-01    View
16     3          2020-03-02    Purchase
...

我想添加一个名为路径的新列来对非购买事件进行分类。一个用户"belongs"的每个非购买事件到紧接着发生的同一用户的立即购买事件,这意味着它们可以被视为一个子组。在每个子组中:

  • 第一个非购买事件是介绍人(第1、4、10行)
  • 最后一个非购买事件是Closer(第2、6、11行)
  • 介绍人和 Closer 之间的所有非购买事件都是 影响者(第 5 行)
  • 如果一个购买事件只有一个非购买事件与之分组,则非购买事件是(第 8、15 行)
  • 购买事件填NULL(第3、7、9、12、16行)
  • 如果非购买事件不属于任何购买事件(第 13、14 行),则填写 NULL

所以 table 添加列后应该如下所示:

Row    User_ID    Event_Time    Event_Type    Path
1      1          2020-01-01    View          Introducer
2      1          2020-01-02    Click         Closer
3      1          2020-01-03    Purchase      NULL
4      2          2020-02-01    View          Introducer
5      2          2020-02-02    Click         Influencer
6      2          2020-02-03    View          Closer
7      2          2020-02-04    Purchase      NULL
8      2          2020-02-11    View          Only
9      2          2020-02-12    Purchase      NULL
10     2          2020-02-21    View          Introducer
11     2          2020-02-22    Click         Closer
12     2          2020-02-23    Purchase      NULL
13     2          2020-02-27    View          NULL
14     2          2020-02-28    Click         NULL
15     3          2020-03-01    View          Only
16     3          2020-03-02    Purchase      NULL
...

如果我自行加入并添加一个新列以帮助确定用户最后一次购买每个事件的时间,则解决方案很简单。但是,我有超过 1 亿条记录,self-join 效率不够。执行最终会超时。所以我的问题是,是否有更有效的方法来添加这个新列?我正在考虑使用相关查询,但似乎无法理解它。

如果您使用的 DBMS 支持 window 函数,您可以使用几个 CTE 首先将行拆分为不同的购买,然后找到与每个购买相关的行号,然后最后根据您给出的条件计算 Path

WITH purchases AS (
  SELECT "Row", User_ID, Event_Time, Event_Type,
         COALESCE(SUM(CASE WHEN Event_Type = 'Purchase' THEN 1 ELSE 0 END) OVER
           (PARTITION BY User_ID ORDER BY Event_Time ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING), 0) AS pnum
  FROM events
),
prows AS (
  SELECT "Row", User_ID, Event_Time, Event_Type, pnum,
         ROW_NUMBER() OVER (PARTITION BY User_ID, pnum ORDER BY Event_Time) AS rn,
         ROW_NUMBER() OVER (PARTITION BY User_ID, pnum ORDER BY Event_Time DESC) AS drn
  FROM purchases
)
SELECT "Row", User_ID, Event_Time, Event_Type,
       CASE WHEN Event_Type = 'Purchase' OR
                 NOT EXISTS (SELECT * 
                             FROM prows r2 
                             WHERE r2.User_ID = r1.User_ID
                               AND r2.pnum = r1.pnum
                               AND r2.Event_Type = 'Purchase') THEN NULL
            WHEN rn = 1 AND drn = 2 THEN 'Only'
            WHEN rn = 1 THEN 'Introducer'
            WHEN drn = 2 THEN 'Closer'
            ELSE 'Influencer'
       END AS Path
FROM prows r1
ORDER BY User_ID, Event_Time

输出:

Row     User_ID     Event_Time  Event_Type  Path
1       1           2020-01-01  View        Introducer
2       1           2020-01-02  Click       Closer
3       1           2020-01-03  Purchase    (null)
4       2           2020-02-01  View        Introducer
5       2           2020-02-02  Click       Influencer
6       2           2020-02-03  View        Closer
7       2           2020-02-04  Purchase    (null)
8       2           2020-02-11  View        Only
9       2           2020-02-12  Purchase    (null)
10      2           2020-02-21  View        Introducer
11      2           2020-02-22  Click       Closer
12      2           2020-02-23  Purchase    (null)
13      2           2020-02-27  View        (null)
14      2           2020-02-28  Click       (null)
15      3           2020-03-01  View        Only
16      3           2020-03-02  Purchase    (null)

SQL Server demo on SQLFiddle。同样的查询也将 运行 在 PostgreSQL 和 Oracle 上。

这与 Nick 的做法类似,但我认为逻辑更简单:

WITH e AS (
      SELECT e.*,
             SUM(CASE WHEN Event_Type = 'Purchase' THEN 1 ELSE 0 END) OVER
                 (PARTITION BY User_ID ORDER BY Event_Time DESC) AS grp
      FROM events e
     ),
     en as (
      SELECT e.*,
             COUNT(*) OVER (PARTITION BY user_id, grp) as cnt,
             ROW_NUMBER() OVER (PARTITION BY user_id, grp ORDER BY Event_Time) as seqnum
      FROM e
     )
SELECT en.*,
       (CASE WHEN grp = 0                   -- no purchase event
             THEN NULL 
             WHEN Event_Type = 'Purchase'   -- the event itself
             THEN NULL
             WHEN seqnum = 1 AND cnt = 2    -- the special case of "ONLY" 
             THEN 'Only'
             WHEN seqnum = 1                -- The first event
             THEN 'Introducer'
             WHEN seqnum = cnt - 1          -- The penultimate event
             THEN 'Closer'
             ELSE 'Influencer'
        END) as Path
FROM en
ORDER BY User_ID, Event_Time;

特别是外查询中的子查询是不必要的。 grp = 0 找到最后一组可能没有购买的事件。我还认为根据事件总数和顺序计数器来编写逻辑更容易。

Here 是一个 db<>fiddle.