SQL - how to efficiently categorize records in sub groups based on other values within the sub group?
I have a table like the one below. The records are sorted by user_id and event_time.
Row User_ID Event_Time Event_Type
1 1 2020-01-01 View
2 1 2020-01-02 Click
3 1 2020-01-03 Purchase
4 2 2020-02-01 View
5 2 2020-02-02 Click
6 2 2020-02-03 View
7 2 2020-02-04 Purchase
8 2 2020-02-11 View
9 2 2020-02-12 Purchase
10 2 2020-02-21 View
11 2 2020-02-22 Click
12 2 2020-02-23 Purchase
13 2 2020-02-27 View
14 2 2020-02-28 Click
15 3 2020-03-01 View
16 3 2020-03-02 Purchase
...
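I want to add a new column named Path that categorizes the non-purchase events. Each non-purchase event of a user "belongs" to the next purchase event by the same user that follows it, so a purchase and the non-purchase events leading up to it can be treated as a sub group. Within each sub group:
- The first non-purchase event is the Introducer (rows 1, 4, 10)
- The last non-purchase event is the Closer (rows 2, 6, 11)
- All non-purchase events between the Introducer and the Closer are Influencers (row 5)
- If a purchase event has only one non-purchase event grouped with it, that non-purchase event is Only (rows 8, 15)
- Purchase events get NULL (rows 3, 7, 9, 12, 16)
- If a non-purchase event does not belong to any purchase event (rows 13, 14), it gets NULL
So after adding the column, the table should look like this: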
Row User_ID Event_Time Event_Type Path
1 1 2020-01-01 View Introducer
2 1 2020-01-02 Click Closer
3 1 2020-01-03 Purchase NULL
4 2 2020-02-01 View Introducer
5 2 2020-02-02 Click Influencer
6 2 2020-02-03 View Closer
7 2 2020-02-04 Purchase NULL
8 2 2020-02-11 View Only
9 2 2020-02-12 Purchase NULL
10 2 2020-02-21 View Introducer
11 2 2020-02-22 Click Closer
12 2 2020-02-23 Purchase NULL
13 2 2020-02-27 View NULL
14 2 2020-02-28 Click NULL
15 3 2020-03-01 View Only
16 3 2020-03-02 Purchase NULL
...
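The solution is straightforward if I do a self-join and add a new column that helps identify, for each event, when the user's last purchase was. However, I have more than 100 million records and the self-join is not efficient enough; the execution eventually times out. So my question is: is there a more efficient way to add this new column? I was thinking of using a correlated query but couldn't get my head around it.
If your DBMS supports window functions, you can use a couple of CTEs to first split the rows into the different purchases, then find the row numbers relative to each purchase, and finally compute Path based on the conditions you gave: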
WITH purchases AS (
SELECT "Row", User_ID, Event_Time, Event_Type,
COALESCE(SUM(CASE WHEN Event_Type = 'Purchase' THEN 1 ELSE 0 END) OVER
(PARTITION BY User_ID ORDER BY Event_Time ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING), 0) AS pnum
FROM events
),
prows AS (
SELECT "Row", User_ID, Event_Time, Event_Type, pnum,
ROW_NUMBER() OVER (PARTITION BY User_ID, pnum ORDER BY Event_Time) AS rn,
ROW_NUMBER() OVER (PARTITION BY User_ID, pnum ORDER BY Event_Time DESC) AS drn
FROM purchases
)
SELECT "Row", User_ID, Event_Time, Event_Type,
CASE WHEN Event_Type = 'Purchase' OR
NOT EXISTS (SELECT *
FROM prows r2
WHERE r2.User_ID = r1.User_ID
AND r2.pnum = r1.pnum
AND r2.Event_Type = 'Purchase') THEN NULL
WHEN rn = 1 AND drn = 2 THEN 'Only'
WHEN rn = 1 THEN 'Introducer'
WHEN drn = 2 THEN 'Closer'
ELSE 'Influencer'
END AS Path
FROM prows r1
ORDER BY User_ID, Event_Time
Output:
Row User_ID Event_Time Event_Type Path
1 1 2020-01-01 View Introducer
2 1 2020-01-02 Click Closer
3 1 2020-01-03 Purchase (null)
4 2 2020-02-01 View Introducer
5 2 2020-02-02 Click Influencer
6 2 2020-02-03 View Closer
7 2 2020-02-04 Purchase (null)
8 2 2020-02-11 View Only
9 2 2020-02-12 Purchase (null)
10 2 2020-02-21 View Introducer
11 2 2020-02-22 Click Closer
12 2 2020-02-23 Purchase (null)
13 2 2020-02-27 View (null)
14 2 2020-02-28 Click (null)
15 3 2020-03-01 View Only
16 3 2020-03-02 Purchase (null)
SQL Server demo on SQLFiddle. The same query will also run on PostgreSQL and Oracle.
This is similar to Nick's approach, but I think the logic is simpler:
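To see how the running purchase count splits a user's rows into sub groups, here is a minimal sketch that runs only the pnum expression against user 2's sample rows. It assumes Python's built-in sqlite3 module linked against SQLite 3.25 or later (the first version with window functions), and it renames the "Row" column to row_id for convenience; neither is part of the original answer.

```python
import sqlite3

# User 2's events from the question (row_id stands in for the "Row" column;
# the name is an assumption made for this sketch).
rows = [
    (4, 2, "2020-02-01", "View"),
    (5, 2, "2020-02-02", "Click"),
    (6, 2, "2020-02-03", "View"),
    (7, 2, "2020-02-04", "Purchase"),
    (8, 2, "2020-02-11", "View"),
    (9, 2, "2020-02-12", "Purchase"),
    (10, 2, "2020-02-21", "View"),
    (11, 2, "2020-02-22", "Click"),
    (12, 2, "2020-02-23", "Purchase"),
    (13, 2, "2020-02-27", "View"),
    (14, 2, "2020-02-28", "Click"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events ("
    "row_id INTEGER, user_id INTEGER, event_time TEXT, event_type TEXT)"
)
conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", rows)

# pnum = number of purchases strictly before the current row, so a purchase
# shares its pnum with the non-purchase events that lead up to it.
query = """
SELECT row_id,
       COALESCE(SUM(CASE WHEN event_type = 'Purchase' THEN 1 ELSE 0 END) OVER (
           PARTITION BY user_id ORDER BY event_time
           ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING), 0) AS pnum
FROM events
ORDER BY user_id, event_time
"""
pnum_by_row = dict(conn.execute(query).fetchall())
print(pnum_by_row)
```

Rows 4 to 7 share pnum 0, rows 8 and 9 share pnum 1, rows 10 to 12 share pnum 2, and the trailing rows 13 and 14 land in pnum 3, a group that contains no purchase. That last group is why the outer query needs the NOT EXISTS check.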
WITH e AS (
SELECT e.*,
SUM(CASE WHEN Event_Type = 'Purchase' THEN 1 ELSE 0 END) OVER
(PARTITION BY User_ID ORDER BY Event_Time DESC) AS grp
FROM events e
),
en as (
SELECT e.*,
COUNT(*) OVER (PARTITION BY user_id, grp) as cnt,
ROW_NUMBER() OVER (PARTITION BY user_id, grp ORDER BY Event_Time) as seqnum
FROM e
)
SELECT en.*,
(CASE WHEN grp = 0 -- no purchase event
THEN NULL
WHEN Event_Type = 'Purchase' -- the event itself
THEN NULL
WHEN seqnum = 1 AND cnt = 2 -- the special case of "ONLY"
THEN 'Only'
WHEN seqnum = 1 -- The first event
THEN 'Introducer'
WHEN seqnum = cnt - 1 -- The penultimate event
THEN 'Closer'
ELSE 'Influencer'
END) as Path
FROM en
ORDER BY User_ID, Event_Time;
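In particular, the subquery in the outer query is not necessary. The condition grp = 0 finds the final group of events, which may have no purchase. I also think it is easier to write the logic in terms of the total number of events and a sequential counter.
Here is a db<>fiddle.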
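As a quick check of this version against the sample data in the question, the sketch below loads all 16 rows into an in-memory SQLite database and runs the same query shape. It assumes Python's built-in sqlite3 module with SQLite 3.25 or later for window functions; the row_id column name and the ev alias are choices made for this sketch, not part of the answer.

```python
import sqlite3

# Sample data from the question: (row_id, user_id, event_time, event_type).
rows = [
    (1, 1, "2020-01-01", "View"), (2, 1, "2020-01-02", "Click"),
    (3, 1, "2020-01-03", "Purchase"), (4, 2, "2020-02-01", "View"),
    (5, 2, "2020-02-02", "Click"), (6, 2, "2020-02-03", "View"),
    (7, 2, "2020-02-04", "Purchase"), (8, 2, "2020-02-11", "View"),
    (9, 2, "2020-02-12", "Purchase"), (10, 2, "2020-02-21", "View"),
    (11, 2, "2020-02-22", "Click"), (12, 2, "2020-02-23", "Purchase"),
    (13, 2, "2020-02-27", "View"), (14, 2, "2020-02-28", "Click"),
    (15, 3, "2020-03-01", "View"), (16, 3, "2020-03-02", "Purchase"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events ("
    "row_id INTEGER, user_id INTEGER, event_time TEXT, event_type TEXT)"
)
conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", rows)

query = """
WITH e AS (
    -- grp counts purchases at or after the current row (descending order),
    -- so grp = 0 means no purchase follows this event.
    SELECT ev.*,
           SUM(CASE WHEN event_type = 'Purchase' THEN 1 ELSE 0 END) OVER (
               PARTITION BY user_id ORDER BY event_time DESC) AS grp
    FROM events ev
),
en AS (
    SELECT e.*,
           COUNT(*) OVER (PARTITION BY user_id, grp) AS cnt,
           ROW_NUMBER() OVER (
               PARTITION BY user_id, grp ORDER BY event_time) AS seqnum
    FROM e
)
SELECT row_id,
       CASE WHEN grp = 0 THEN NULL
            WHEN event_type = 'Purchase' THEN NULL
            WHEN seqnum = 1 AND cnt = 2 THEN 'Only'
            WHEN seqnum = 1 THEN 'Introducer'
            WHEN seqnum = cnt - 1 THEN 'Closer'
            ELSE 'Influencer'
       END AS path
FROM en
ORDER BY user_id, event_time
"""
path_by_row = dict(conn.execute(query).fetchall())
for row_id, path in path_by_row.items():
    print(row_id, path)
```

Every row is read once per window, with no self-join and no correlated subquery. On a table with 100 million rows the actual cost would still depend on the engine and on whether an index on (user_id, event_time) is available, which this sketch does not measure.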