How to query rows grouped by matching string column but only count the most recent row for a specific set of keywords?

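I have a table of email events in which each row is keyed by a specific outgoing email record (fk) and a specific recipient user (fk). New records can be inserted into this table at any time, in no particular order, and even simultaneously from different threads. The relevant columns are: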

id (pk), email_id (fk), user_id (fk), event (string/name), created_at

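I am computing overall event counts for a given email, for example how many emails were sent, how many bounced, and so on. However, I need to ignore certain combinations of email events for a given user once a newer event has superseded them. For example, if a row says the email was 'deferred' for a particular user, but a later row says 'delivered' or 'bounced', then I only want the most recently added row among these related keywords to count as the current status.

What is a good way to do this at read time? I am having trouble because I need several levels of grouping and have hit the limit of my SQL chops. This is the query I am trying to enhance, as described below: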

select `event`, COUNT(1) as count, COUNT(DISTINCT user_id) as unique_count
from `email_activity`
where `email_id` = 7518
group by `event`

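For most events I want every row counted, with no replacement, so grouping by event alone is fine in those cases. For example, 'click' or 'open' events should simply be added up.

However, if there are any number of 'deferred', 'bounced', or 'delivered' events for the same email_id/user_id, I only want to count the one with the most recent created_at date and ignore all the older ones.

Sample row set (email_id, event, user_id, created_at):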

7518, "click", 25, 1-20-2021
7518, "click", 73, 1-5-2021
7518, "bounced", 45, 1-19-2021
7518, "deferred", 45, 1-17-2021
7518, "delivered", 19, 1-1-2021
7518, "delivered", 25, 1-1-2021
7518, "delivered", 73, 1-1-2021

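So the query counts for email 7518 would be:

2 "click", 3 "delivered", and 1 "bounced". The "deferred" row for user 45 is ignored because it is older than that user's "bounced" row. Only the "bounced", "deferred", and "delivered" events are subject to this "count only the latest" rule; all other event names are always counted.

I did this in a Postgres database as shown below. Step 4 is the main query and can be used directly. I added the first 3 steps only to make each subquery easier to understand.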
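To make the rule concrete, here is a minimal Python sketch of the counting logic applied to the sample rows above. The names `STATUS_EVENTS` and `count_events` are made up for illustration, and the dates are rewritten as ISO strings (an assumption) so that plain string comparison orders them correctly.

```python
from collections import Counter

# Events subject to the "count only the latest" rule.
STATUS_EVENTS = {"bounced", "deferred", "delivered"}

# (email_id, event, user_id, created_at); ISO dates compare correctly as strings.
rows = [
    (7518, "click", 25, "2021-01-20"),
    (7518, "click", 73, "2021-01-05"),
    (7518, "bounced", 45, "2021-01-19"),
    (7518, "deferred", 45, "2021-01-17"),
    (7518, "delivered", 19, "2021-01-01"),
    (7518, "delivered", 25, "2021-01-01"),
    (7518, "delivered", 73, "2021-01-01"),
]


def count_events(rows):
    """Count events for the rows of a single email (hypothetical helper)."""
    counts = Counter()
    latest_status = {}  # (email_id, user_id) -> (created_at, event)
    for email_id, event, user_id, created_at in rows:
        if event in STATUS_EVENTS:
            key = (email_id, user_id)
            # Keep only the newest status row per (email, user).
            if key not in latest_status or created_at > latest_status[key][0]:
                latest_status[key] = (created_at, event)
        else:
            # Every other event is always counted.
            counts[event] += 1
    for _created_at, event in latest_status.values():
        counts[event] += 1
    return dict(counts)


result = count_events(rows)
print(result)
```

The 'deferred' row for user 45 never reaches the final tally because the newer 'bounced' row replaces it in `latest_status`.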

    create table email_event(id serial, email_id integer, user_id integer, event varchar(10), created_at date);
    
    insert into email_event(email_id, event, user_id, created_at) values(7518, 'click', 25, '1-20-2021');
    insert into email_event(email_id, event, user_id, created_at) values(7518, 'click', 73, '1-5-2021');
    insert into email_event(email_id, event, user_id, created_at) values(7518, 'bounced', 45, '1-19-2021');
    insert into email_event(email_id, event, user_id, created_at) values(7518, 'deferred', 45, '1-17-2021');
    insert into email_event(email_id, event, user_id, created_at) values(7518, 'delivered', 19, '1-1-2021');
    insert into email_event(email_id, event, user_id, created_at) values(7518, 'delivered', 25, '1-1-2021');
    insert into email_event(email_id, event, user_id, created_at) values(7518, 'delivered', 73, '1-1-2021');

  1. First, we flag the event category:

     select email_id, user_id,event, created_at,
     case when event in ('bounced', 'deferred', 'delivered') then 'Y' else 'N' end as pick_latest_flag
     from email_event;
    

  2. Then we partition by email_id, user_id, and pick_latest_flag, and rank the rows in each partition from newest to oldest.

     select a.*,
     row_number () over (partition by email_id, user_id, pick_latest_flag order by created_at desc) rn
     from (
     select email_id, user_id,event, created_at,
     case when event in ('bounced', 'deferred', 'delivered') then 'Y' else 'N' end as pick_latest_flag 
     from email_event 
     ) A;
    

  3. Then we use the row number to filter out the older rows among the pick_latest_flag = 'Y' records.

     select * from (
     select a.*,
     row_number () over (partition by email_id, user_id, pick_latest_flag order by created_at desc) rn
     from (
     select email_id, user_id,event, created_at,
     case when event in ('bounced', 'deferred', 'delivered') then 'Y' else 'N' end as pick_latest_flag 
     from email_event 
     ) A 
     ) b where pick_latest_flag = 'N' or (pick_latest_flag = 'Y' and rn = 1);
    

  4. In the final step, group the remaining rows by email_id and event:

     select email_id, event, count(*) from 
     (
     select * from (
     select a.*,
     row_number () over (partition by email_id, user_id, pick_latest_flag order by created_at desc) rn
     from (
     select email_id, user_id,event, created_at,
     case when event in ('bounced', 'deferred', 'delivered') then 'Y' else 'N' end as pick_latest_flag 
     from email_event 
     ) A 
     ) b where pick_latest_flag = 'N' or (pick_latest_flag = 'Y' and rn = 1)
     ) c group by email_id,event order by event;
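As a rough check, the step 4 logic can be run against an in-memory SQLite database from Python. This assumes SQLite 3.25 or newer for window functions; the answer itself used Postgres. In this sketch the window is partitioned by email_id as well as user_id, as in step 2, and one extra row for a hypothetical email 9999 is added to show that a newer status row on another email does not hide user 45's 'bounced' row on email 7518.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table email_event("
    "id integer primary key, email_id integer, user_id integer, "
    "event text, created_at text)"
)
rows = [
    (7518, "click", 25, "2021-01-20"),
    (7518, "click", 73, "2021-01-05"),
    (7518, "bounced", 45, "2021-01-19"),
    (7518, "deferred", 45, "2021-01-17"),
    (7518, "delivered", 19, "2021-01-01"),
    (7518, "delivered", 25, "2021-01-01"),
    (7518, "delivered", 73, "2021-01-01"),
    (9999, "delivered", 45, "2021-02-01"),  # hypothetical extra row, not in the question
]
conn.executemany(
    "insert into email_event(email_id, event, user_id, created_at) values (?, ?, ?, ?)",
    rows,
)

query = """
select email_id, event, count(*)
from (
    select a.*,
           row_number() over (
               partition by email_id, user_id, pick_latest_flag
               order by created_at desc
           ) as rn
    from (
        select email_id, user_id, event, created_at,
               case when event in ('bounced', 'deferred', 'delivered')
                    then 'Y' else 'N' end as pick_latest_flag
        from email_event
    ) a
) b
where pick_latest_flag = 'N' or (pick_latest_flag = 'Y' and rn = 1)
group by email_id, event
order by email_id, event
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

If email_id were left out of the partition, user 45's status rows from both emails would be ranked together and only the newest one across all emails would survive.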
    

Assuming you pass email_id as a parameter, the following query should give you the expected result:

select count(*) as sum_event, f.event
from email_event f
where f.email_id = 7518
  and not (f.event in ('bounced', 'deferred', 'delivered')
           -- skip a status row when the same user has a newer status row for this email
           and exists (select 1 from email_event e2
                       where e2.email_id = f.email_id and e2.user_id = f.user_id
                         and e2.event in ('bounced', 'deferred', 'delivered')
                         and e2.created_at > f.created_at))
group by f.event;

Output, tested with email_id = 7518:

+-----------+-----------+
| sum_event |   event   |
+-----------+-----------+
|     2     | click     |
+-----------+-----------+
|     1     | bounced   |
+-----------+-----------+
|     3     | delivered |
+-----------+-----------+

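Solution 1:

If the WITH clause and window functions are available, I prefer to split the table into two parts, give each row an appropriate priority, combine the parts with UNION ALL, and finally aggregate only the rows with the highest priority.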

WITH specific_email_activity AS (
    SELECT * FROM email_activity WHERE email_id = 7518
),
specific_email_activity_with_priority AS (
    SELECT
        *,
        1 AS rank_priority
    FROM
        specific_email_activity
    WHERE
        event NOT IN ('deferred', 'bounced', 'delivered')
    UNION ALL
    SELECT
        *,
        ROW_NUMBER () over (PARTITION BY email_id, user_id ORDER BY created_at DESC) AS rank_priority
    FROM
        specific_email_activity
    WHERE
        event IN ('deferred', 'bounced', 'delivered')
)
SELECT
    email_id,
    event,
    COUNT(*) AS count_event,
    COUNT(DISTINCT user_id) AS unique_count_event
FROM
    specific_email_activity_with_priority
WHERE
    rank_priority = 1
GROUP BY
    email_id,
    event
ORDER BY
    email_id,
    event;

Solution 2:

If you cannot use the WITH clause and window functions, try the following code:

SELECT
    email_id,
    event,
    COUNT(*) AS count_event,
    COUNT(DISTINCT user_id) AS unique_count_event
FROM
    (
        SELECT
            *
        FROM
            email_activity
        WHERE
            email_id = 7518
            AND event NOT IN ('deferred', 'bounced', 'delivered')
        UNION ALL
        SELECT
            *
        FROM
            email_activity
        WHERE
            email_id = 7518
            AND event IN ('deferred', 'bounced', 'delivered')
            AND email_activity.created_at = (
                SELECT
                    MAX(created_at)
                FROM
                    email_activity AS ea
                WHERE
                    ea.email_id = email_activity.email_id
                    AND ea.user_id = email_activity.user_id
                    AND ea.event IN ('deferred', 'bounced', 'delivered')
            )
    ) AS t
GROUP BY
    email_id,
    event
ORDER BY
    email_id,
    event;

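Output of Solutions 1 and 2:

+----------+-----------+-------------+--------------------+
| email_id |   event   | count_event | unique_count_event |
+----------+-----------+-------------+--------------------+
|   7518   | bounced   |      1      |         1          |
+----------+-----------+-------------+--------------------+
|   7518   | click     |      2      |         2          |
+----------+-----------+-------------+--------------------+
|   7518   | delivered |      3      |         3          |
+----------+-----------+-------------+--------------------+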
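A quick way to convince yourself that the two approaches agree is to run both against the sample rows. The sketch below uses Python's sqlite3 module (an assumption; it needs SQLite 3.25 or newer for ROW_NUMBER) and condensed versions of the two queries: the window variant keeps the UNION ALL shape, while the subquery variant folds the two branches into a single OR, which is a rewrite for brevity rather than the exact text above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE email_activity("
    "id INTEGER PRIMARY KEY, email_id INT, user_id INT, event TEXT, created_at TEXT)"
)
sample = [
    ("click", 25, "2021-01-20"),
    ("click", 73, "2021-01-05"),
    ("bounced", 45, "2021-01-19"),
    ("deferred", 45, "2021-01-17"),
    ("delivered", 19, "2021-01-01"),
    ("delivered", 25, "2021-01-01"),
    ("delivered", 73, "2021-01-01"),
]
# Load the same rows for two emails so the email_id filter is exercised.
for email_id in (7518, 9999):
    conn.executemany(
        "INSERT INTO email_activity(email_id, event, user_id, created_at) "
        "VALUES (?, ?, ?, ?)",
        [(email_id, event, user_id, day) for event, user_id, day in sample],
    )

WINDOW_SQL = """
WITH ranked AS (
    SELECT email_id, user_id, event, 1 AS rank_priority
    FROM email_activity
    WHERE email_id = :email_id
      AND event NOT IN ('deferred', 'bounced', 'delivered')
    UNION ALL
    SELECT email_id, user_id, event,
           ROW_NUMBER() OVER (PARTITION BY email_id, user_id ORDER BY created_at DESC)
    FROM email_activity
    WHERE email_id = :email_id
      AND event IN ('deferred', 'bounced', 'delivered')
)
SELECT email_id, event, COUNT(*), COUNT(DISTINCT user_id)
FROM ranked
WHERE rank_priority = 1
GROUP BY email_id, event
ORDER BY email_id, event
"""

SUBQUERY_SQL = """
SELECT o.email_id, o.event, COUNT(*), COUNT(DISTINCT o.user_id)
FROM email_activity AS o
WHERE o.email_id = :email_id
  AND (o.event NOT IN ('deferred', 'bounced', 'delivered')
       OR o.created_at = (SELECT MAX(ea.created_at)
                          FROM email_activity AS ea
                          WHERE ea.email_id = o.email_id
                            AND ea.user_id = o.user_id
                            AND ea.event IN ('deferred', 'bounced', 'delivered')))
GROUP BY o.email_id, o.event
ORDER BY o.email_id, o.event
"""

params = {"email_id": 7518}
window_result = conn.execute(WINDOW_SQL, params).fetchall()
subquery_result = conn.execute(SUBQUERY_SQL, params).fetchall()
print(window_result)
print(subquery_result)
```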

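Notes:

  • If more than one row with an event of 'deferred', 'bounced', or 'delivered' shares the same user_id and the same created_at, you will get unexpected results. In that case you must decide which of those rows takes precedence, and the code must be modified to follow that rule.
  • If event can be NULL, you must decide how NULL should be handled in the aggregation, and the code must be modified to follow that rule.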
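As an illustration of the first note, one possible tie-breaking rule is to prefer the row with the larger id when created_at is equal. That rule is purely an assumption here (it treats a higher id as a later insert); the right rule depends on your data. A minimal sketch using Python's sqlite3 module (SQLite 3.25 or newer):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE email_activity("
    "id INTEGER PRIMARY KEY, email_id INT, user_id INT, event TEXT, created_at TEXT)"
)
conn.executemany(
    "INSERT INTO email_activity(email_id, event, user_id, created_at) VALUES (?, ?, ?, ?)",
    [
        (7518, "deferred", 45, "2021-01-17"),   # gets id 1
        (7518, "delivered", 45, "2021-01-17"),  # gets id 2, same created_at
    ],
)

# Assumed rule: on equal created_at, the higher id wins.
latest = conn.execute(
    """
    SELECT event
    FROM (
        SELECT event,
               ROW_NUMBER() OVER (
                   PARTITION BY email_id, user_id
                   ORDER BY created_at DESC, id DESC
               ) AS rn
        FROM email_activity
        WHERE event IN ('deferred', 'bounced', 'delivered')
    ) AS t
    WHERE rn = 1
    """
).fetchall()
print(latest)
```

Without the second sort key, which of the two tied rows receives rn = 1 is not guaranteed.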

Sample table creation:

The sample table can be created with the following SQL:

CREATE TABLE IF NOT EXISTS email_activity(id SERIAL PRIMARY KEY, email_id INT, user_id INT, event VARCHAR(16), created_at DATE);
INSERT INTO email_activity(email_id, event, user_id, created_at) VALUES(7518, 'click', 25, '2021-1-20');
INSERT INTO email_activity(email_id, event, user_id, created_at) VALUES(7518, 'click', 73, '2021-1-5');
INSERT INTO email_activity(email_id, event, user_id, created_at) VALUES(7518, 'bounced', 45, '2021-1-19');
INSERT INTO email_activity(email_id, event, user_id, created_at) VALUES(7518, 'deferred', 45, '2021-1-17');
INSERT INTO email_activity(email_id, event, user_id, created_at) VALUES(7518, 'delivered', 19, '2021-1-1');
INSERT INTO email_activity(email_id, event, user_id, created_at) VALUES(7518, 'delivered', 25, '2021-1-1');
INSERT INTO email_activity(email_id, event, user_id, created_at) VALUES(7518, 'delivered', 73,'2021-1-1');
INSERT INTO email_activity(email_id, event, user_id, created_at) VALUES(9999, 'click', 25, '2021-1-20');
INSERT INTO email_activity(email_id, event, user_id, created_at) VALUES(9999, 'click', 73, '2021-1-5');
INSERT INTO email_activity(email_id, event, user_id, created_at) VALUES(9999, 'bounced', 45, '2021-1-19');
INSERT INTO email_activity(email_id, event, user_id, created_at) VALUES(9999, 'deferred', 45, '2021-1-17');
INSERT INTO email_activity(email_id, event, user_id, created_at) VALUES(9999, 'delivered', 19, '2021-1-1');
INSERT INTO email_activity(email_id, event, user_id, created_at) VALUES(9999, 'delivered', 25, '2021-1-1');
INSERT INTO email_activity(email_id, event, user_id, created_at) VALUES(9999, 'delivered', 73,'2021-1-1');

SQL

with cte AS
  (select *
   from `email_activity` ea1
   where `email_id` = 7518
     and (`event` not in ('bounced', 'deferred', 'delivered')
          -- keep a status row only if no newer status row exists for the same email and user
          or not exists (select * from `email_activity` ea2
                         where ea2.`email_id` = ea1.`email_id`
                         and ea2.`user_id` = ea1.`user_id`
                         and ea2.`event` IN ('bounced', 'deferred', 'delivered')
                         and ea2.`created_at` > ea1.`created_at`)))
select `event`, COUNT(1) as count, COUNT(DISTINCT user_id) as unique_count
from cte
group by `event`;

Demo

https://dbfiddle.uk/?rdbms=mysql_8.0&fiddle=725c38e7068fdd68cfb7315a798bdd7e

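Explanation

The common table expression (CTE) includes all rows whose event type is not 'bounced', 'deferred', or 'delivered' (i.e. it is 'click', unless there are other possibilities I am not aware of). It also includes rows whose event type is in that list, provided there is no more recent record whose event type is also in that list.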
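One detail worth checking in a filter of this shape is operator precedence: in SQL, AND binds more tightly than OR, so `a AND b OR c` is evaluated as `(a AND b) OR c`. The sketch below (SQLite driven from Python, which is an assumption; the demo above uses MySQL 8.0) loads the sample rows for emails 7518 and 9999 and compares a parenthesized filter that also correlates on email_id with an unparenthesized one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE email_activity("
    "id INTEGER PRIMARY KEY, email_id INT, user_id INT, event TEXT, created_at TEXT)"
)
sample = [
    ("click", 25, "2021-01-20"),
    ("click", 73, "2021-01-05"),
    ("bounced", 45, "2021-01-19"),
    ("deferred", 45, "2021-01-17"),
    ("delivered", 19, "2021-01-01"),
    ("delivered", 25, "2021-01-01"),
    ("delivered", 73, "2021-01-01"),
]
for email_id in (7518, 9999):
    conn.executemany(
        "INSERT INTO email_activity(email_id, event, user_id, created_at) "
        "VALUES (?, ?, ?, ?)",
        [(email_id, event, user_id, day) for event, user_id, day in sample],
    )

# The email filter applies to both branches of the OR.
GROUPED = """
SELECT event, COUNT(*) FROM email_activity ea1
WHERE email_id = 7518
  AND (event NOT IN ('bounced', 'deferred', 'delivered')
       OR NOT EXISTS (SELECT 1 FROM email_activity ea2
                      WHERE ea2.email_id = ea1.email_id
                        AND ea2.user_id = ea1.user_id
                        AND ea2.event IN ('bounced', 'deferred', 'delivered')
                        AND ea2.created_at > ea1.created_at))
GROUP BY event ORDER BY event
"""

# Without parentheses the email filter only guards the first branch,
# so rows from email 9999 pass through the NOT EXISTS branch.
UNGROUPED = """
SELECT event, COUNT(*) FROM email_activity ea1
WHERE email_id = 7518
  AND event NOT IN ('bounced', 'deferred', 'delivered')
   OR NOT EXISTS (SELECT 1 FROM email_activity ea2
                  WHERE ea2.user_id = ea1.user_id
                    AND ea2.event IN ('bounced', 'deferred', 'delivered')
                    AND ea2.created_at > ea1.created_at)
GROUP BY event ORDER BY event
"""

grouped = conn.execute(GROUPED).fetchall()
ungrouped = conn.execute(UNGROUPED).fetchall()
print(grouped)
print(ungrouped)
```

With only one email in the table the two forms happen to return the same counts, which is why the difference is easy to miss on a small fiddle.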