Count only first unique combination in grouped query

I have a table that looks like this:

| date       | user_id | event_id | message_id |
|------------|---------|----------|------------|
| 2021-08-04 | 1       | 1        | 1          |
| 2021-08-04 | 1       | 1        | 2          |
| 2021-08-04 | 1       | 2        | 3          |
| 2021-08-04 | 2       | 1        | 4          |
| 2021-08-05 | 1       | 1        | 1          |
| 2021-08-05 | 2       | 2        | 5          |

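I want to group everything by user_id, date, and event_id. The catch is that I want to count unique (date, user, event, message) combinations and attribute each one only to the date row where it first appears. In other words, if the same message_id occurs with the same user_id and the same event_id on different dates, I want to count it only once, in the date-user-event row where that message first appears. So this is what I want to get: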

| date       | user_id | event_id | count | count_unique |
|------------|---------|----------|-------|--------------|
| 2021-08-04 | 1       | 1        | 2     | 2            | <--- Unique count is 2 because this is the first date when two unique combinations of user+event+message found
| 2021-08-04 | 1       | 2        | 1     | 1            |
| 2021-08-04 | 2       | 1        | 1     | 1            |
| 2021-08-05 | 1       | 1        | 1     | 0            | <--- Unique count is 0, because this message_id for the same user and event already exists for previous date
| 2021-08-05 | 2       | 2        | 1     | 1            |

It's a bit tricky, and I'm fairly sure it isn't possible, but I still need to make sure.

I came up with this query:

SELECT
    date,
    user_id,
    event_id,
    COUNT(*) as count,
    COUNT(DISTINCT message_id) as count_unique
FROM events
GROUP BY user_id, event_id, date

But the result I get is obviously not what I want:

| date       | user_id | event_id | count | count_unique |
|------------|---------|----------|-------|--------------|
| 2021-08-04 | 1       | 1        | 2     | 2            |
| 2021-08-04 | 1       | 2        | 1     | 1            |
| 2021-08-04 | 2       | 1        | 1     | 1            |
| 2021-08-05 | 1       | 1        | 1     | 1            | <--- Unique count is 1, because it counts distinct message_ids within the group (row).
| 2021-08-05 | 2       | 2        | 1     | 1            |

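So basically I need to somehow ignore the date in the distinct count (e.g. count outside the group) and add each combination only to the row (group) whose date is the one where that combination was first found.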

This query keeps only the first date on which each user_id/event_id/message_id combination appears (using the row_number window function) and then aggregates over the filtered set:

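select
    date
  , user_id
  , event_id
  , count(distinct message_id) as count_messages
from
(
  -- rank the rows of each user/event/message combination by date;
  -- rank 1 marks the combination's first appearance
  select
      date
    , user_id
    , event_id
    , message_id
    , row_number() over
      (
        partition by user_id, event_id, message_id
        order by date asc
      ) as rank_date
  from events
) as DT
-- groups with no first-time message are filtered out here rather than reported as 0
where rank_date = 1
group by date, user_id, event_id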

In other words, this counts each user_id/event_id/message_id combination only on the first date it appears.
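If the rows with a unique count of 0 should stay in the output, one option is to keep every row and count the rank-1 rows conditionally instead of filtering them. Below is a minimal sketch of that variant, run against the question's sample data in an in-memory SQLite database from Python. The names `ROWS`, `QUERY`, and `first_seen_counts` are made up for illustration, and it assumes a SQLite build with window-function support (3.25 or later).

```python
import sqlite3

# Sample rows from the question: (date, user_id, event_id, message_id)
ROWS = [
    ("2021-08-04", 1, 1, 1),
    ("2021-08-04", 1, 1, 2),
    ("2021-08-04", 1, 2, 3),
    ("2021-08-04", 2, 1, 4),
    ("2021-08-05", 1, 1, 1),
    ("2021-08-05", 2, 2, 5),
]

# Rank each user/event/message combination by date, then count all rows
# per group and, separately, only the rows that are a first appearance.
QUERY = """
SELECT date, user_id, event_id,
       COUNT(*) AS cnt,
       SUM(CASE WHEN rn = 1 THEN 1 ELSE 0 END) AS count_unique
FROM (
    SELECT date, user_id, event_id, message_id,
           ROW_NUMBER() OVER (
               PARTITION BY user_id, event_id, message_id
               ORDER BY date
           ) AS rn
    FROM events
) AS ranked
GROUP BY date, user_id, event_id
ORDER BY date, user_id, event_id
"""


def first_seen_counts(rows):
    """Load rows into an in-memory table and run the single-pass query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events "
        "(date TEXT, user_id INTEGER, event_id INTEGER, message_id INTEGER)"
    )
    conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in first_seen_counts(ROWS):
        print(row)
```

On the sample data the 2021-08-05 row for user 1 and event 1 comes back with a count of 1 and a unique count of 0, which matches the desired table in the question.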

To compute count_unique, you only want to keep the first time a user sent each message for a given event.

To get that dataset, run this query:

select min(date) as date, user_id, event_id, message_id
    from events
    group by user_id, event_id, message_id
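To see what this intermediate dataset looks like, here is a small sketch that runs the first-appearance query on the question's sample rows using Python's built-in sqlite3 module. The helper name `first_appearances` and the constant `SAMPLE` are illustrative only, and the column names follow the question's table (`user_id`, `date`).

```python
import sqlite3

# Sample rows from the question: (date, user_id, event_id, message_id)
SAMPLE = [
    ("2021-08-04", 1, 1, 1),
    ("2021-08-04", 1, 1, 2),
    ("2021-08-04", 1, 2, 3),
    ("2021-08-04", 2, 1, 4),
    ("2021-08-05", 1, 1, 1),
    ("2021-08-05", 2, 2, 5),
]


def first_appearances(rows):
    """Return one row per user/event/message with its earliest date."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events "
        "(date TEXT, user_id INTEGER, event_id INTEGER, message_id INTEGER)"
    )
    conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(
        """
        SELECT MIN(date) AS date, user_id, event_id, message_id
        FROM events
        GROUP BY user_id, event_id, message_id
        ORDER BY user_id, event_id, message_id
        """
    ).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in first_appearances(SAMPLE):
        print(row)
```

Message 1 for user 1 and event 1 appears on both dates in the input but only once here, with the earlier date 2021-08-04.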

After that, count_unique is easy to compute:

select count(*) as count_unique, date, user_id, event_id
    from (
       -- first date on which each user/event/message combination appears
       select min(date) as date, user_id, event_id, message_id
       from events
       group by user_id, event_id, message_id ) e
 group by date, user_id, event_id;

Now you can left join it to the query that counts messages by user ID, event ID, and date:

select a.*, coalesce(b.count_unique, 0) as count_unique
   from (
     -- total messages per date/user/event
     select date, user_id, event_id, count(*) as cnt from events
     group by date, user_id, event_id
  ) a left join (
    -- first-time messages per date/user/event
    select count(*) as count_unique, date, user_id, event_id
       from (
          select min(date) as date, user_id, event_id, message_id
          from events
          group by user_id, event_id, message_id ) e
       group by date, user_id, event_id
  ) b on a.date = b.date and
        a.user_id = b.user_id and
        a.event_id = b.event_id;
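
As a quick check, the left-join approach can be run against the question's sample data. The sketch below does that with an in-memory SQLite database from Python. `SAMPLE_ROWS`, `JOIN_QUERY`, and `counts_with_first_seen` are hypothetical names, the column names follow the question's table (`user_id`), and an `ORDER BY` is added only to make the output order predictable.

```python
import sqlite3

# Sample rows from the question: (date, user_id, event_id, message_id)
SAMPLE_ROWS = [
    ("2021-08-04", 1, 1, 1),
    ("2021-08-04", 1, 1, 2),
    ("2021-08-04", 1, 2, 3),
    ("2021-08-04", 2, 1, 4),
    ("2021-08-05", 1, 1, 1),
    ("2021-08-05", 2, 2, 5),
]

# a: total messages per group; b: messages whose first appearance falls
# in that group. Groups missing from b get 0 through COALESCE.
JOIN_QUERY = """
SELECT a.date, a.user_id, a.event_id, a.cnt,
       COALESCE(b.count_unique, 0) AS count_unique
FROM (
    SELECT date, user_id, event_id, COUNT(*) AS cnt
    FROM events
    GROUP BY date, user_id, event_id
) AS a
LEFT JOIN (
    SELECT date, user_id, event_id, COUNT(*) AS count_unique
    FROM (
        SELECT MIN(date) AS date, user_id, event_id, message_id
        FROM events
        GROUP BY user_id, event_id, message_id
    ) AS e
    GROUP BY date, user_id, event_id
) AS b
  ON a.date = b.date AND a.user_id = b.user_id AND a.event_id = b.event_id
ORDER BY a.date, a.user_id, a.event_id
"""


def counts_with_first_seen(rows):
    """Load rows into an in-memory table and run the left-join query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events "
        "(date TEXT, user_id INTEGER, event_id INTEGER, message_id INTEGER)"
    )
    conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(JOIN_QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in counts_with_first_seen(SAMPLE_ROWS):
        print(row)
```

This version uses only GROUP BY and a join, so it should also work on databases without window functions.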