SQL: how to select unique users from every timestamp over every month

I have a table in SQL with 1000 users. Here are the first 10 entries:

table1:

+-------------+---------------+-------------------------+
| User_id     | Mail_id       | Reg_date(y-m-d)         |
+-------------+---------------+-------------------------+
| user1       | email1        | 2019-11-09 12:23:53.253 |
| user1       | email1        | 2019-11-09 12:24:53.253 |
| user1       | email1        | 2019-11-09 13:20:53.253 |
| user1       | email1        | 2019-08-09 11:23:53.253 |
| user2       | email2        | 2019-09-08 10:29:53.253 |
| user3       | email3        | 2019-09-08 14:23:53.253 |
| user1       | email1        | 2019-12-09 13:20:53.253 |
| user1       | email1        | 2019-10-10 11:23:53.253 |
| user1       | email1        | 2019-10-13 10:29:53.253 |
| user2       | email5        | 2019-11-14 10:29:53.253 |
+-------------+---------------+-------------------------+

table2:

+-------------+---------------+-------------------------+
| User_id     | Session_id    | Activity_date(y-m-d)    |
+-------------+---------------+-------------------------+
| user1       | s1            | 2019-11-09 12:23:53.253 |
| user1       | s2            | 2019-12-09 12:24:53.253 |
| user1       | s3            | 2019-12-09 13:20:53.253 |
| user1       | s4            | 2020-01-09 11:23:53.253 |
| user2       | s5            | 2019-12-08 10:29:53.253 |
| user3       | s6            | 2020-02-08 14:23:53.253 |
| user1       | s7            | 2019-12-09 13:20:53.253 |
| user1       | s8            | 2020-03-10 11:23:53.253 |
| user1       | s9            | 2020-02-13 10:29:53.253 |
| user2       | s10           | 2020-03-14 10:29:53.253 |
+-------------+---------------+-------------------------+

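I want to select only DISTINCT users whose Activity_date(y-m-d) starts on a date between 2019-11-01 and 2019-12-15 and who are present in every month between 2019-11-01 and 2020-03-30 (tracking user activity for 3 consecutive months).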

output:

Here User1 is the only user whose Activity_date(y-m-d) starts between 2019-11-01 and 2019-12-15 and who is then present in every month between 2019-11-01 and 2020-03-30.

User2's starting Activity_date(y-m-d) is between 2019-11-01 and 2019-12-15, but User2 has no Activity_date(y-m-d) in every month (January and February are missing), so this user is not included in the output.

+-------------+---------------+-------------------------+
| User_id     | Mail_id       | Activity_date(y-m-d)    |
+-------------+---------------+-------------------------+
| user1       | email1        | 2019-11-09 12:23:53.253 |
| user1       | email1        | 2019-12-09 12:24:53.253 |
| user1       | email1        | 2019-12-09 13:20:53.253 |
| user1       | email1        | 2020-01-09 11:23:53.253 |
| user1       | email1        | 2019-12-09 14:20:53.253 |
| user1       | email1        | 2020-02-13 10:29:53.253 |
| user1       | email1        | 2020-03-10 11:23:53.253 |
+-------------+---------------+-------------------------+

How can I achieve this in SQL (Redshift)?

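Here is a fiddle that recreates the sample result using your sample data. The DB fiddle used for testing runs postgres, but this should also work on redshift. Let me know if this works for you.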

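The approach first generates all consecutive months using the recursive CTEs month_periods and months, then checks in users_active_in_months whether a user was active in every generated year-month. The final projection selects User_id, Mail_id and Activity_date from the joined tables, where Activity_date falls between 2019-11-01 and 2019-12-15 and between 2019-11-01 and 2020-03-30, or simply between 2019-11-01 and 2020-03-30, since that range fully contains the first.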

CREATE TABLE table1 (
  User_id VARCHAR(5),
  Mail_id VARCHAR(6),
  Reg_date TIMESTAMP
);

INSERT INTO table1
  (User_id, Mail_id, Reg_date)
VALUES
  ('user1', 'email1', '2019-11-09 12:23:53.253'),
  ('user1', 'email1', '2019-11-09 12:24:53.253'),
  ('user1', 'email1', '2019-11-09 13:20:53.253'),
  ('user1', 'email1', '2019-08-09 11:23:53.253'),
  ('user2', 'email2', '2019-09-08 10:29:53.253'),
  ('user3', 'email3', '2019-09-08 14:23:53.253'),
  ('user1', 'email1', '2019-12-09 13:20:53.253'),
  ('user1', 'email1', '2019-10-10 11:23:53.253'),
  ('user1', 'email1', '2019-10-13 10:29:53.253'),
  ('user2', 'email5', '2019-11-14 10:29:53.253');

CREATE TABLE table2 (
  User_id VARCHAR(5),
  Session_id VARCHAR(3),
  Activity_date TIMESTAMP
);

INSERT INTO table2
  (User_id, Session_id, Activity_date)
VALUES
  ('user1', 's1', '2019-11-09 12:23:53.253'),
  ('user1', 's2', '2019-12-09 12:24:53.253'),
  ('user1', 's3', '2019-12-09 13:20:53.253'),
  ('user1', 's4', '2020-01-09 11:23:53.253'),
  ('user2', 's5', '2019-12-08 10:29:53.253'),
  ('user3', 's6', '2020-02-08 14:23:53.253'),
  ('user1', 's7', '2019-12-09 13:20:53.253'),
  ('user1', 's8', '2020-03-10 11:23:53.253'),
  ('user1', 's9', '2020-02-13 10:29:53.253'),
  ('user2', 's10', '2020-03-14 10:29:53.253');
  
  

Query #1

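WITH recursive month_periods AS (
    -- First day of every month in the tracking window, starting at 2019-11-01
    SELECT '2019-11-01'::timestamp as dt UNION ALL
    SELECT (dt + interval '1 month')::timestamp as dt 
    FROM month_periods 
    -- Stop once the next month would start after the end of the window
    WHERE dt + interval '1 month' <= '2020-03-30'
), 
months AS (
    -- Encode each month as a YYYYMM number, e.g. 201911
    SELECT EXTRACT(YEAR FROM dt)*100+EXTRACT(MONTH from dt) as ym from month_periods
),
users_active_in_months AS (
    SELECT 
        User_id
    FROM (
        -- One row per (month, user) pair with activity in that month
        SELECT DISTINCT 
            m.ym,
            t2.User_id
        FROM 
            months m
        INNER JOIN
            table2 t2 ON
            (EXTRACT(YEAR FROM t2.Activity_date)*100+EXTRACT(MONTH FROM t2.Activity_date)) = m.ym
    ) t
    GROUP BY User_id
    -- Keep only users seen in every generated month
    HAVING COUNT(User_id) = (SELECT COUNT(1) FROM months)
)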
SELECT DISTINCT
    t2.User_Id,
    t1.Mail_id,
    t2.Activity_date 
FROM
    table2 t2
INNER JOIN
    table1 t1 ON t2.User_id = t1.User_id
INNER JOIN
    users_active_in_months um ON um.User_id = t1.User_id
WHERE
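    t2.Activity_date BETWEEN '2019-11-01' and '2020-03-30';

+---------+---------+--------------------------+
| user_id | mail_id | activity_date            |
+---------+---------+--------------------------+
| user1   | email1  | 2019-11-09T12:23:53.253Z |
| user1   | email1  | 2019-12-09T12:24:53.253Z |
| user1   | email1  | 2019-12-09T13:20:53.253Z |
| user1   | email1  | 2020-01-09T11:23:53.253Z |
| user1   | email1  | 2020-02-13T10:29:53.253Z |
| user1   | email1  | 2020-03-10T11:23:53.253Z |
+---------+---------+--------------------------+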
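
To see why only user1 qualifies, it can help to list each user's distinct active months inside the window. This is a minimal sketch against the sample tables above, not part of the original query. The column aliases active_months and first_activity are names introduced here, and the upper bound '2020-03-31' is an assumption chosen so that the whole of 2020-03-30 is covered.

```sql
-- Count the distinct calendar months in which each user has activity
-- between 2019-11-01 and the end of 2020-03-30.
SELECT
    User_id,
    COUNT(DISTINCT DATE_TRUNC('month', Activity_date)) AS active_months,
    MIN(Activity_date) AS first_activity
FROM table2
WHERE Activity_date >= '2019-11-01'
  AND Activity_date <  '2020-03-31'
GROUP BY User_id
ORDER BY User_id;
```

With the sample data, user1 is active in 5 months, user2 in 2 (December and March) and user3 in 1 (February), so only user1 covers every month of the window.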


You can use aggregation. Based on your description, table1 does not seem to be needed. You can get the user_id values with:

select t2.user_id
from table2 t2
group by user_id
having min(activity_date) >= '2019-11-01' and
       min(activity_date) <= '2019-12-15' and
       count(distinct date_trunc('month', activity_date)) = 5;

You can then join in whatever other information you need.
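
One way to do that join, as a sketch: wrap the aggregation in a CTE and join it back to table2 for the activity rows. The CTE name qualifying_users is introduced here for illustration. Taking Mail_id from table1 assumes each qualifying user has a single Mail_id there; DISTINCT collapses the duplicates that table1's repeated rows would otherwise produce.

```sql
WITH qualifying_users AS (
    -- Users whose first activity falls in the start window and who
    -- have activity in five distinct calendar months
    SELECT user_id
    FROM table2
    GROUP BY user_id
    HAVING MIN(activity_date) >= '2019-11-01'
       AND MIN(activity_date) <= '2019-12-15'
       AND COUNT(DISTINCT DATE_TRUNC('month', activity_date)) = 5
)
SELECT DISTINCT
    t2.user_id,
    t1.mail_id,
    t2.activity_date
FROM table2 t2
INNER JOIN qualifying_users q ON q.user_id = t2.user_id
INNER JOIN table1 t1 ON t1.user_id = t2.user_id
ORDER BY t2.activity_date;
```

On the sample data only user1 passes the HAVING filter, so the result contains only user1's activity rows.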

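Note: the above answers the specific question you asked. However, because you want activity in every month, this effectively requires the first date to be in November, not December. You can adjust the logic to handle this.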
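
A sketch of one such adjustment, assuming the tracking window is 2019-11-01 through 2020-03-30: restrict the rows to the window first, then require the first activity to fall in November and all five months to be present. The bound '2020-03-31' is my assumption, chosen to include the whole of 2020-03-30.

```sql
SELECT user_id
FROM table2
-- Only look at activity inside the tracking window
WHERE activity_date >= '2019-11-01'
  AND activity_date <  '2020-03-31'
GROUP BY user_id
-- First activity must be in November 2019; this is already implied by
-- the month count below, but is kept to state the intent explicitly
HAVING MIN(activity_date) < '2019-12-01'
   AND COUNT(DISTINCT DATE_TRUNC('month', activity_date)) = 5;
```

Because the window spans exactly five calendar months, a count of 5 can only be reached by a user who is active in each of them. Filtering in WHERE also keeps activity outside the window from inflating the month count.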