SQL: how to select unique users from every timestamp over every month
I have a table in SQL with 1000 users. Here are the first 10 entries:
table1:
+-------------+---------------+-------------------------+
| User_id | Mail_id | Reg_date(y-m-d) |
+-------------+---------------+-------------------------+
| user1 | email1 | 2019-11-09 12:23:53.253 |
| user1 | email1 | 2019-11-09 12:24:53.253 |
| user1 | email1 | 2019-11-09 13:20:53.253 |
| user1 | email1 | 2019-08-09 11:23:53.253 |
| user2 | email2 | 2019-09-08 10:29:53.253 |
| user3 | email3 | 2019-09-08 14:23:53.253 |
| user1 | email1 | 2019-12-09 13:20:53.253 |
| user1 | email1 | 2019-10-10 11:23:53.253 |
| user1 | email1 | 2019-10-13 10:29:53.253 |
| user2 | email5 | 2019-11-14 10:29:53.253 |
+-------------+---------------+-------------------------+
table2:
+-------------+---------------+-------------------------+
| User_id | Session_id | Activity_date(y-m-d) |
+-------------+---------------+-------------------------+
| user1 | s1 | 2019-11-09 12:23:53.253 |
| user1 | s2 | 2019-12-09 12:24:53.253 |
| user1 | s3 | 2019-12-09 13:20:53.253 |
| user1 | s4 | 2020-01-09 11:23:53.253 |
| user2 | s5 | 2019-12-08 10:29:53.253 |
| user3 | s6 | 2020-02-08 14:23:53.253 |
| user1 | s7 | 2019-12-09 13:20:53.253 |
| user1 | s8 | 2020-03-10 11:23:53.253 |
| user1 | s9 | 2020-02-13 10:29:53.253 |
| user2 | s10 | 2020-03-14 10:29:53.253 |
+-------------+---------------+-------------------------+
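I want to select only DISTINCT users, provided that their Activity_date(y-m-d) starts on a date between 2019-11-01 and 2019-12-15 and that they are present in every month between 2019-11-01 and 2020-03-30 (tracking user activity for consecutive 3 months).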
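Output:
(Here User1 is the only user whose Activity_date(y-m-d) starts between 2019-11-01 and 2019-12-15 and who is then present in every following month between 2019-11-01 and 2020-03-30. User2's starting Activity_date(y-m-d) is between 2019-11-01 and 2019-12-15, but Activity_date(y-m-d) is not present in every month (January and February are missing), so this user is not considered in the output.)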
+-------------+---------------+-------------------------+
| User_id | Mail_id | Activity_date(y-m-d) |
+-------------+---------------+-------------------------+
| user1 | email1 | 2019-11-09 12:23:53.253 |
| user1 | email1 | 2019-12-09 12:24:53.253 |
| user1 | email1 | 2019-12-09 13:20:53.253 |
| user1 | email1 | 2020-01-09 11:23:53.253 |
| user1 | email1 | 2019-12-09 14:20:53.253 |
| user1 | email1 | 2020-02-13 10:29:53.253 |
| user1 | email1 | 2020-03-10 11:23:53.253 |
+-------------+---------------+-------------------------+
How can I achieve this in SQL (Redshift)?
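Here is a fiddle that recreates your sample results from your sample data. The database fiddle used for testing runs Postgres, but this should also work on Redshift. Let me know if this works.
The approach first generates all consecutive months using the recursive CTE month_periods and the CTE months, then checks in users_active_in_months whether a user was active in every generated year-month. The final projection selects User_id, Mail_id and Activity_date from the target dataset where Activity_date falls between 2019-11-01 and 2019-12-15 and between 2019-11-01 and 2020-03-30, or simply between 2019-11-01 and 2020-03-30, because that range fully contains the first one.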
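To make the month-generation step concrete, here is a minimal Python sketch of the year-month keys that the months CTE is meant to produce for this window. It is only an illustration and not part of the query itself; the helper name month_keys is hypothetical.

```python
from datetime import date


def month_keys(start: date, end: date) -> list:
    """Return one YYYYMM integer per calendar month from start's month through end's month."""
    keys = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        keys.append(year * 100 + month)
        # Advance one month, rolling over into the next year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return keys


print(month_keys(date(2019, 11, 1), date(2020, 3, 30)))
# -> [201911, 201912, 202001, 202002, 202003]
```

A user qualifies only if their activity covers every key in this list, which is what the count comparison in users_active_in_months checks.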
CREATE TABLE table1 (
User_id VARCHAR(5),
Mail_id VARCHAR(6),
Reg_date TIMESTAMP
);
INSERT INTO table1
(User_id, Mail_id, Reg_date)
VALUES
('user1', 'email1', '2019-11-09 12:23:53.253'),
('user1', 'email1', '2019-11-09 12:24:53.253'),
('user1', 'email1', '2019-11-09 13:20:53.253'),
('user1', 'email1', '2019-08-09 11:23:53.253'),
('user2', 'email2', '2019-09-08 10:29:53.253'),
('user3', 'email3', '2019-09-08 14:23:53.253'),
('user1', 'email1', '2019-12-09 13:20:53.253'),
('user1', 'email1', '2019-10-10 11:23:53.253'),
('user1', 'email1', '2019-10-13 10:29:53.253'),
('user2', 'email5', '2019-11-14 10:29:53.253');
CREATE TABLE table2 (
User_id VARCHAR(5),
Session_id VARCHAR(3),
Activity_date TIMESTAMP
);
INSERT INTO table2
(User_id, Session_id, Activity_date)
VALUES
('user1', 's1', '2019-11-09 12:23:53.253'),
('user1', 's2', '2019-12-09 12:24:53.253'),
('user1', 's3', '2019-12-09 13:20:53.253'),
('user1', 's4', '2020-01-09 11:23:53.253'),
('user2', 's5', '2019-12-08 10:29:53.253'),
('user3', 's6', '2020-02-08 14:23:53.253'),
('user1', 's7', '2019-12-09 13:20:53.253'),
('user1', 's8', '2020-03-10 11:23:53.253'),
('user1', 's9', '2020-02-13 10:29:53.253'),
('user2', 's10', '2020-03-14 10:29:53.253');
Query #1
WITH RECURSIVE month_periods AS (
    -- Anchor: first month of the tracking window
    SELECT '2019-11-01'::timestamp AS dt
    UNION ALL
    -- Step: add one month while the next month still starts inside the window
    SELECT (dt + interval '1 month')::timestamp AS dt
    FROM month_periods
    WHERE dt + interval '1 month' <= '2020-03-30'
),
months AS (
    -- One YYYYMM key per generated month (201911 through 202003)
    SELECT EXTRACT(YEAR FROM dt) * 100 + EXTRACT(MONTH FROM dt) AS ym
    FROM month_periods
),
users_active_in_months AS (
    -- Users that have at least one activity row in every generated month
    SELECT
        User_id
    FROM (
        SELECT DISTINCT
            m.ym,
            t2.User_id
        FROM
            months m
        LEFT JOIN
            table2 t2 ON
            (EXTRACT(YEAR FROM t2.Activity_date) * 100 + EXTRACT(MONTH FROM t2.Activity_date)) = m.ym
        WHERE t2.User_id IS NOT NULL
    ) t
    GROUP BY User_id
    HAVING COUNT(User_id) = (SELECT COUNT(1) FROM months)
)
SELECT DISTINCT
    t2.User_id,
    t1.Mail_id,
    t2.Activity_date
FROM
    table2 t2
INNER JOIN
    table1 t1 ON t2.User_id = t1.User_id
INNER JOIN
    users_active_in_months um ON um.User_id = t1.User_id
WHERE
    t2.Activity_date BETWEEN '2019-11-01' AND '2020-03-30';
user_id | mail_id | activity_date |
---|---|---|
user1 | email1 | 2019-11-09T12:23:53.253Z |
user1 | email1 | 2019-12-09T12:24:53.253Z |
user1 | email1 | 2019-12-09T13:20:53.253Z |
user1 | email1 | 2020-01-09T11:23:53.253Z |
user1 | email1 | 2020-02-13T10:29:53.253Z |
user1 | email1 | 2020-03-10T11:23:53.253Z |
You can use aggregation. Based on your description, table1 does not seem to be needed. You can get the user_id values with:
select t2.user_id
from table2 t2
group by user_id
having min(activity_date) >= '2019-11-01' and
min(activity_date) <= '2019-12-15' and
count(distinct date_trunc('month', activity_date)) = 5;
You can then join in any other information you need.
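One way to try the aggregation together with the follow-up join is Python's built-in sqlite3 module. This is only a sketch under some assumptions: SQLite has no date_trunc, so strftime('%Y-%m', ...) stands in for it; timestamps are stored as text; table1 is reduced to one row per (User_id, Mail_id) pair; and the CTE name qualifying is made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (User_id TEXT, Mail_id TEXT, Reg_date TEXT);
CREATE TABLE table2 (User_id TEXT, Session_id TEXT, Activity_date TEXT);
-- Assumption: one row per (User_id, Mail_id) pair from the question's table1
INSERT INTO table1 VALUES
  ('user1', 'email1', '2019-11-09 12:23:53.253'),
  ('user2', 'email2', '2019-09-08 10:29:53.253'),
  ('user3', 'email3', '2019-09-08 14:23:53.253'),
  ('user2', 'email5', '2019-11-14 10:29:53.253');
INSERT INTO table2 VALUES
  ('user1', 's1',  '2019-11-09 12:23:53.253'),
  ('user1', 's2',  '2019-12-09 12:24:53.253'),
  ('user1', 's3',  '2019-12-09 13:20:53.253'),
  ('user1', 's4',  '2020-01-09 11:23:53.253'),
  ('user2', 's5',  '2019-12-08 10:29:53.253'),
  ('user3', 's6',  '2020-02-08 14:23:53.253'),
  ('user1', 's7',  '2019-12-09 13:20:53.253'),
  ('user1', 's8',  '2020-03-10 11:23:53.253'),
  ('user1', 's9',  '2020-02-13 10:29:53.253'),
  ('user2', 's10', '2020-03-14 10:29:53.253');
""")

query = """
WITH qualifying AS (
    -- Same conditions as the aggregation above; strftime replaces date_trunc
    SELECT User_id
    FROM table2
    GROUP BY User_id
    HAVING MIN(Activity_date) >= '2019-11-01'
       AND MIN(Activity_date) <= '2019-12-15'
       AND COUNT(DISTINCT strftime('%Y-%m', Activity_date)) = 5
)
SELECT DISTINCT t2.User_id, t1.Mail_id, t2.Activity_date
FROM table2 t2
JOIN qualifying q ON q.User_id = t2.User_id
JOIN table1 t1 ON t1.User_id = t2.User_id
ORDER BY t2.Activity_date
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

On the sample data this prints six distinct rows, all for user1 with email1. On Redshift or Postgres the date_trunc('month', ...) form from the query above would be used instead of strftime.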
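Note: the above answers the specific question you asked. However, because you want activity in every month, it effectively requires the first date to be in November rather than December. You can adjust the logic to handle this.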
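One possible adjustment, sketched here with Python's sqlite3 module, is to derive the required number of months from each user's first activity month instead of hard-coding 5. Several things in this sketch are assumptions rather than part of the answer: strftime replaces date_trunc because SQLite lacks it, timestamps are stored as text, the upper bound reuses the 2020-03-30 date from the question, and user4 is a hypothetical user added only to show a December start.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table2 (User_id TEXT, Session_id TEXT, Activity_date TEXT)")
conn.executemany(
    "INSERT INTO table2 VALUES (?, ?, ?)",
    [
        ("user1", "s1", "2019-11-09 12:23:53.253"),
        ("user1", "s2", "2019-12-09 12:24:53.253"),
        ("user1", "s3", "2019-12-09 13:20:53.253"),
        ("user1", "s4", "2020-01-09 11:23:53.253"),
        ("user2", "s5", "2019-12-08 10:29:53.253"),
        ("user3", "s6", "2020-02-08 14:23:53.253"),
        ("user1", "s7", "2019-12-09 13:20:53.253"),
        ("user1", "s8", "2020-03-10 11:23:53.253"),
        ("user1", "s9", "2020-02-13 10:29:53.253"),
        ("user2", "s10", "2020-03-14 10:29:53.253"),
        # Hypothetical user4 (not in the question's data): first active in December
        ("user4", "s11", "2019-12-02 09:00:00.000"),
        ("user4", "s12", "2020-01-15 09:00:00.000"),
        ("user4", "s13", "2020-02-20 09:00:00.000"),
        ("user4", "s14", "2020-03-05 09:00:00.000"),
    ],
)

query = """
SELECT User_id
FROM table2
GROUP BY User_id
HAVING MIN(Activity_date) >= '2019-11-01'
   AND MIN(Activity_date) <= '2019-12-15'
   -- Distinct active months up to the end of the window ...
   AND COUNT(DISTINCT CASE WHEN Activity_date <= '2020-03-30'
                           THEN strftime('%Y-%m', Activity_date) END)
       -- ... must equal the number of months from the first active month through March 2020
       = (2020 * 12 + 3)
         - (CAST(strftime('%Y', MIN(Activity_date)) AS INTEGER) * 12
            + CAST(strftime('%m', MIN(Activity_date)) AS INTEGER))
         + 1
ORDER BY User_id
"""
users = [row[0] for row in conn.execute(query)]
print(users)
# -> ['user1', 'user4']
```

With the fixed count of 5, user4 would be dropped even though they are active in every month from their December start. user2 is still excluded because January and February are missing.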