How do you find and count occurrences of sequences being met?
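I have a dataset running from 1/2016 to 2/2017 with users, months, home locations, visited locations, and visit counts. A user can visit several locations in a month, or none, so the dataset can contain multiple rows per user per month.
Here is the data; after it I'll walk you through the Rextester link below.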
| user_id | year_month | homelocid | visitlocid | fact_count |
|---------|------------|-----------|------------|------------|
| 1234 | 201601 | L9900 | L9987 | 8 |
| 1234 | 201601 | L9900 | L4456 | 1 |
| 1234 | 201602 | L9900 | L9987 | 5 |
| 1234 | 201603 | L9900 | L9987 | 1 |
| 1234 | 201604 | L9900 | L9987 | 4 |
| 1234 | 201605 | L9900 | L9987 | 3 |
| 1234 | 201605 | L9900 | L4456 | 3 |
| 1234 | 201605 | L9900 | L5567 | 26 |
| 1234 | 201606 | L9900 | L4456 | 1 |
| 1234 | 201606 | L9900 | L9987 | 21 |
| 1234 | 201606 | L9900 | L6678 | 2 |
| 1234 | 201606 | L9900 | L5567 | 1 |
| 1234 | 201607 | L9900 | L9987 | 6 |
| 1234 | 201607 | L9900 | L5567 | 22 |
| 1234 | 201607 | L9900 | L6678 | 25 |
| 1234 | 201608 | NULL | NULL | NULL |
| 1234 | 201609 | L9900 | L9987 | 1 |
| 1234 | 201609 | L9900 | L4456 | 3 |
| 1234 | 201610 | L9900 | L9987 | 1 |
| 1234 | 201611 | NULL | NULL | NULL |
| 1234 | 201612 | NULL | NULL | NULL |
| 1234 | 201701 | NULL | NULL | NULL |
| 1234 | 201702 | L9900 | L9987 | 6 |
| 1234 | 201702 | L9900 | L2334 | 1 |
| 5556 | 201601 | L9900 | L3445 | 1 |
| 5556 | 201602 | L9900 | L3445 | 3 |
| 5556 | 201603 | L9900 | L3445 | 1 |
| 5556 | 201604 | L9900 | L3445 | 21 |
| 5556 | 201605 | L9900 | L3445 | 1 |
| 5556 | 201606 | L9900 | L3445 | 2 |
| 5556 | 201607 | NULL | NULL | NULL |
| 5556 | 201608 | NULL | NULL | NULL |
| 5556 | 201609 | L9900 | L3445 | 1 |
| 5556 | 201610 | NULL | NULL | NULL |
| 5556 | 201611 | NULL | NULL | NULL |
| 5556 | 201612 | L9900 | L3445 | 1 |
| 5556 | 201701 | NULL | NULL | NULL |
| 5556 | 201702 | L9900 | L3445 | 1 |
| 9988 | 201601 | L9900 | L2334 | 27 |
| 9988 | 201602 | L9900 | L3445 | 22 |
| 9988 | 201602 | L9900 | L9987 | 1 |
| 9988 | 201602 | L9900 | L2334 | 2 |
| 9988 | 201603 | L9900 | L4556 | 1 |
| 9988 | 201603 | L9900 | L2334 | 2 |
| 9988 | 201603 | L9900 | L3445 | 21 |
| 9988 | 201603 | L9900 | L9987 | 1 |
| 9988 | 201604 | L9900 | L3445 | 25 |
| 9988 | 201604 | L9900 | L2334 | 1 |
| 9988 | 201604 | L9900 | L4556 | 1 |
| 9988 | 201605 | L9900 | L3445 | 26 |
| 9988 | 201605 | L9900 | L2334 | 1 |
| 9988 | 201606 | L9900 | L3445 | 22 |
| 9988 | 201607 | L9900 | L3445 | 5 |
| 9988 | 201608 | L9900 | L3445 | 6 |
| 9988 | 201609 | L9900 | L3445 | 11 |
| 9988 | 201610 | L9900 | L3445 | 17 |
| 9988 | 201611 | L9900 | L3445 | 9 |
| 9988 | 201612 | L9900 | L2334 | 1 |
| 9988 | 201612 | L9900 | L3445 | 14 |
| 9988 | 201701 | L9900 | L3445 | 14 |
| 9988 | 201701 | L9900 | L2334 | 2 |
| 9988 | 201702 | L9900 | L4556 | 1 |
| 9988 | 201702 | L9900 | L3445 | 9 |
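I'm trying to count the number of times a user had:
- more than 20 fact_count in 1 month
- more than 20 fact_count 2 months in a row
- more than 20 fact_count 3 months in a row
I also need to summarize the data at two levels:
- At the user_id level (denominator = # of users, numerator = # of users who had more than 20 fact_count for 1, 2, and 3 months in a row)
- At the user_id AND month level (denominator = # of months per user (14), numerator = # of times users had >20 fact_count for 1, 2, and 3 mos. in a row)
Here are the results I'm looking for from the dataset (note: this is not query output, just the results):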
+-------------------+------------+------------------+
| | User level | User-month level |
+-------------------+------------+------------------+
| 1 month | 3/3 | 10/42 |
+-------------------+------------+------------------+
| 2 months in a row | 2/3 | 4/42 |
+-------------------+------------+------------------+
| 3 months in a row | 2/3 | 3/42 |
+-------------------+------------+------------------+
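Here is how I've transformed the data so far to get it ready for summarizing; you can find it in this Rextester link: http://rextester.com/DBF40287
- The mem_month CTE uses CASE to create a 1 or 0 'MET_NOTMET' field depending on whether fact_count is >20. At this point there are still multiple rows per month, one for each visit location id.
- The mem_month_rank CTE uses RANK over the CASE result from mem_month to find whether the user had a fact_count >20 in the month. If a user visited three locations in a month and only one of them had fact_count >20, this identifies the offending row.
Here is the output of those queries, getting closer to a normalized table: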
| mem_month | | mem_month_rank |
|--------|------------|------------|---|----------------------------------------|
| user_id| year_month | met_notmet | | user_id| year_month | met_notmet | rnk |
| 1234 | 201601 | 0 | | 1234 | 201601 | 0 | 1 |
| 1234 | 201601 | 0 | | 1234 | 201602 | 0 | 1 |
| 1234 | 201602 | 0 | | 1234 | 201603 | 0 | 1 |
| 1234 | 201603 | 0 | | 1234 | 201604 | 0 | 1 |
| 1234 | 201604 | 0 | | 1234 | 201605 | 0 | 1 |
| 1234 | 201605 | 0 | | 1234 | 201605 | 1 | 2 |
| 1234 | 201605 | 1 | | 1234 | 201606 | 0 | 1 |
| 1234 | 201606 | 0 | | 1234 | 201606 | 1 | 2 |
| 1234 | 201606 | 1 | | 1234 | 201607 | 0 | 1 |
| 1234 | 201606 | 0 | | 1234 | 201607 | 1 | 2 |
| 1234 | 201607 | 1 | | 1234 | 201608 | 0 | 1 |
| 1234 | 201607 | 0 | | 1234 | 201609 | 0 | 1 |
| 1234 | 201607 | 1 | | 1234 | 201610 | 0 | 1 |
| 1234 | 201608 | 0 | | 1234 | 201611 | 0 | 1 |
| 1234 | 201609 | 0 | | 1234 | 201612 | 0 | 1 |
| 1234 | 201609 | 0 | | 1234 | 201701 | 0 | 1 |
| 1234 | 201610 | 0 | | 1234 | 201702 | 0 | 1 |
| 1234 | 201611 | 0 |
| 1234 | 201612 | 0 |
| 1234 | 201701 | 0 |
| 1234 | 201702 | 0 |
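- Next, the mem_month_norm CTE selects from mem_month_rank and inner joins to a subquery of user, month, and max(rank), creating a normalized table with 1 row per user per month that says whether they had an offending >20 fact_count for any location in that month.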
| user_id | year_month | met_notmet |
|---------|------------|------------|
| 1234 | 201601 | 0 |
| 1234 | 201602 | 0 |
| 1234 | 201603 | 0 |
| 1234 | 201604 | 0 |
| 1234 | 201605 | 1 |
| 1234 | 201606 | 1 |
| 1234 | 201607 | 1 |
| 1234 | 201608 | 0 |
| 1234 | 201609 | 0 |
| 1234 | 201610 | 0 |
| 1234 | 201611 | 0 |
| 1234 | 201612 | 0 |
| 1234 | 201701 | 0 |
| 1234 | 201702 | 0 |
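The RANK-then-join normalization above boils down to asking "did any row for this user and month exceed 20?". Here is a minimal Python sketch of that collapse, only to make the logic concrete. The function name `monthly_flags` and the tuple layout are made up for illustration, and `None` stands in for NULL.

```python
from collections import defaultdict

THRESHOLD = 20  # from the question: a month is "met" when some fact_count > 20


def monthly_flags(rows):
    """Collapse (user_id, year_month, fact_count) rows to one 0/1 flag per user-month."""
    flags = defaultdict(int)
    for user_id, year_month, fact_count in rows:
        # A NULL fact_count (no visits that month) never counts as met.
        met = int(fact_count is not None and fact_count > THRESHOLD)
        key = (user_id, year_month)
        flags[key] = max(flags[key], met)
    return dict(flags)


# user 1234, months 201604 through 201608, copied from the sample data
rows = [
    (1234, 201604, 4),
    (1234, 201605, 3), (1234, 201605, 3), (1234, 201605, 26),
    (1234, 201606, 1), (1234, 201606, 21), (1234, 201606, 2), (1234, 201606, 1),
    (1234, 201607, 6), (1234, 201607, 22), (1234, 201607, 25),
    (1234, 201608, None),
]
flags = monthly_flags(rows)
```

The result has one entry per user and month, with 1 for 201605, 201606 and 201607 and 0 for the other two months, which matches the normalized table.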
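OK, this is where I need your help: I don't know the best way to count occurrences of consecutive offending values. I think what needs to happen is some kind of sequence 'reset' once a user has had offending values 2 months in a row.
For example, when counting:
- 2 months in a row - user_id 1234 should be counted once, because there are two consecutive offending MET_NOTMET values in 201605 and 201606. If he also had an offending value in 201608, I assume this 'reset' would need to happen so that he gets counted again for 2 months in a row (07 to 08).
- 3 months in a row - user_id 1234 should be counted once for the three offending MET_NOTMET values (201605 through 201607). Likewise, if he had 6 consecutive offending values from 201605 to 201610, I assume there would be a 'reset' that counts each group of 3 offending values as 1 occurrence.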
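To pin down the counting rule described in the examples above, here is a small Python sketch of the 'reset' logic. It is not SQL and not taken from the Rextester link; `count_streaks` and `summarize` are hypothetical names. It assumes one 0/1 MET_NOTMET flag per user per month, in month order, as in the normalized table. The flags for users 5556 and 9988 were derived by hand from the sample data.

```python
def count_streaks(flags, k):
    """Count non-overlapping groups of k consecutive 1s, resetting after each group."""
    count = 0
    streak = 0
    for met in flags:
        streak = streak + 1 if met else 0
        if streak == k:
            count += 1
            streak = 0  # the 'reset': the next group has to start from scratch
    return count


def summarize(flags_by_user, k):
    """Return ((users met, users), (occurrences, user-months)) for group size k."""
    per_user = {user: count_streaks(flags, k) for user, flags in flags_by_user.items()}
    users_met = sum(1 for c in per_user.values() if c > 0)
    user_months = sum(len(flags) for flags in flags_by_user.values())
    return (users_met, len(flags_by_user)), (sum(per_user.values()), user_months)


# One flag per month, 201601 through 201702 (14 months per user).
flags_by_user = {
    1234: [0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
    5556: [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    9988: [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
}

for k in (1, 2, 3):
    print(k, summarize(flags_by_user, k))
# 1 ((3, 3), (10, 42))
# 2 ((2, 3), (4, 42))
# 3 ((2, 3), (3, 42))
```

Under these assumptions the output lines up with the expected results table: 3/3 and 10/42 for 1 month, 2/3 and 4/42 for 2 months in a row, 2/3 and 3/42 for 3 months in a row.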
Again, the link is: http://rextester.com/DBF40287
If you need any clarification or additional information, please let me know and I'll revise the post. Thanks!
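This is a pretty complicated question and I'm not sure I fully understand it, but here are my initial thoughts based on how I read the situation.
This query should give the results for summary level 1: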
WITH mem_months AS (
    -- Note: this sums fact_count across all locations in the month. If the >20
    -- threshold should apply to a single location, use max(fact_count) instead.
    SELECT user_id, year_month, sum(fact_count) AS visits1
    FROM tbl
    GROUP BY 1,2
),
mem_consec AS (
    SELECT user_id, year_month, visits1, visits2, visits3,
           visits1 > 20 AS met1,
           -- need case on these to account for nulls in the lag
           CASE WHEN visits1 > 20 AND visits2 > 20 THEN TRUE ELSE FALSE END AS met2,
           CASE WHEN visits1 > 20 AND visits2 > 20 AND visits3 > 20 THEN TRUE ELSE FALSE END AS met3
    FROM (
        SELECT user_id, year_month, visits1,
               -- Seems like you might want lag instead of lead to the current "month".
               -- Where 3 in a row was met, the attributed month
               -- would be the 3rd in the sequence.
               LAG(visits1, 1) OVER (PARTITION BY user_id ORDER BY year_month) AS visits2,
               LAG(visits1, 2) OVER (PARTITION BY user_id ORDER BY year_month) AS visits3
        FROM mem_months
    ) t
),
mem_compliance AS (
    SELECT user_id, bool_or(met1) AS any1, bool_or(met2) AS any2, bool_or(met3) AS any3
    FROM mem_consec
    GROUP BY 1
)
-- sum() rather than count(): count() would also count the 0s from the ELSE branch.
-- The numeric cast avoids integer division.
SELECT sum(CASE WHEN any1 THEN 1 ELSE 0 END)::numeric / count(*) AS user_level_1,
       sum(CASE WHEN any2 THEN 1 ELSE 0 END)::numeric / count(*) AS user_level_2,
       sum(CASE WHEN any3 THEN 1 ELSE 0 END)::numeric / count(*) AS user_level_3
FROM mem_compliance
And, using the same CTEs, this gives summary level 2:
-- mem_consec has one row per user per month, so count(*) is the number of
-- user-months (42 in the sample data). The numeric cast avoids integer division.
SELECT sum(CASE WHEN met1 THEN 1 ELSE 0 END)::numeric / count(*) AS user_month_level_1,
       sum(CASE WHEN met2 THEN 1 ELSE 0 END)::numeric / count(*) AS user_month_level_2,
       sum(CASE WHEN met3 THEN 1 ELSE 0 END)::numeric / count(*) AS user_month_level_3
FROM mem_consec
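At this point, I think I see what you're actually asking... In these level_2 and level_3 versions the count keeps growing as the run of consecutive months continues. Every 3-in-a-row also counts as two 2-in-a-rows. For example, with four in a row you'd get two 3s and three 2s.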
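To see the overlap concretely, here is a short Python sketch of what the LAG-based met2/met3 flags end up counting. It is a sliding window over 0/1 monthly flags, with `count_windows` as a made-up name for illustration.

```python
def count_windows(flags, k):
    """Count months where the current month and the k-1 months before it are all met."""
    return sum(
        1
        for i in range(k - 1, len(flags))
        if all(flags[i - k + 1 : i + 1])
    )


run_of_four = [0, 1, 1, 1, 1, 0]
twos = count_windows(run_of_four, 2)    # 3 overlapping 2-in-a-rows
threes = count_windows(run_of_four, 3)  # 2 overlapping 3-in-a-rows
```

A run of n met months yields n - k + 1 windows of size k, whereas counting non-overlapping groups would give n // k (2 and 1 for this run).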
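Here is one way to get non-overlapping sums. There must be a way to simplify it, but maybe it's a start. The idea is to find the start and end months of each run, then take the length of each run and divide it by the required group size (2 or 3). Not-met records, met records in the middle of a run, and single met records are discarded.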
-- Append this CTE to the WITH list above.
mem_consec_sums AS (
    SELECT user_id, sum(mon2s) AS mon2s, sum(mon3s) AS mon3s
    FROM (
        -- Run length is end_idx - start_idx + 1; integer division gives the
        -- number of non-overlapping groups of 2 or 3 months in the run.
        SELECT user_id,
               ((end_idx - start_idx + 1) / 2)::INT AS mon2s,
               ((end_idx - start_idx + 1) / 3)::INT AS mon3s
        FROM (
            -- After the filter below, rows alternate start, end, start, end, ...
            SELECT user_id, year_month, met_start, idx AS start_idx,
                   LEAD(idx, 1) OVER (PARTITION BY user_id ORDER BY year_month) AS end_idx
            FROM (
                SELECT user_id, year_month, met1, idx,
                       CASE WHEN met1 AND (pre_met IS NULL OR NOT pre_met) THEN TRUE ELSE FALSE END AS met_start,
                       CASE WHEN met1 AND (post_met IS NULL OR NOT post_met) THEN TRUE ELSE FALSE END AS met_end
                FROM (
                    SELECT user_id, year_month, met1,
                           LAG(met1, 1) OVER (PARTITION BY user_id ORDER BY year_month) AS pre_met,
                           LEAD(met1, 1) OVER (PARTITION BY user_id ORDER BY year_month) AS post_met,
                           ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY year_month) AS idx
                    FROM mem_consec
                ) flagged
            ) bounds
            -- Keep run boundaries only; single-month runs are both start and end.
            WHERE (met_start OR met_end) AND NOT (met_start AND met_end)
        ) pairs
        WHERE met_start
    ) runs
    GROUP BY 1
)
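For checking the arithmetic outside the database, the same run-length idea can be sketched in a few lines of Python. This is only an illustration of the technique, not a translation of the query above: it assumes a list of 0/1 monthly flags per user in month order, and `run_lengths` and `non_overlapping` are hypothetical names.

```python
from itertools import groupby


def run_lengths(flags):
    """Lengths of each maximal run of consecutive met (truthy) months."""
    return [sum(1 for _ in group) for met, group in groupby(flags) if met]


def non_overlapping(flags, k):
    """Number of non-overlapping groups of k months across all runs."""
    return sum(length // k for length in run_lengths(flags))


# user 9988 from the sample data: met in 201601 through 201606, then not met
flags_9988 = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
mon2s = non_overlapping(flags_9988, 2)  # 6 // 2
mon3s = non_overlapping(flags_9988, 3)  # 6 // 3
```

Note that the run length is the number of months in the run, i.e. end index minus start index plus one, which is why a 3-month run gives exactly one group of 3.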