您如何查找和计算满足序列的出现次数?

How do you find and count occurrences of sequences being met?

我有一个从 1/2016 到 2/2017 的数据集,包含用户、月份、家庭位置、访问位置和访问次数。用户每月可以访问多个位置,或 none。该数据集每月包含多行,因为用户可以访问不同的位置。

这是数据,然后我将引导您完成下面的 Rextester link。

| user_id | year_month | homelocid | visitlocid | fact_count |
|---------|------------|-----------|------------|------------|
| 1234    | 201601     | L9900     | L9987      | 8          |
| 1234    | 201601     | L9900     | L4456      | 1          |
| 1234    | 201602     | L9900     | L9987      | 5          |
| 1234    | 201603     | L9900     | L9987      | 1          |
| 1234    | 201604     | L9900     | L9987      | 4          |
| 1234    | 201605     | L9900     | L9987      | 3          |
| 1234    | 201605     | L9900     | L4456      | 3          |
| 1234    | 201605     | L9900     | L5567      | 26         |
| 1234    | 201606     | L9900     | L4456      | 1          |
| 1234    | 201606     | L9900     | L9987      | 21         |
| 1234    | 201606     | L9900     | L6678      | 2          |
| 1234    | 201606     | L9900     | L5567      | 1          |
| 1234    | 201607     | L9900     | L9987      | 6          |
| 1234    | 201607     | L9900     | L5567      | 22         |
| 1234    | 201607     | L9900     | L6678      | 25         |
| 1234    | 201608     | NULL      | NULL       | NULL       |
| 1234    | 201609     | L9900     | L9987      | 1          |
| 1234    | 201609     | L9900     | L4456      | 3          |
| 1234    | 201610     | L9900     | L9987      | 1          |
| 1234    | 201611     | NULL      | NULL       | NULL       |
| 1234    | 201612     | NULL      | NULL       | NULL       |
| 1234    | 201701     | NULL      | NULL       | NULL       |
| 1234    | 201702     | L9900     | L9987      | 6          |
| 1234    | 201702     | L9900     | L2334      | 1          |
| 5556    | 201601     | L9900     | L3445      | 1          |
| 5556    | 201602     | L9900     | L3445      | 3          |
| 5556    | 201603     | L9900     | L3445      | 1          |
| 5556    | 201604     | L9900     | L3445      | 21         |
| 5556    | 201605     | L9900     | L3445      | 1          |
| 5556    | 201606     | L9900     | L3445      | 2          |
| 5556    | 201607     | NULL      | NULL       | NULL       |
| 5556    | 201608     | NULL      | NULL       | NULL       |
| 5556    | 201609     | L9900     | L3445      | 1          |
| 5556    | 201610     | NULL      | NULL       | NULL       |
| 5556    | 201611     | NULL      | NULL       | NULL       |
| 5556    | 201612     | L9900     | L3445      | 1          |
| 5556    | 201701     | NULL      | NULL       | NULL       |
| 5556    | 201702     | L9900     | L3445      | 1          |
| 9988    | 201601     | L9900     | L2334      | 27         |
| 9988    | 201602     | L9900     | L3445      | 22         |
| 9988    | 201602     | L9900     | L9987      | 1          |
| 9988    | 201602     | L9900     | L2334      | 2          |
| 9988    | 201603     | L9900     | L4556      | 1          |
| 9988    | 201603     | L9900     | L2334      | 2          |
| 9988    | 201603     | L9900     | L3445      | 21         |
| 9988    | 201603     | L9900     | L9987      | 1          |
| 9988    | 201604     | L9900     | L3445      | 25         |
| 9988    | 201604     | L9900     | L2334      | 1          |
| 9988    | 201604     | L9900     | L4556      | 1          |
| 9988    | 201605     | L9900     | L3445      | 26         |
| 9988    | 201605     | L9900     | L2334      | 1          |
| 9988    | 201606     | L9900     | L3445      | 22         |
| 9988    | 201607     | L9900     | L3445      | 5          |
| 9988    | 201608     | L9900     | L3445      | 6          |
| 9988    | 201609     | L9900     | L3445      | 11         |
| 9988    | 201610     | L9900     | L3445      | 17         |
| 9988    | 201611     | L9900     | L3445      | 9          |
| 9988    | 201612     | L9900     | L2334      | 1          |
| 9988    | 201612     | L9900     | L3445      | 14         |
| 9988    | 201701     | L9900     | L3445      | 14         |
| 9988    | 201701     | L9900     | L2334      | 2          |
| 9988    | 201702     | L9900     | L4556      | 1          |
| 9988    | 201702     | L9900     | L3445      | 9          |

我正在尝试计算用户拥有的次数:

  1. 1 个月内超过 20 fact_count
  2. 连续2个月fact_count大于20
  3. 超过 20 fact_count 连续 3 个月

我还需要从两个层面总结数据:

  1. 在 user_id 水平上(分母 = # 个用户,分子 = # 个连续超过 20 fact_count 1、2 和 3 个月的用户)
  2. 在 user_id AND 月级别(分母 = 每个用户的 # 个月 (14),分子 = # 个用户有 >20 fact_count 1、2 和 3 为 mos。一行)

这是我从数据集中寻找的期望结果(注意:不是查询输出,只是结果):

+-------------------+------------+------------------+
|                   | User level | User-month level |
+-------------------+------------+------------------+
| 1 month           | 3/3        | 10/42            |
+-------------------+------------+------------------+
| 2 months in a row | 2/3        | 4/42             |
+-------------------+------------+------------------+
| 3 months in a row | 2/3        | 3/42             |
+-------------------+------------+------------------+

这是我转换数据以正确汇总数据的方法,您可以在这个 Rextester link 中找到:http://rextester.com/DBF40287

  1. mem_month CTE 根据是否超过 fact_count >20,使用 CASE 创建 1 或 0 作为 'MET_NOTMET' 字段。此时有多行反映了不同的访问位置id。
  2. mem_month_rank CTE 通过对来自 mem_month 的 CASE 语句进行排名,使用 RANK 来查找用户的 fact_count 是否大于 20。如果用户在一个月内访问了三个位置,其中只有一个 fact_count >20,这将标识违规行。

这是这些查询的输出,越来越接近规范化 table:

| mem_month                        |   | mem_month_rank                         |
|--------|------------|------------|---|----------------------------------------|
| user_id| year_month | met_notmet |   | user_id| year_month | met_notmet | rnk |
| 1234   | 201601     | 0          |   | 1234   | 201601     | 0          | 1   |
| 1234   | 201601     | 0          |   | 1234   | 201602     | 0          | 1   |
| 1234   | 201602     | 0          |   | 1234   | 201603     | 0          | 1   |
| 1234   | 201603     | 0          |   | 1234   | 201604     | 0          | 1   |
| 1234   | 201604     | 0          |   | 1234   | 201605     | 0          | 1   |
| 1234   | 201605     | 0          |   | 1234   | 201605     | 1          | 2   |
| 1234   | 201605     | 1          |   | 1234   | 201606     | 0          | 1   |
| 1234   | 201606     | 0          |   | 1234   | 201606     | 1          | 2   |
| 1234   | 201606     | 1          |   | 1234   | 201607     | 0          | 1   |
| 1234   | 201606     | 0          |   | 1234   | 201607     | 1          | 2   |
| 1234   | 201607     | 1          |   | 1234   | 201608     | 0          | 1   |
| 1234   | 201607     | 0          |   | 1234   | 201609     | 0          | 1   |
| 1234   | 201607     | 1          |   | 1234   | 201610     | 0          | 1   |
| 1234   | 201608     | 0          |   | 1234   | 201611     | 0          | 1   |
| 1234   | 201609     | 0          |   | 1234   | 201612     | 0          | 1   |
| 1234   | 201609     | 0          |   | 1234   | 201701     | 0          | 1   |
| 1234   | 201610     | 0          |   | 1234   | 201702     | 0          | 1   |
| 1234   | 201611     | 0          |
| 1234   | 201612     | 0          |
| 1234   | 201701     | 0          |
| 1234   | 201702     | 0          |
  1. 接下来,mem_month_norm CTE 选择 mem_month_rank,并内部连接到用户、月份和 max(rank) 的子查询,创建一个规范化的 table每个用户和每个月 1 行,说明他们是否在该月对任何位置有 >20 fact_count 的违规行为。

(列表中断)

| user_id | year_month | met_notmet |
|---------|------------|------------|
| 1234    | 201601     | 0          |
| 1234    | 201602     | 0          |
| 1234    | 201603     | 0          |
| 1234    | 201604     | 0          |
| 1234    | 201605     | 1          |
| 1234    | 201606     | 1          |
| 1234    | 201607     | 1          |
| 1234    | 201608     | 0          |
| 1234    | 201609     | 0          |
| 1234    | 201610     | 0          |
| 1234    | 201611     | 0          |
| 1234    | 201612     | 0          |
| 1234    | 201701     | 0          |
| 1234    | 201702     | 0          |

好的 - 这就是我需要您帮助的地方:我不知道计算连续违规值出现次数的最佳方法。我认为需要发生的是当用户连续 2 个月有违规值时进行某种排序 'reset'。

例如,计算时:

  1. 连续 2 个月 - user_id 1234 应该被计算一次,因为在 201605 和 201606 上连续出现两个违规 MET_NOTMET 值。如果他在 201608 中有违规值,我假设这 'reset' 需要发生,这样他就会被再次计算为连续 2 个月(从 07 到 08)。
  2. 连续 3 个月 - user_id 1234 应该被计为三个 MET_NOTMET 违规值(从 201605 到 201608)。同样,如果他从 201605 到 201610 连续有 6 个违规值,我假设会有一个 'reset' 将每组 3 个违规值计为 1 个违规值。

同样,link 是:http://rextester.com/DBF40287

如果您需要任何说明或其他信息,请告诉我,我会修改 post。谢谢!

这是一个非常复杂的问题,我不确定我是否理解它,但这是我对这种情况的理解的初步想法。

此查询应给出汇总数据级别 1 的结果:

WITH mem_months AS (
    SELECT user_id, year_month, sum(fact_count) AS visits1
    FROM tbl
    GROUP BY 1,2
),
mem_consec AS (
    SELECT user_id, year_month, visits1, visits2, visits3,
        visits1 > 20 AS met1,
        -- need case on these to account for nulls in the lag
        CASE WHEN visits1 > 20 AND visits2 > 20 THEN TRUE ELSE FALSE END AS met2,
        CASE WHEN visits1 > 20 AND visits2 > 20 AND visits3 > 20 THEN TRUE ELSE FALSE END AS met3
    FROM (
        SELECT user_id, year_month, visits1,
            -- Seems like you might want lag instead of lead to the current "month".
            -- Where 3 in a row was met, the attributed month
            -- would be the 3rd in the sequence.
            LAG(visits1, 1) OVER (PARTITION BY user_id ORDER BY year_month) AS visits2,
            LAG(visits1, 2) OVER (PARTITION BY user_id ORDER BY year_month) AS visits3
        FROM mem_months
    ) t
),
mem_compliance AS (
    select user_id, bool_or(met1) as any1, bool_or(met2) as any2, bool_or(met3) as any3
    from mem_consec
    group by 1
)
select count(case when any1 then 1 else 0 end) / count(*) as user_level_1,
    count(case when any2 then 1 else 0 end) / count(*) as user_level_2,
    count(case when any3 then 1 else 0 end) / count(*) as user_level_3
from mem_compliance

并且,使用相同的 CTE,这将给出汇总数据级别 2:

select sum(case when met1 then 1 else 0 end) / count(distinct(year_month)) as user_month_level_1,
    sum(case when met2 then 1 else 0 end) / count(distinct(year_month)) as user_month_level_2,
    sum(case when met3 then 1 else 0 end) / count(distinct(year_month)) as user_month_level_3
from mem_consec

至此,我想我明白了您的实际问题...在这些 level_2 和 level_3 版本中,随着连续几个月的继续,计数会增加。对于每个 3-in-a-row,将计算两个 2-in-a-row。例如,连续四次你会得到两个 3 和三个 2。

这是一种获得非重叠总和的方法。必须是一种简化它的方法,但也许这是一个开始。这个想法是寻找 运行 的开始和结束月份,然后获取每个 运行 的长度并将其除以所需的集合大小(2 或 3)。未满足记录、中间满足记录和单个满足记录被丢弃。

mem_consec_sums AS (
    SELECT user_id, sum(mon2s) AS mon2s, sum(mon3s) AS mon3s
    FROM (
        SELECT user_id,
            ((end_idx - start_idx)/2)::INT AS mon2s,
            ((end_idx - start_idx)/3)::INT AS mon3s
        FROM (
            SELECT user_id, year_month, met_start, idx AS start_idx,
                LEAD(idx, 1) OVER (PARTITION BY user_id ORDER BY year_month) AS end_idx
            FROM (
                SELECT user_id, year_month, met1, idx,
                    CASE WHEN met1 AND pre_met IS NULL OR NOT pre_met) THEN TRUE ELSE FALSE END AS met_start,
                    CASE WHEN met1 AND post_met IS NULL OR NOT post_met) THEN TRUE ELSE FALSE END AS met_end
                FROM (
                    SELECT user_id, year_month, met1,
                        LAG(met1, 1) OVER (PARTITION BY user_id ORDER BY year_month) AS pre_met,
                        LEAD(met1, 1) OVER (PARTITION BY user_id ORDER BY year_month) AS post_met,
                        ROW_NUMBER() OVER (PARTITION BY tvid ORDER BY year_month) AS idx
                    FROM mem_consec
                )
            ) WHERE (met_start OR met_end) AND NOT (met_start AND met_end)
        ) WHERE met_start
    ) GROUP BY 1
)