Filtering a Query based on a Date and Window function in Snowflake

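I've been asked to pull information on three different types of clients over the last year (visited once, visited fewer than 10 times, and visited 10 or more times) to see how their chances of returning compare against a few different factors.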

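For that reason I've built a fairly broad query. At the moment it joins three tables: client information, visit information, and staff information. I've created a calculated column in the select statement: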

COUNT(DISTINCT visitno) OVER(PARTITION BY clientid) as totalvisits

Now I just need to group by total visits and filter by visit date.

I tried:

where visitdate> 01/01/2021
group by totalvisits
having total visits<10

But I get an error saying that visitno is not a valid group by expression.

What am I doing wrong?

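In Snowflake, you can use the QUALIFY clause to filter on the result of a window function, after the window aggregation has been computed.

So the query would look like this: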

SELECT
  clientid,
  COUNT(DISTINCT visitno) OVER(PARTITION BY clientid) as totalvisits
FROM <your_table>
WHERE visitdate >= '2021-01-01'::date
  AND visitdate < '2022-01-01'::date
QUALIFY totalvisits < 10;

*Make sure visitdate is already a date type beforehand!
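
For comparison, QUALIFY does for window functions roughly what HAVING does for aggregates. A minimal sketch of an equivalent query without QUALIFY, assuming the same placeholder table and column names as above (the subquery alias t is my own):

```sql
-- Compute the window count in a subquery, then filter on it in the outer WHERE.
SELECT clientid, totalvisits
FROM (
    SELECT
        clientid,
        COUNT(DISTINCT visitno) OVER (PARTITION BY clientid) AS totalvisits
    FROM <your_table>
    WHERE visitdate >= '2021-01-01'::date
      AND visitdate <  '2022-01-01'::date
) AS t
WHERE totalvisits < 10;
```

The inner WHERE runs before the window function, so totalvisits only counts visits inside the date range; the outer WHERE then plays the role of QUALIFY.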

[Re: the comment below]: if you want to see the all-time total visits alongside the total visits for a given year, you could do the following:

SELECT
  clientid,
  YEAR(visitdate) as visit_date_year,
  COUNT(DISTINCT visitno) OVER (PARTITION BY clientid) as totalvisits,
  COUNT(DISTINCT visitno) OVER (PARTITION BY clientid, YEAR(visitdate)) as total_visits_by_year
FROM <your_table>
QUALIFY total_visits_by_year < 10;

OK, let's make some fake data and do the count:

WITH fake_data(client_id, visit_date) as (
    SELECT * FROM VALUES
    -- this person has visited once
    (1, '2022-04-14'::date),
    -- this person has visited 3 times in the year
    (3, '2022-04-13'::date),
    (3, '2022-03-13'::date),
    (3, '2022-02-13'::date),
    -- this person is a heavy visitor, but 1 visit falls outside the last year.
    (5, '2022-04-12'::date),
    (5, '2022-03-12'::date),
    (5, '2022-02-12'::date),
    (5, '2022-01-12'::date),
    (5, '2020-02-12'::date)
)
SELECT *,
    count(distinct visit_date) over (partition by client_id) as total_visits
FROM fake_data
WHERE visit_date >= dateadd('year', -1, '2022-04-14' /* CURRENT_DATE */)

Boom:

CLIENT_ID VISIT_DATE TOTAL_VISITS
1 2022-04-14 1
3 2022-04-13 3
3 2022-03-13 3
3 2022-02-13 3
5 2022-04-12 4
5 2022-03-12 4
5 2022-02-12 4
5 2022-01-12 4

Now bucket them into your groups/categories:

SELECT *,
    count(distinct visit_date) over (partition by client_id) as total_visits,
    case 
        when total_visits = 1 then 1
        when total_visits <= 3 then 2
        when total_visits > 3 then 3
    end as group_id
FROM fake_data
WHERE visit_date >= dateadd('year', -1, '2022-04-14' /* CURRENT_DATE */)
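
The thresholds above are sized for the fake data. As a sketch of how the same CASE could map onto the three categories from the question (once, fewer than 10, 10 or more), assuming the question's column names and a placeholder table, with client_type as a label name of my own:

```sql
SELECT
    clientid,
    COUNT(DISTINCT visitno) OVER (PARTITION BY clientid) AS totalvisits,
    -- Branches are checked in order, so the second one only sees 2..9.
    CASE
        WHEN totalvisits = 1 THEN 'visited once'
        WHEN totalvisits < 10 THEN 'under 10 visits'
        ELSE '10 or more visits'
    END AS client_type
FROM <your_table>
WHERE visitdate >= DATEADD('year', -1, CURRENT_DATE);
```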

Now for some math. I'll wrap that in a sub-select (and push a few more things into it as well):

WITH fake_data(client_id, visit_date) as (
    SELECT * FROM VALUES
    -- this person has visited once
    (1, '2022-04-14'::date),
    -- this person has visited 3 times in the year
    (3, '2022-04-13'::date),
    (3, '2022-04-11'::date),
    (3, '2022-04-09'::date),
    -- this person is a heavy visitor, but 1 visit falls outside the last year.
    (5, '2022-04-12'::date),
    (5, '2022-03-12'::date),
    (5, '2022-02-12'::date),
    (5, '2022-01-12'::date),
    (5, '2020-02-12'::date)
)
SELECT group_id
    ,count(distinct client_id) as count_of_group_members
    ,sum(total_visits) as sum_of_group_visit
    ,avg(visit_gap_in_days) as avg_group_day_diff
    ,stddev(visit_gap_in_days) as stddev_group_day_diff
FROM (
SELECT *,
    count(distinct visit_date) over (partition by client_id) as total_visits,
    case 
        when total_visits = 1 then 1
        when total_visits <= 3 then 2
        when total_visits > 3 then 3
    end as group_id,
    lag(visit_date) over (partition by client_id order by visit_date) as prior_visit_date,
    datediff('day', prior_visit_date, visit_date) as visit_gap_in_days
FROM fake_data
WHERE visit_date >= dateadd('year', -1, '2022-04-14' /* CURRENT_DATE */)
)
GROUP BY 1
ORDER BY 1

GROUP_ID COUNT_OF_GROUP_MEMBERS SUM_OF_GROUP_VISIT AVG_GROUP_DAY_DIFF STDDEV_GROUP_DAY_DIFF
1 1 1 null null
2 1 9 2 0
3 1 16 30 1.732050808

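Wowzers, that visit sum is wrong: total_visits is already a per-client total repeated on every row, so I've summed a sum (3 rows of 3 gives 9, 4 rows of 4 gives 16).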

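So here, given count(distinct visitno), I can't sum it, because that becomes a sum of totals, and I can't use count(*) because we just noted there are duplicates (otherwise the distinct would not be needed). And I'm assuming you haven't deduplicated the rows because there are some "other details you need".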
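
One possible way around this, as a sketch rather than a definitive fix: collapse the data to one row per client first, then group those rows, so each client's total is summed exactly once. It reuses the fake data above; the per_client CTE name and the sum_of_group_visits alias are my own, and it assumes you only need the per-group counts here, not the other row-level details.

```sql
WITH fake_data(client_id, visit_date) AS (
    SELECT * FROM VALUES
    (1, '2022-04-14'::date),
    (3, '2022-04-13'::date),
    (3, '2022-04-11'::date),
    (3, '2022-04-09'::date),
    (5, '2022-04-12'::date),
    (5, '2022-03-12'::date),
    (5, '2022-02-12'::date),
    (5, '2022-01-12'::date),
    (5, '2020-02-12'::date)
),
-- One row per client: distinct visits within the last year.
per_client AS (
    SELECT
        client_id,
        COUNT(DISTINCT visit_date) AS total_visits
    FROM fake_data
    WHERE visit_date >= DATEADD('year', -1, '2022-04-14'::date /* CURRENT_DATE */)
    GROUP BY client_id
)
SELECT
    CASE
        WHEN total_visits = 1 THEN 1
        WHEN total_visits <= 3 THEN 2
        ELSE 3
    END AS group_id,
    COUNT(*) AS count_of_group_members,
    -- Each client contributes its total once, so this is a true visit count.
    SUM(total_visits) AS sum_of_group_visits
FROM per_client
GROUP BY 1
ORDER BY 1;
```

On the fake data this should give 1, 3 and 4 visits for groups 1, 2 and 3 respectively, instead of 1, 9 and 16.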

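But anyway. This is the great thing about SQL: you can answer any question, but you have to know the question and the data, so that you know which assumptions hold for your data.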