Filtering a Query based on a Date and Window function in Snowflake
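I have been asked to pull information about three different types of clients from the last year (visited once, visited fewer than 10 times, and visited 10 or more times) to see how their likelihood of returning compares against a few different factors.
For this reason I have created a pretty wide query. Currently I have a query joining three tables: client information, visit information and staff information. I have created a calculated column in the select statement: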
COUNT(DISTINCT visitno) OVER(PARTITION BY clientid) as totalvisits
Now I just need to group by total visits and filter by visit date.
I tried:
where visitdate> 01/01/2021
group by totalvisits
having total visits<10
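But I get an error saying that visitno is not a valid group by expression.
What might I be doing wrong?
In Snowflake, you can use the QUALIFY clause to filter on the result of a window function after the window aggregation has been computed. The error arises because window functions are evaluated after GROUP BY, so visitno inside the window function would itself need to be a grouped or aggregated expression.
So the query would look like this: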
SELECT
    clientid,
    COUNT(DISTINCT visitno) OVER (PARTITION BY clientid) as totalvisits
FROM <your_table>
WHERE visitdate >= '2021-01-01'::date
  AND visitdate < '2022-01-01'::date
QUALIFY totalvisits < 10;
*Make sure visitdate is already of a date type beforehand!
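To see what QUALIFY is doing, it may help to compare it with the more verbose form it replaces: compute the window function in a subquery, then filter in the outer WHERE. This is only a sketch under the same assumptions as the query above (the `<your_table>` placeholder and the column names come from the question; `visit_counts` is a made-up alias), and it should return the same rows as the QUALIFY version.

```sql
-- Sketch: the QUALIFY query rewritten with a subquery.
-- The inner WHERE runs before the window function, so totalvisits
-- only counts visits that fall inside 2021.
SELECT clientid, totalvisits
FROM (
    SELECT
        clientid,
        COUNT(DISTINCT visitno) OVER (PARTITION BY clientid) AS totalvisits
    FROM <your_table>
    WHERE visitdate >= '2021-01-01'::date
      AND visitdate <  '2022-01-01'::date
) AS visit_counts
-- The outer WHERE plays the role of QUALIFY.
WHERE totalvisits < 10;
```

Either form still returns one row per visit row, with the client's total repeated on each of them.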
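[In reference to the comment below]: if you want to see the all-time total visits alongside the total visits for a given year, you could do the following: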
SELECT
    clientid,
    YEAR(visitdate) as visit_date_year,
    COUNT(DISTINCT visitno) OVER (PARTITION BY clientid) as totalvisits,
    COUNT(DISTINCT visitno) OVER (PARTITION BY clientid, YEAR(visitdate)) as total_visits_by_year
FROM <your_table>
QUALIFY total_visits_by_year < 10;
OK, so let's make some fake data and do the count:
WITH fake_data(client_id, visit_date) as (
    SELECT * FROM VALUES
        -- this person has visited once
        (1, '2022-04-14'::date),
        -- this person has visited 3 times in the year
        (3, '2022-04-13'::date),
        (3, '2022-03-13'::date),
        (3, '2022-02-13'::date),
        -- this person is a heavy visitor, but 1 visit falls outside the last year
        (5, '2022-04-12'::date),
        (5, '2022-03-12'::date),
        (5, '2022-02-12'::date),
        (5, '2022-01-12'::date),
        (5, '2020-02-12'::date)
)
SELECT *,
    count(distinct visit_date) over (partition by client_id) as total_visits
FROM fake_data
WHERE visit_date >= dateadd('year', -1, '2022-04-14' /* CURRENT_DATE */)
which gives:
CLIENT_ID | VISIT_DATE | TOTAL_VISITS |
---|---|---|
1 | 2022-04-14 | 1 |
3 | 2022-04-13 | 3 |
3 | 2022-03-13 | 3 |
3 | 2022-02-13 | 3 |
5 | 2022-04-12 | 4 |
5 | 2022-03-12 | 4 |
5 | 2022-02-12 | 4 |
Now turn those into your groups/categories:
SELECT *,
    count(distinct visit_date) over (partition by client_id) as total_visits,
    case
        when total_visits = 1 then 1
        when total_visits <= 3 then 2
        when total_visits > 3 then 3
    end as group_id
FROM fake_data
WHERE visit_date >= dateadd('year', -1, '2022-04-14' /* CURRENT_DATE */)
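The thresholds above (1, up to 3, more than 3) are sized for the tiny fake data set. For the buckets in the question (once, fewer than 10, 10 or more) the same CASE pattern should carry over. A minimal sketch, where the label strings and the `visit_group` alias are made up for illustration and the trimmed fake data is only there to keep the statement self-contained:

```sql
WITH fake_data(client_id, visit_date) AS (
    SELECT * FROM VALUES
        (1, '2022-04-14'::date),
        (3, '2022-04-13'::date),
        (3, '2022-03-13'::date),
        (3, '2022-02-13'::date)
)
SELECT *,
    count(distinct visit_date) over (partition by client_id) as total_visits,
    -- CASE stops at the first matching branch, so the second branch
    -- only sees clients with 2 to 9 visits.
    case
        when total_visits = 1  then 'visited once'
        when total_visits < 10 then 'visited fewer than 10 times'
        else 'visited 10 or more times'
    end as visit_group
FROM fake_data
WHERE visit_date >= dateadd('year', -1, '2022-04-14'::date);
```

Referencing `total_visits` inside the CASE in the same SELECT relies on Snowflake allowing a column alias to be reused later in the select list; on other engines the CASE would need to move to an outer query.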
Now for some math. I'll wrap that in a sub-select (and also push a few more things into it):
WITH fake_data(client_id, visit_date) as (
    SELECT * FROM VALUES
        -- this person has visited once
        (1, '2022-04-14'::date),
        -- this person has visited 3 times in the year
        (3, '2022-04-13'::date),
        (3, '2022-04-11'::date),
        (3, '2022-04-09'::date),
        -- this person is a heavy visitor, but 1 visit falls outside the last year
        (5, '2022-04-12'::date),
        (5, '2022-03-12'::date),
        (5, '2022-02-12'::date),
        (5, '2022-01-12'::date),
        (5, '2020-02-12'::date)
)
SELECT group_id
    ,count(distinct client_id) as count_of_group_members
    ,sum(total_visits) as sum_of_group_visit
    ,avg(visit_gap_in_days) as avg_group_day_diff
    ,stddev(visit_gap_in_days) as stddev_group_day_diff
FROM (
    SELECT *,
        count(distinct visit_date) over (partition by client_id) as total_visits,
        case
            when total_visits = 1 then 1
            when total_visits <= 3 then 2
            when total_visits > 3 then 3
        end as group_id,
        -- previous visit by the same client, NULL for their first visit in the window
        lag(visit_date) over (partition by client_id order by visit_date) as prior_visit_date,
        datediff('day', prior_visit_date, visit_date) as visit_gap_in_days
    FROM fake_data
    WHERE visit_date >= dateadd('year', -1, '2022-04-14' /* CURRENT_DATE */)
)
GROUP BY 1
ORDER BY 1
GROUP_ID | COUNT_OF_GROUP_MEMBERS | SUM_OF_GROUP_VISIT | AVG_GROUP_DAY_DIFF | STDDEV_GROUP_DAY_DIFF |
---|---|---|---|---|
1 | 1 | 1 | | |
2 | 1 | 9 | 2 | 0 |
3 | 1 | 16 | 30 | 1.732050808 |
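Wowzers, that visit total is wrong: I have summed the sums.
So here, given count(distinct visitno), I cannot sum it, as that becomes a sum of sums (each client's total is repeated on every one of their rows, so 3 visits become 9 and 4 become 16), and I cannot do a count(*) because we just noted there are duplicates (otherwise the distinct would not be needed). And I assume you are not dropping rows because there are some "other details you want".
But anyway. This is the great thing about SQL: you can answer any question, but you have to know the question and the data, so that you know which assumptions hold true for your data.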
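One way around the sum-of-sums problem, assuming a one-row-per-client summary is acceptable for this part of the analysis, is to collapse to one row per client first and only then aggregate per group. This is a sketch against the same fake data, not the only possible fix; `per_client` is a hypothetical CTE name.

```sql
WITH fake_data(client_id, visit_date) AS (
    SELECT * FROM VALUES
        (1, '2022-04-14'::date),
        (3, '2022-04-13'::date),
        (3, '2022-04-11'::date),
        (3, '2022-04-09'::date),
        (5, '2022-04-12'::date),
        (5, '2022-03-12'::date),
        (5, '2022-02-12'::date),
        (5, '2022-01-12'::date),
        (5, '2020-02-12'::date)
), per_client AS (
    -- One row per client: distinct visits within the last year.
    SELECT client_id,
        count(distinct visit_date) as total_visits
    FROM fake_data
    WHERE visit_date >= dateadd('year', -1, '2022-04-14'::date)
    GROUP BY client_id
)
SELECT
    case
        when total_visits = 1 then 1
        when total_visits <= 3 then 2
        else 3
    end as group_id,
    count(*) as count_of_group_members,
    -- Each client now contributes their total exactly once.
    sum(total_visits) as sum_of_group_visit
FROM per_client
GROUP BY 1
ORDER BY 1;
```

With this data the per-group visit totals should come out as 1, 3 and 4 rather than 1, 9 and 16. The day-gap statistics still need the row-level query, since they are computed per visit rather than per client.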