SQL count occurrences in window
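I have user logins by date. My requirement is to track the number of users that logged in during the past 90-day window.
I am new to both SQL in general and Teradata specifically, and I can't get the window functionality to work the way I need.
I need the following result, where ACTIVE is a count of the unique USER_IDs that appear in the 90-day window preceding each date.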
DATES ACTIVE_IN_WINDOW
12/06/2018 20
13/06/2018 45
14/06/2018 65
15/06/2018 73
17/06/2018 24
18/06/2018 87
19/06/2018 34
20/06/2018 51
Currently my script is as follows.
This is the line I can't get right:
COUNT ( USER_ID) OVER (PARTITION BY USER_ID ORDER BY EVT_DT ROWS BETWEEN 90 PRECEDING AND 0 FOLLOWING)
I suspect I need a different set of functions to make this work.
SELECT b.DATES , a.ACTIVE_IN_WINDOW
FROM
(
SELECT
CAST(CALENDAR_DATE AS DATE) AS DATES FROM SYS_CALENDAR.CALENDAR
WHERE DATES BETWEEN ADD_MONTHS(CURRENT_DATE, - 10) AND CURRENT_DATE
) b
LEFT JOIN
(
SELECT USER_ID , EVT_DT
, COUNT ( USER_ID) OVER (PARTITION BY USER_ID ORDER BY EVT_DT ROWS BETWEEN 90 PRECEDING AND 0 FOLLOWING) AS ACTIVE_IN_WINDOW
FROM ENV0.R_ONBOARDING
) a
ON a.EVT_DT = b.DATES
ORDER BY b.DATES
Thanks for your help.
If your data is not too large, a subquery is probably the simplest method:
SELECT c.dte,
(SELECT COUNT(DISTINCT o.USER_ID)
FROM ENV0.R_ONBOARDING o
WHERE o.EVT_DT > ADD_MONTHS(dte, -3) AND
o.EVT_DT <= dte
) as three_month_count
FROM (SELECT CAST(CALENDAR_DATE AS DATE) AS dte
FROM SYS_CALENDAR.CALENDAR
WHERE CALENDAR_DATE BETWEEN ADD_MONTHS(CURRENT_DATE, - 10) AND CURRENT_DATE
) c;
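You may want to start with a time frame shorter than 3 months to see how the query performs.
The logic is similar to Gordon's, but a non-equi join instead of the correlated scalar subquery is usually more efficient on Teradata: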
SELECT b.DATES , Count(DISTINCT USER_ID)
FROM
(
SELECT CALENDAR_DATE AS DATES
FROM SYS_CALENDAR.CALENDAR
WHERE DATES BETWEEN Add_Months(Current_Date, - 10) AND Current_Date
) b
LEFT JOIN
( -- apply DISTINCT before aggregation to reduce intermediate spool
SELECT DISTINCT USER_ID, EVT_DT
FROM ENV0.R_ONBOARDING
) AS a
ON a.EVT_DT BETWEEN Add_Months(b.DATES,-3) AND b.DATES
GROUP BY 1
ORDER BY 1
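Of course this will still require a large spool and a lot of CPU.
Edit:
Switching to weeks reduces the overhead. I'm using dates instead of week numbers (easier to modify for other ranges):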
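To sanity-check the join logic on a small sample before running it against the real table, here is a minimal sketch using Python's built-in sqlite3 as a stand-in for Teradata. The table names `logins` and `cal`, the sample rows, and the use of SQLite's `date(..., '-90 days')` in place of `Add_Months` are all assumptions made for illustration; none of this is Teradata syntax.

```python
import sqlite3

# In-memory SQLite database: a stand-in only, not Teradata.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE logins (user_id INTEGER, evt_dt TEXT);  -- hypothetical stand-in for ENV0.R_ONBOARDING
CREATE TABLE cal (dte TEXT);                         -- hypothetical stand-in for the calendar table
INSERT INTO logins VALUES
  (1, '2018-01-10'), (1, '2018-03-01'), (2, '2018-03-01'), (3, '2018-05-20');
INSERT INTO cal VALUES ('2018-03-01'), ('2018-04-15'), ('2018-06-15'), ('2018-12-01');
""")

# Non-equi join: each calendar date is matched with every login that falls
# in the 90 days up to and including that date, then distinct users are counted.
rows = conn.execute("""
SELECT c.dte, COUNT(DISTINCT l.user_id) AS active_in_window
FROM cal c
LEFT JOIN (SELECT DISTINCT user_id, evt_dt FROM logins) l
  ON l.evt_dt BETWEEN date(c.dte, '-90 days') AND c.dte
GROUP BY c.dte
ORDER BY c.dte
""").fetchall()

print(rows)
# [('2018-03-01', 2), ('2018-04-15', 2), ('2018-06-15', 1), ('2018-12-01', 0)]
```

User 1 logs in twice inside the first window but is counted once, and the last date has no logins in range, so the LEFT JOIN keeps it with a count of 0.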
SELECT b.Week , Count(DISTINCT USER_ID)
FROM
( -- Return only Mondays instead of DISTINCT over all days
SELECT calendar_date AS Week
FROM SYS_CALENDAR.CALENDAR
WHERE CALENDAR_DATE BETWEEN Add_Months(Current_Date, -9) AND Current_Date
AND day_of_week = 2 -- 2 = Monday
) b
LEFT JOIN
(
SELECT DISTINCT USER_ID,
-- td_monday returns the previous Monday, but we need the following monday
-- covers the previous Tuesday up to the current Monday
Td_Monday(EVT_DT+6) AS PERIOD_WEEK
FROM ENV0.R_ONBOARDING
-- You should add another condition to limit the actually covered date range, e.g.
-- WHERE EVT_DT BETWEEN Add_Months(Current_Date, -13) AND Current_Date
) AS a
ON a.PERIOD_WEEK BETWEEN b.Week-(12*7) AND b.Week
GROUP BY 1
ORDER BY 1
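The Explain should show the calendar being duplicated as preparation for a product join; otherwise you might need to materialize the dates in a Volatile Table. It's better not to use sys_calendar: there are no statistics on it, so the optimizer doesn't know, for example, how many days there are per week/month/year. Check your system; there should be a calendar table designed for your company's needs (with statistics on all columns).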
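The `Td_Monday(EVT_DT+6)` bucketing in the weekly query is easy to get wrong by a day, so here is a small Python sketch of the same arithmetic for checking sample dates outside the database. It assumes `Td_Monday` returns the Monday on or before its argument, which is what the comments in the query imply; `period_monday` is a hypothetical helper name, not a Teradata function.

```python
from datetime import date, timedelta

def period_monday(evt_dt: date) -> date:
    """Monday on or after evt_dt, mirroring Td_Monday(EVT_DT + 6) under the stated assumption."""
    shifted = evt_dt + timedelta(days=6)
    # date.weekday() is 0 for Monday, so subtracting it steps back to the Monday of that week.
    return shifted - timedelta(days=shifted.weekday())

# 2018-06-12 is a Tuesday: it belongs to the period ending Monday 2018-06-18.
print(period_monday(date(2018, 6, 12)))  # 2018-06-18
# 2018-06-11 is a Monday: it closes its own period.
print(period_monday(date(2018, 6, 11)))  # 2018-06-11
```

Every login from a Tuesday through the following Monday maps to that Monday, so the 12-week range in the join condition lines up with the Mondays returned from the calendar.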