Moving average last 30 days
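I want to find the number of unique users who were active in the last 30 days. I want to calculate this for today, but also for days in the past. The dataset is stored in BigQuery and contains user IDs, dates, and the events triggered by each user. A user counts as active when they open the mobile app, which triggers the event session_start. Example of the unnested dataset: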
| resettable_device_id | date | event |
------------------------------------------------------
| xx | 2017-06-09 | session_start |
| yy | 2017-06-09 | session_start |
| xx | 2017-06-11 | session_start |
| zz | 2017-06-11 | session_start |
I found a solution that seemed to fit my problem.
My BigQuery script so far:
#standardSQL
WITH daily_aggregation AS (
SELECT
PARSE_DATE("%Y%m%d", event_dim.date) AS day,
COUNT(DISTINCT user_dim.device_info.resettable_device_id) AS unique_resettable_device_ids
FROM `ANDROID.app_events_*`,
UNNEST(event_dim) AS event_dim
WHERE event_dim.name = "session_start"
GROUP BY day
)
SELECT
day,
unique_resettable_device_ids,
SUM(unique_resettable_device_ids)
OVER(ORDER BY UNIX_SECONDS(TIMESTAMP(day)) DESC ROWS BETWEEN 2592000 PRECEDING AND CURRENT ROW) AS unique_ids_rolling_30_days
FROM daily_aggregation
ORDER BY day
This script produces the following result table:
| day | unique_resettable_device_ids | unique_ids_rolling_30_days |
------------------------------------------------------------------------
| 2018-06-05 | 1807 | 2614 |
| 2018-06-06 | 711 | 807 |
| 2018-06-07 | 96 | 96 |
The problem is that the column unique_ids_rolling_30_days is just a cumulative sum of the column unique_resettable_device_ids. How can I fix the rolling window function in my script?
"The problem is that the column unique_ids_rolling_30_days is just a cumulative sum of the column unique_resettable_device_ids."
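Of course, because that is exactly what the code
SUM(unique_resettable_device_ids)
OVER(ORDER BY UNIX_SECONDS(TIMESTAMP(day)) DESC ROWS BETWEEN 2592000 PRECEDING AND CURRENT ROW) AS unique_ids_rolling_30_days
is asking for. A ROWS frame counts rows, not seconds, so with one row per day "2592000 PRECEDING" reaches back over every earlier row in the (descending) ordering, and SUM adds the daily counts together instead of counting distinct IDs.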
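To illustrate the difference between a row-based and a value-based frame, here is a minimal sketch that swaps ROWS for RANGE over UNIX_DATE(day). The inline daily_aggregation rows are hypothetical stand-ins (the three June rows reuse the counts from the result table above, the May row is made up), and the output column name is an assumption. Note that this only bounds the window to 30 days; it still sums daily counts, so a device active on several days is counted once per day rather than once per window.

```sql
#standardSQL
-- Hypothetical inline data standing in for the daily_aggregation CTE.
WITH daily_aggregation AS (
  SELECT DATE '2018-05-01' AS day, 500 AS unique_resettable_device_ids UNION ALL
  SELECT DATE '2018-06-05', 1807 UNION ALL
  SELECT DATE '2018-06-06', 711 UNION ALL
  SELECT DATE '2018-06-07', 96
)
SELECT
  day,
  unique_resettable_device_ids,
  -- RANGE compares the ORDER BY values themselves (days since 1970-01-01),
  -- so the frame is the current day plus the 29 calendar days before it.
  -- The 2018-05-01 row is more than 29 days before the June rows and
  -- therefore drops out of their frames.
  SUM(unique_resettable_device_ids) OVER (
    ORDER BY UNIX_DATE(day)
    RANGE BETWEEN 29 PRECEDING AND CURRENT ROW
  ) AS device_days_rolling_30_days
FROM daily_aggregation
ORDER BY day
```

Because the result is a sum of per-day uniques, it is an upper bound on the true 30-day unique count, not the count itself.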
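Check the linked question, which asks specifically about counting unique values in a rolling window: it turns out to be a very slow operation, given how much memory it requires.
The solution when you want a rolling count of uniques: go for approximate results.
From the linked answer: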
#standardSQL
SELECT DATE_SUB(date, INTERVAL i DAY) date_grp
, HLL_COUNT.MERGE(sketch) unique_90_day_users
, HLL_COUNT.MERGE(DISTINCT IF(i<31,sketch,null)) unique_30_day_users
, HLL_COUNT.MERGE(DISTINCT IF(i<8,sketch,null)) unique_7_day_users
, COUNT(*) window_days
FROM (
SELECT DATE(creation_date) date, HLL_COUNT.INIT(owner_user_id) sketch
FROM `bigquery-public-data.stackoverflow.posts_questions`
WHERE EXTRACT(YEAR FROM creation_date)=2017
GROUP BY 1
), UNNEST(GENERATE_ARRAY(1, 90)) i
GROUP BY 1
HAVING window_days=90
ORDER BY date_grp
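The same approximate idea could be adapted to a 30-day window over data shaped like the question's unnested sample. The following is only a sketch under assumptions: the inline events table and the output column name are hypothetical, and the self-join on daily sketches is a swapped-in alternative to the UNNEST(GENERATE_ARRAY(...)) trick used in the query above, not the linked answer's method.

```sql
#standardSQL
-- Hypothetical inline data mirroring the unnested sample in the question.
WITH events AS (
  SELECT 'xx' AS resettable_device_id, DATE '2017-06-09' AS day UNION ALL
  SELECT 'yy', DATE '2017-06-09' UNION ALL
  SELECT 'xx', DATE '2017-06-11' UNION ALL
  SELECT 'zz', DATE '2017-06-11'
),
daily_sketches AS (
  -- One HyperLogLog++ sketch per day summarising that day's device IDs.
  SELECT day, HLL_COUNT.INIT(resettable_device_id) AS sketch
  FROM events
  GROUP BY day
)
SELECT
  d.day AS day,
  -- Merging the sketches of the window estimates the number of distinct
  -- IDs across all of its days without re-reading the raw events.
  HLL_COUNT.MERGE(s.sketch) AS approx_unique_ids_rolling_30_days
FROM daily_sketches AS d
JOIN daily_sketches AS s
  -- BETWEEN is inclusive: 29 days back plus the day itself is 30 days.
  ON s.day BETWEEN DATE_SUB(d.day, INTERVAL 29 DAY) AND d.day
GROUP BY d.day
ORDER BY day
```

The join touches at most 30 small sketch rows per output day, which is what keeps this cheaper than an exact distinct count over the raw events.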
Working solution that calculates, once per week, the number of users active in the last 30 days.
#standardSQL
WITH days AS (
SELECT day
FROM UNNEST(GENERATE_DATE_ARRAY('2018-01-01', CURRENT_DATE(), INTERVAL 1 WEEK)) AS day
), periods AS (
SELECT
DATE_SUB(days.day, INTERVAL 29 DAY) AS StartDate, -- BETWEEN is inclusive, so 29 days back plus EndDate covers 30 days
days.day AS EndDate FROM days
)
SELECT
periods.EndDate AS Day,
COUNT(DISTINCT user_dim.device_info.resettable_device_id) as resettable_device_ids
FROM `ANDROID.app_events_*`,
UNNEST(event_dim) AS event_dim
CROSS JOIN periods
WHERE
PARSE_DATE("%Y%m%d", event_dim.date) BETWEEN periods.StartDate AND periods.EndDate
AND event_dim.name = "session_start"
GROUP BY Day
ORDER BY Day DESC
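For an exact count on every active day rather than once per week, the same join-on-a-date-range idea can be sketched against the unnested sample from the question. The inline events table is a hypothetical stand-in for the real `ANDROID.app_events_*` data, and taking the reporting days from the events themselves is an assumption made for brevity.

```sql
#standardSQL
-- Hypothetical inline data mirroring the unnested sample in the question.
WITH events AS (
  SELECT 'xx' AS resettable_device_id, DATE '2017-06-09' AS day UNION ALL
  SELECT 'yy', DATE '2017-06-09' UNION ALL
  SELECT 'xx', DATE '2017-06-11' UNION ALL
  SELECT 'zz', DATE '2017-06-11'
),
days AS (
  -- Report one row per day that has at least one event.
  SELECT DISTINCT day FROM events
)
SELECT
  d.day AS day,
  -- Exact distinct count over all events that fall inside the window.
  COUNT(DISTINCT e.resettable_device_id) AS unique_ids_rolling_30_days
FROM days AS d
JOIN events AS e
  -- BETWEEN is inclusive: 29 days back plus the day itself is 30 days.
  ON e.day BETWEEN DATE_SUB(d.day, INTERVAL 29 DAY) AND d.day
GROUP BY d.day
ORDER BY day
```

On the sample rows this yields 2 for 2017-06-09 (xx, yy) and 3 for 2017-06-11 (xx, yy, zz). On a large event table the join multiplies each event by up to 30 reporting days, which is the cost the approximate approach avoids.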