Day N retention in BigQuery, error message: Invalid time zone
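I am trying to calculate Day N retention for a dataset in Google BigQuery. The table holds one month of data from a mobile app, and I want to know how many users come back each day. I am using Standard SQL. My code so far is: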
SELECT date(d1.eventDate) as dt,
COUNT(distinct d1.userID) as total_users,
COUNT(distinct d2.userID) as retained_users
FROM `dataset` as d1
LEFT JOIN `dataset` as d2 ON
d1.userID = d2.userID
AND date(d1.eventDate) = date(datetime(d2.eventDate, '-1 day'))
GROUP BY 1
ORDER BY 1
When I try to run it, I get this error message:
Error: Invalid time zone: -1 day [invalidQuery]
My table structure is:
eventDate | UserID |
2016-05-06 00:00:00 UTC | 100000 |
2016-05-06 00:00:00 UTC | 200000 |
2016-05-06 00:00:00 UTC | 300000 |
What should I use instead of '-1 day'?
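TIMESTAMP_SUB fixes the query as written, although it may not perform well enough to be a final solution. At least it lets you shift a timestamp by one day. Note that the interval in the query below is negative, so the expression adds 24 hours to d2.created_at, which matches each d1 row against the same user's activity on the previous day: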
SELECT date(d1.created_at) as dt,
COUNT(distinct d1.actor.id) as total_users,
COUNT(distinct d2.actor.id) as retained_users
FROM `githubarchive.month.201810` as d1
LEFT JOIN `githubarchive.month.201810` as d2 ON
d1.actor.id = d2.actor.id
AND date(d1.created_at) = date(TIMESTAMP_SUB(d2.created_at, INTERVAL -24 HOUR))
GROUP BY 1
ORDER BY 1
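For the table in the question, the same fix could be sketched as follows. This is only an illustration: `project.dataset.table` is a placeholder name, and the interval here is positive, on the assumption that the original '-1 day' was meant to pair each day with the same user's activity on the following day.

```sql
#standardSQL
-- Sketch only: `project.dataset.table` is a placeholder for the real table.
SELECT DATE(d1.eventDate) AS dt,
  COUNT(DISTINCT d1.userID) AS total_users,
  -- d2 rows exist only for users who were also active one day after dt
  COUNT(DISTINCT d2.userID) AS retained_users
FROM `project.dataset.table` AS d1
LEFT JOIN `project.dataset.table` AS d2 ON
  d1.userID = d2.userID
  -- Assumption: '-1 day' meant "d2's date minus one day equals d1's date"
  AND DATE(d1.eventDate) = DATE(TIMESTAMP_SUB(d2.eventDate, INTERVAL 1 DAY))
GROUP BY 1
ORDER BY 1
```

TIMESTAMP_SUB accepts DAY as a date part and treats it as 24 hours, so INTERVAL 1 DAY and INTERVAL 24 HOUR behave the same here. Changing the 1 to another number should give the matching day-N figure, although that goes beyond what the answer above tested.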
To improve performance, deduplicate before the JOIN:
SELECT day as dt,
COUNT(distinct d1.id) as total_users,
COUNT(distinct d2.id) as retained_users
FROM (SELECT DISTINCT actor.id, DATE(created_at) day FROM `githubarchive.month.201810`) as d1
LEFT JOIN (SELECT DISTINCT actor.id, DATE(TIMESTAMP_SUB(created_at, INTERVAL -24 HOUR)) day FROM `githubarchive.month.201810`) as d2
USING (id, day)
GROUP BY 1
ORDER BY 1
The following works in BigQuery Standard SQL and is further optimized to use no JOIN at all, relying on analytic functions instead:
#standardSQL
SELECT
day,
COUNT(1) total_users,
COUNTIF(delta = 1) retained_users
FROM (
SELECT
day, id,
DATE_DIFF(day, LAG(day) OVER(PARTITION BY id ORDER BY day), DAY) delta
FROM (
SELECT DISTINCT
DATE(created_at) day,
actor.id
FROM `githubarchive.month.201810`
)
)
GROUP BY day
ORDER BY day
Or, using the notation from the original question:
#standardSQL
SELECT
day,
COUNT(1) total_users,
COUNTIF(delta = 1) retained_users
FROM (
SELECT
day, userID,
DATE_DIFF(day, LAG(day) OVER(PARTITION BY userID ORDER BY day), DAY) delta
FROM (
SELECT DISTINCT
DATE(eventDate) day,
userID
FROM `project.dataset.table`
)
)
GROUP BY day
ORDER BY day
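To see what this query counts, it can be run against a few inline rows instead of a real table. The rows below are made up for illustration (they extend the sample in the question with two later dates), and `events` is a hypothetical name.

```sql
#standardSQL
-- Hypothetical sample data: user 100000 is active on two consecutive days,
-- user 200000 comes back after a one-day gap.
WITH events AS (
  SELECT TIMESTAMP '2016-05-06 00:00:00 UTC' AS eventDate, 100000 AS userID UNION ALL
  SELECT TIMESTAMP '2016-05-06 00:00:00 UTC', 200000 UNION ALL
  SELECT TIMESTAMP '2016-05-07 00:00:00 UTC', 100000 UNION ALL
  SELECT TIMESTAMP '2016-05-08 00:00:00 UTC', 200000
)
SELECT
  day,
  COUNT(1) total_users,
  COUNTIF(delta = 1) retained_users
FROM (
  SELECT
    day, userID,
    -- Days since the same user's previous active day (NULL on the first one)
    DATE_DIFF(day, LAG(day) OVER(PARTITION BY userID ORDER BY day), DAY) delta
  FROM (
    SELECT DISTINCT
      DATE(eventDate) day,
      userID
    FROM events
  )
)
GROUP BY day
ORDER BY day
-- Expected rows (day | total_users | retained_users):
-- 2016-05-06 | 2 | 0
-- 2016-05-07 | 1 | 1
-- 2016-05-08 | 1 | 0
```

In other words, retained_users on a given day counts the users whose previous active day was exactly one day earlier. User 200000 on 2016-05-08 has a delta of 2 and is therefore not counted.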