How to filter retention calculations in BigQuery by certain user events from Firebase
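I built a query based on the one shared here: https://github.com/sagishporer/big-query-queries-for-firebase/wiki/Query:-Daily-retention
to calculate user retention in BigQuery from data streamed from Firebase.
It has worked until now, but as the dataset has grown it can no longer run because of the following error: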
Resources exceeded during query execution: The query could not be executed in the allotted memory. Peak usage: 129% of limit. Top memory consumer(s): sort operations used for analytic OVER() clauses: 100%
The query is as follows:
SELECT
install_date,
SUM(CASE
WHEN days_since_install = 0 THEN users
ELSE 0 END) AS day_0,
SUM(CASE
WHEN days_since_install = 1 THEN users
ELSE 0 END) AS day_1,
SUM(CASE
WHEN days_since_install = 2 THEN users
ELSE 0 END) AS day_2,
SUM(CASE
WHEN days_since_install = 3 THEN users
ELSE 0 END) AS day_3,
SUM(CASE
WHEN days_since_install = 4 THEN users
ELSE 0 END) AS day_4,
SUM(CASE
WHEN days_since_install = 5 THEN users
ELSE 0 END) AS day_5,
SUM(CASE
WHEN days_since_install = 6 THEN users
ELSE 0 END) AS day_6,
SUM(CASE
WHEN days_since_install = 7 THEN users
ELSE 0 END) AS day_7,
SUM(CASE
WHEN days_since_install = 8 THEN users
ELSE 0 END) AS day_8,
SUM(CASE
WHEN days_since_install = 9 THEN users
ELSE 0 END) AS day_9,
SUM(CASE
WHEN days_since_install = 10 THEN users
ELSE 0 END) AS day_10,
SUM(CASE
WHEN days_since_install = 11 THEN users
ELSE 0 END) AS day_11,
SUM(CASE
WHEN days_since_install = 12 THEN users
ELSE 0 END) AS day_12,
SUM(CASE
WHEN days_since_install = 13 THEN users
ELSE 0 END) AS day_13,
SUM(CASE
WHEN days_since_install = 14 THEN users
ELSE 0 END) AS day_14,
SUM(CASE
WHEN days_since_install = 15 THEN users
ELSE 0 END) AS day_15,
SUM(CASE
WHEN days_since_install = 16 THEN users
ELSE 0 END) AS day_16,
SUM(CASE
WHEN days_since_install = 17 THEN users
ELSE 0 END) AS day_17,
SUM(CASE
WHEN days_since_install = 18 THEN users
ELSE 0 END) AS day_18,
SUM(CASE
WHEN days_since_install = 19 THEN users
ELSE 0 END) AS day_19,
SUM(CASE
WHEN days_since_install = 20 THEN users
ELSE 0 END) AS day_20,
SUM(CASE
WHEN days_since_install = 21 THEN users
ELSE 0 END) AS day_21,
SUM(CASE
WHEN days_since_install = 22 THEN users
ELSE 0 END) AS day_22,
SUM(CASE
WHEN days_since_install = 23 THEN users
ELSE 0 END) AS day_23,
SUM(CASE
WHEN days_since_install = 24 THEN users
ELSE 0 END) AS day_24,
SUM(CASE
WHEN days_since_install = 25 THEN users
ELSE 0 END) AS day_25,
SUM(CASE
WHEN days_since_install = 26 THEN users
ELSE 0 END) AS day_26,
SUM(CASE
WHEN days_since_install = 27 THEN users
ELSE 0 END) AS day_27,
SUM(CASE
WHEN days_since_install = 28 THEN users
ELSE 0 END) AS day_28,
SUM(CASE
WHEN days_since_install = 29 THEN users
ELSE 0 END) AS day_29,
SUM(CASE
WHEN days_since_install = 30 THEN users
ELSE 0 END) AS day_30
FROM (
SELECT
DATE(TIMESTAMP_MICROS(user_first_touch_timestamp)) AS install_date,
DATE(TIMESTAMP_MICROS(event_timestamp)) AS event_realdate,
DATE_DIFF(DATE(TIMESTAMP_MICROS(event_timestamp)), DATE(TIMESTAMP_MICROS(user_first_touch_timestamp)), day) AS days_since_install,
COUNT(DISTINCT user_pseudo_id) AS users
FROM
`dataset.events_2019*`
WHERE
event_name = 'user_engagement'
AND user_pseudo_id NOT IN (
SELECT
user_pseudo_id
FROM (
SELECT
MIN(global_session_id),
user_pseudo_id,
user_first_touch_timestamp,
event_timestamp
FROM (
SELECT
*,
IF (previous_event='some_event'
AND LAG(global_session_id,1)OVER (ORDER BY global_session_id, event_name)=global_session_id,
LAG(global_session_id,1) OVER (ORDER BY global_session_id, event_name),
NULL) AS match
FROM (
SELECT
*,
LAG(event_name,1) OVER (ORDER BY global_session_id, event_name) AS previous_event
FROM (
SELECT
global_session_id,
event_name,
user_first_touch_timestamp,
event_timestamp,
user_pseudo_id
FROM (
SELECT
global_session_id,
event_name,
user_pseudo_id,
event_timestamp,
user_first_touch_timestamp,
IF (some_kill=1,
global_session_id,
NULL) AS session_some_kill,
IF (event_name='user_engagement',
global_session_id,
NULL) AS session
FROM (
SELECT
*,
CASE
WHEN event_params.key = 'Kills' AND event_params.value.int_value>0 THEN 1
ELSE 0
END AS some_kill,
SUM(is_new_session) OVER (ORDER BY user_pseudo_id, event_timestamp, event_name) AS global_session_id,
SUM(is_new_session) OVER (PARTITION BY user_pseudo_id ORDER BY event_timestamp) AS user_session_id
FROM (
SELECT
*,
CASE
WHEN event_timestamp - last_event >= (30 * 60 * 1000) OR last_event IS NULL THEN 1
ELSE 0
END AS is_new_session
FROM (
SELECT
user_pseudo_id,
event_timestamp,
event_name,
event_params,
user_first_touch_timestamp,
LAG(event_timestamp,1) OVER (PARTITION BY user_pseudo_id ORDER BY event_timestamp) AS last_event
FROM (
SELECT
user_pseudo_id,
event_timestamp,
event_name,
event_params,
user_first_touch_timestamp
FROM `dataset.events_2019*`,
UNNEST (event_params) AS event_params)
) last
) agg
)
)
WHERE
session_some_kill IS NOT NULL
OR session IS NOT NULL
GROUP BY
global_session_id,
event_name,
user_first_touch_timestamp,
event_timestamp,
user_pseudo_id
ORDER BY
global_session_id ) ) )
WHERE
match IS NOT NULL
AND event_timestamp-user_first_touch_timestamp<1.8e+9
GROUP BY
user_pseudo_id,
user_first_touch_timestamp,
event_timestamp))
GROUP BY
install_date,
event_realdate,
days_since_install )
GROUP BY
install_date
HAVING
day_0 > 0 /* Remove older dates - not enough data, you should also ignore the first record for partial data */
ORDER BY
install_date
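Try the following:
Add a WHERE clause right after your UNNEST to reduce the number of records returned, since that volume is what hurts performance, for example: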
SELECT
user_pseudo_id,
event_timestamp,
event_name,
event_params,
user_first_touch_timestamp
FROM `analytics_185672896.events_2019*`,
UNNEST (event_params) AS event_params
WHERE event_name = 'user_engagement'
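Remove the ORDER BY from the inner SQL to avoid extra computation where it is not needed: BQ has to gather all the results and order them before it can perform the next step. See this link for more information.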
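As an illustration, the two suggestions could be applied to the innermost subquery roughly like this. It is a minimal sketch, not a tested rewrite of the full query. It assumes the later steps only need the `user_engagement` and `some_event` rows (the second name is the placeholder used in the question), it uses the table name from the question, and the aliases `param`, `param_key` and `param_int_value` are made up for the example.

```sql
-- Sketch only: filter as early as possible and leave out the inner ORDER BY.
SELECT
  user_pseudo_id,
  event_timestamp,
  event_name,
  param.key AS param_key,
  param.value.int_value AS param_int_value,
  user_first_touch_timestamp,
  -- Partitioning by user keeps each window sort small, unlike an
  -- OVER (ORDER BY ...) with no PARTITION BY, which sorts every row together.
  LAG(event_timestamp) OVER (
    PARTITION BY user_pseudo_id
    ORDER BY event_timestamp
  ) AS last_event
FROM
  `dataset.events_2019*`,
  UNNEST(event_params) AS param
WHERE
  -- Assumption: no other event names are needed further up the query.
  event_name IN ('user_engagement', 'some_event')
-- No ORDER BY here: the outer levels aggregate again, so a sort at this
-- level only adds cost.
```

The memory error in the question names sort operations for analytic OVER() clauses, so the window functions that order the whole table without a PARTITION BY are the first place to look once the input has been reduced.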
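We do this kind of thing in two passes. First we compute the days on which each user was active, then we run whatever calculation is needed on top of that.
We use something like this to store the active days for each user: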
SELECT
user_id as userId,
BIT_OR(1 << GREATEST(0, (DIV(event_timestamp, (24 * 60 * 60 * 1000000)) - DIV(user_first_touch_timestamp,(24 * 60 * 60 * 1000000)) ))) as DX,
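The active days are stored as a bit field, up to 64 days per INT64, instead of computing and storing each day separately.
You can add more days as needed by offsetting the shift, with each extra INT64 covering another 64 days.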
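To make the bit field concrete, here is a small worked example that runs without any table. It is only an illustration: the user, the value 131 and the column names are made up for the example.

```sql
-- A user who was active 0, 1 and 7 days after install.
-- Bits 0, 1 and 7 are set: 1 + 2 + 128 = 131.
SELECT
  dx,
  (dx & (1 << 7)) != 0 AS active_on_day_7,  -- true
  (dx & (1 << 2)) != 0 AS active_on_day_2   -- false
FROM (
  SELECT (1 << 0) | (1 << 1) | (1 << 7) AS dx
);
```

Testing a day is a single bitwise AND, which is why the second pass stays cheap.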
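The query runs and exports very quickly.
This is in UTC; you can convert to local time if needed.
We only need GREATEST because we group by user_id and use account linking: when a user uninstalls, reinstalls and links again, they get a new first_touch_timestamp that is later than their older events, which would otherwise make the day difference (and so the shift) negative.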
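As an illustration, the two passes could be combined along the lines below. This is a sketch under assumptions, not the exact query described above: it uses the `dataset.events_2019*` table and `user_pseudo_id` from the question instead of `user_id`, keeps only `user_engagement` events, and leaves out the session-based exclusion from the question. The names `events`, `active_days`, `day_index`, `dx` and `dx_64_127` are made up for the example.

```sql
WITH events AS (
  SELECT
    user_pseudo_id,
    DATE(TIMESTAMP_MICROS(user_first_touch_timestamp)) AS install_date,
    -- Whole UTC days between first touch and the event, never negative.
    GREATEST(
      0,
      DIV(event_timestamp, 24 * 60 * 60 * 1000000)
        - DIV(user_first_touch_timestamp, 24 * 60 * 60 * 1000000)
    ) AS day_index
  FROM `dataset.events_2019*`
  WHERE event_name = 'user_engagement'
    AND user_first_touch_timestamp IS NOT NULL
),
-- Pass 1: one row per user, with bit N set if the user was active on day N.
active_days AS (
  SELECT
    user_pseudo_id,
    install_date,
    -- Shifts of 64 or more yield 0, so days past 63 drop out of dx.
    BIT_OR(1 << day_index) AS dx,
    -- Optional second field for days 64 to 127, using an offset shift.
    BIT_OR(IF(day_index BETWEEN 64 AND 127, 1 << (day_index - 64), 0)) AS dx_64_127
  FROM events
  GROUP BY user_pseudo_id, install_date
)
-- Pass 2: count users per install date whose bit for a given day is set.
SELECT
  install_date,
  COUNTIF((dx & (1 << 0)) != 0) AS day_0,
  COUNTIF((dx & (1 << 1)) != 0) AS day_1,
  COUNTIF((dx & (1 << 7)) != 0) AS day_7,
  COUNTIF((dx & (1 << 30)) != 0) AS day_30
FROM active_days
GROUP BY install_date
ORDER BY install_date;
```

More day columns can be added in the same way. Because the first pass is a plain GROUP BY with no analytic functions, it avoids the large window sorts that the error message in the question points to.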
Hope this helps.