How to obtain the most recent row per type and perform calculations, depending on the row type?
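I need some help writing/optimizing a query that retrieves the latest version of each row by type and performs some calculations depending on the type. I think it is best illustrated with an example.
Given the following dataset: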
+-------+-------------------+---------------------+-------------+---------------------+--------+----------+
| id | event_type | event_timestamp | message_id | sent_at | status | rate |
+-------+-------------------+---------------------+-------------+---------------------+--------+----------+
| 1 | create | 2016-11-25 09:17:48 | 1 | 2016-11-25 09:17:48 | 0 | 0.500000 |
| 2 | status_update | 2016-11-25 09:24:38 | 1 | 2016-11-25 09:28:49 | 1 | 0.500000 |
| 3 | create | 2016-11-25 09:47:48 | 2 | 2016-11-25 09:47:48 | 0 | 0.500000 |
| 4 | status_update | 2016-11-25 09:54:38 | 2 | 2016-11-25 09:48:49 | 1 | 0.500000 |
| 5 | rate_update | 2016-11-25 09:55:07 | 2 | 2016-11-25 09:50:07 | 0 | 1.000000 |
| 6 | create | 2016-11-26 09:17:48 | 3 | 2016-11-26 09:17:48 | 0 | 0.500000 |
| 7 | create | 2016-11-27 09:17:48 | 4 | 2016-11-27 09:17:48 | 0 | 0.500000 |
| 8 | rate_update | 2016-11-27 09:55:07 | 4 | 2016-11-27 09:50:07 | 0 | 2.000000 |
| 9 | rate_update | 2016-11-27 09:55:07 | 2 | 2016-11-25 09:55:07 | 0 | 2.000000 |
+-------+-------------------+---------------------+-------------+---------------------+--------+----------+
The expected result is:
+------------+--------------------+--------------------+-----------------------+
| sent_at | sum(submitted_msg) | sum(delivered_msg) | sum(rate_total) |
+------------+--------------------+--------------------+-----------------------+
| 2016-11-25 | 2 | 2 | 2.500000 |
| 2016-11-26 | 1 | 0 | 0.500000 |
| 2016-11-27 | 1 | 0 | 2.000000 |
+------------+--------------------+--------------------+-----------------------+
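The query used to obtain this result is at the end of the post. I'm willing to bet there is a way to optimize it, since it uses subqueries with joins, and from what I have read about BigQuery, joins are best avoided. But first, some background:
In essence, the dataset represents an append-only table to which multiple events are written. The data size is in the hundreds of millions of rows and will grow to billions and beyond. Since updates in BigQuery are not practical and the data is being streamed to BQ, I need a way to retrieve the most recent event of each type, perform some calculations based on certain conditions, and return an accurate result. The query is generated dynamically from user input, so it can include more fields/calculations, which have been omitted for simplicity.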
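- There is only one create event per message, but n of any other kind.
- For each group of events, only the latest one should be taken into account when doing the calculations.
  - status_update - updates the status
  - rate_update - updates the rate
  - create - self-explanatory
- Every event that is not create might not carry the rest of the original information, or that information may be inaccurate (except for message_id and the field the event is operating on). The dataset is simplified, but imagine there are many more columns, and more event types will be added later.
  - For example, a rate_update may or may not have the status field set, or the value may not be the final one, so no calculation can be made on the status field of a rate_update event. The same goes for status_update.
- It can be assumed that the table is partitioned by date and that every query will make use of the partitions. Those conditions have been omitted for now for the sake of simplicity.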
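To make the rule "only the latest event of each type counts" concrete, here is a minimal sketch that keeps the newest row per message_id and event_type. It assumes the table is called events, as in the query at the end of the post, and that event_timestamp decides which row is newest. It is only an illustration of the rule, not the aggregation the question asks for.

```sql
#standardSQL
-- Sketch: keep only the newest row for each (message_id, event_type) pair.
-- Assumes a table named events with the columns shown in the dataset above.
SELECT * EXCEPT (rn)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY message_id, event_type
      ORDER BY event_timestamp DESC
    ) AS rn
  FROM events
)
WHERE rn = 1
```

On the sample data this drops only the row with id 5, because the row with id 9 is a newer rate_update for message 2.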
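So I guess I have a few questions:
- How can this query be optimized?
- Would it be better to put the events other than create in their own tables, where the only fields available are the ones relevant to the event and the ones needed for the joins (message_id, event_timestamp)? Would this reduce the amount of data processed?
- What would be the best way to add more events in the future, each with its own conditions and calculations?
Actually, any advice on how to query this dataset efficiently and in a friendly way is more than welcome! Thank you! :)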
The monster I've come up with is the following. The INNER JOINs are used to retrieve the latest version of each row, as per this resource
select
sent_at as sent_at,
sum(submitted_msg) as submitted,
sum(delivered_msg) as delivered,
sum(sales_rate_total) as sales_rate_total
FROM (
#DELIVERED
SELECT
d.message_id,
FORMAT_TIMESTAMP('%Y-%m-%d 00:00:00', sent_at) AS sent_at,
0 as submitted_msg,
sum(if(status=1,1,0)) as delivered_msg,
0 as sales_rate_total
FROM `events` d
INNER JOIN
(
select message_id, max(event_timestamp) as ts
from `events`
where event_type = "status_update"
group by 1
) g on d.message_id = g.message_id and d.event_timestamp = g.ts
GROUP BY 1,2
UNION ALL
#SALES RATE
SELECT
s.message_id,
FORMAT_TIMESTAMP('%Y-%m-%d 00:00:00', sent_at) AS sent_at,
0 as submitted_msg,
0 as delivered_msg,
sum(rate) as sales_rate_total
FROM `events` s
INNER JOIN
(
select message_id, max(event_timestamp) as ts
from `events`
where event_type in ("rate_update", "create")
group by 1
) f on s.message_id = f.message_id and s.event_timestamp = f.ts
GROUP BY 1,2
UNION ALL
#SUBMITTED & REST
SELECT
r.message_id,
FORMAT_TIMESTAMP('%Y-%m-%d 00:00:00', sent_at) AS sent_at,
sum(if(status=0,1,0)) as submitted_msg,
0 as delivered_msg,
0 as sales_rate_total
FROM `events` r
INNER JOIN
(
select message_id, max(event_timestamp) as ts
from `events`
where event_type = "create"
group by 1
) e on r.message_id = e.message_id and r.event_timestamp = e.ts
GROUP BY 1, 2
) k
group by 1
For every table that holds multiple events and where we need to pick the latest one, we have a view.
View: user_profile_latest
SELECT * from (
select rank() over (partition by user_id order by bq.created DESC, bq.insert_id desc) as _rank,
*
FROM [user_profile_event]
) where _rank=1
We maintain a bq record with created and insert_id fields on each row, which we use for deduplication.
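The view above is written in legacy SQL (note the square brackets around the table name). A rough standard SQL equivalent might look like the sketch below. It assumes the same user_profile_event table, a user_id column, and a bq record holding created and insert_id, all taken from the view above. It also swaps RANK() for ROW_NUMBER(), which is a deliberate change rather than a literal translation.

```sql
#standardSQL
-- Sketch of the same "latest row per key" view in standard SQL.
-- Assumes `user_profile_event` has a user_id column and a bq STRUCT
-- with created and insert_id fields (names taken from the view above).
SELECT * EXCEPT (_rank)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY user_id
      ORDER BY bq.created DESC, bq.insert_id DESC
    ) AS _rank
  FROM `user_profile_event`
)
WHERE _rank = 1
```

ROW_NUMBER() returns exactly one row per user_id even if two events tie on both created and insert_id, whereas RANK() would return every tied row.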
How can this query be optimized?
Try the version below
#standardSQL
WITH types AS (
SELECT
FORMAT_TIMESTAMP('%Y-%m-%d', sent_at) AS sent_at,
message_id,
FIRST_VALUE(status) OVER(PARTITION BY message_id ORDER BY (event_type = "create") DESC, event_timestamp DESC) AS submitted_status,
FIRST_VALUE(status) OVER(PARTITION BY message_id ORDER BY (event_type = "status_update") DESC, event_timestamp DESC) AS delivered_status,
FIRST_VALUE(rate) OVER(PARTITION BY message_id ORDER BY (event_type IN ("rate_update", "create")) DESC, event_timestamp DESC) AS sales_rate
FROM events
), latest AS (
SELECT
sent_at,
message_id,
ANY_VALUE(IF(submitted_status=0,1,0)) AS submitted,
ANY_VALUE(IF(delivered_status=1,1,0)) AS delivered,
ANY_VALUE(sales_rate) AS sales_rate
FROM types
GROUP BY 1, 2
)
SELECT
sent_at,
SUM(submitted) AS submitted,
SUM(delivered) AS delivered,
SUM(sales_rate) AS sales_rate_total
FROM latest
GROUP BY 1
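It is compact enough to be easily managed, with no redundancy and no joins at all.
If your table is partitioned, you can easily make use of that by adjusting the query in just one place.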
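As a sketch of what "just one place" could mean: the partition filter goes into the types CTE, the only place where events is read. This assumes events is an ingestion-time partitioned table, so the _PARTITIONTIME pseudo-column exists; the date range is made up for illustration. For a table partitioned on one of its own columns, the filter would go on that column instead.

```sql
#standardSQL
-- Sketch only: assumes events is ingestion-time partitioned.
-- The date range below is hypothetical.
WITH types AS (
  SELECT
    FORMAT_TIMESTAMP('%Y-%m-%d', sent_at) AS sent_at,
    message_id,
    FIRST_VALUE(status) OVER(PARTITION BY message_id ORDER BY (event_type = "create") DESC, event_timestamp DESC) AS submitted_status,
    FIRST_VALUE(status) OVER(PARTITION BY message_id ORDER BY (event_type = "status_update") DESC, event_timestamp DESC) AS delivered_status,
    FIRST_VALUE(rate) OVER(PARTITION BY message_id ORDER BY (event_type IN ("rate_update", "create")) DESC, event_timestamp DESC) AS sales_rate
  FROM events
  -- The only change compared to the query above: prune partitions here.
  WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2016-11-25') AND TIMESTAMP('2016-11-27')
), latest AS (
  SELECT
    sent_at,
    message_id,
    ANY_VALUE(IF(submitted_status=0,1,0)) AS submitted,
    ANY_VALUE(IF(delivered_status=1,1,0)) AS delivered,
    ANY_VALUE(sales_rate) AS sales_rate
  FROM types
  GROUP BY 1, 2
)
SELECT
  sent_at,
  SUM(submitted) AS submitted,
  SUM(delivered) AS delivered,
  SUM(sales_rate) AS sales_rate_total
FROM latest
GROUP BY 1
```

Keep in mind that the window functions only see rows that survive the filter, so an update that landed in a partition outside the range is ignored for that message.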
If you want to check the query above on a small volume first, you can use the dummy data below
WITH events AS (
SELECT 1 AS id, 'create' AS event_type, TIMESTAMP '2016-11-25 09:17:48' AS event_timestamp, 1 AS message_id, TIMESTAMP '2016-11-25 09:17:48' AS sent_at, 0 AS status, 0.500000 AS rate UNION ALL
SELECT 2 AS id, 'status_update' AS event_type, TIMESTAMP '2016-11-25 09:24:38' AS event_timestamp, 1 AS message_id, TIMESTAMP '2016-11-25 09:28:49' AS sent_at, 1 AS status, 0.500000 AS rate UNION ALL
SELECT 3 AS id, 'create' AS event_type, TIMESTAMP '2016-11-25 09:47:48' AS event_timestamp, 2 AS message_id, TIMESTAMP '2016-11-25 09:47:48' AS sent_at, 0 AS status, 0.500000 AS rate UNION ALL
SELECT 4 AS id, 'status_update' AS event_type, TIMESTAMP '2016-11-25 09:54:38' AS event_timestamp, 2 AS message_id, TIMESTAMP '2016-11-25 09:48:49' AS sent_at, 1 AS status, 0.500000 AS rate UNION ALL
SELECT 5 AS id, 'rate_update' AS event_type, TIMESTAMP '2016-11-25 09:55:07' AS event_timestamp, 2 AS message_id, TIMESTAMP '2016-11-25 09:50:07' AS sent_at, 0 AS status, 1.000000 AS rate UNION ALL
SELECT 6 AS id, 'create' AS event_type, TIMESTAMP '2016-11-26 09:17:48' AS event_timestamp, 3 AS message_id, TIMESTAMP '2016-11-26 09:17:48' AS sent_at, 0 AS status, 0.500000 AS rate UNION ALL
SELECT 7 AS id, 'create' AS event_type, TIMESTAMP '2016-11-27 09:17:48' AS event_timestamp, 4 AS message_id, TIMESTAMP '2016-11-27 09:17:48' AS sent_at, 0 AS status, 0.500000 AS rate UNION ALL
SELECT 8 AS id, 'rate_update' AS event_type, TIMESTAMP '2016-11-27 09:55:07' AS event_timestamp, 4 AS message_id, TIMESTAMP '2016-11-27 09:50:07' AS sent_at, 0 AS status, 2.000000 AS rate UNION ALL
SELECT 9 AS id, 'rate_update' AS event_type, TIMESTAMP '2016-11-27 09:55:07' AS event_timestamp, 2 AS message_id, TIMESTAMP '2016-11-25 09:55:07' AS sent_at, 0 AS status, 2.000000 AS rate
)
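To run the query against the dummy data, the two WITH clauses have to be merged into one: events stays the first CTE, and types follows after a comma instead of starting a new WITH. A shortened sketch of that chaining, using only the three rows for messages 3 and 4 to keep it brief:

```sql
#standardSQL
-- Sketch: dummy events CTE chained with the query above.
-- Only rows 6, 7 and 8 of the sample data are included for brevity.
WITH events AS (
  SELECT 6 AS id, 'create' AS event_type, TIMESTAMP '2016-11-26 09:17:48' AS event_timestamp, 3 AS message_id, TIMESTAMP '2016-11-26 09:17:48' AS sent_at, 0 AS status, 0.500000 AS rate UNION ALL
  SELECT 7 AS id, 'create' AS event_type, TIMESTAMP '2016-11-27 09:17:48' AS event_timestamp, 4 AS message_id, TIMESTAMP '2016-11-27 09:17:48' AS sent_at, 0 AS status, 0.500000 AS rate UNION ALL
  SELECT 8 AS id, 'rate_update' AS event_type, TIMESTAMP '2016-11-27 09:55:07' AS event_timestamp, 4 AS message_id, TIMESTAMP '2016-11-27 09:50:07' AS sent_at, 0 AS status, 2.000000 AS rate
), types AS (  -- note the comma: this continues the same WITH clause
  SELECT
    FORMAT_TIMESTAMP('%Y-%m-%d', sent_at) AS sent_at,
    message_id,
    FIRST_VALUE(status) OVER(PARTITION BY message_id ORDER BY (event_type = "create") DESC, event_timestamp DESC) AS submitted_status,
    FIRST_VALUE(status) OVER(PARTITION BY message_id ORDER BY (event_type = "status_update") DESC, event_timestamp DESC) AS delivered_status,
    FIRST_VALUE(rate) OVER(PARTITION BY message_id ORDER BY (event_type IN ("rate_update", "create")) DESC, event_timestamp DESC) AS sales_rate
  FROM events
), latest AS (
  SELECT
    sent_at,
    message_id,
    ANY_VALUE(IF(submitted_status=0,1,0)) AS submitted,
    ANY_VALUE(IF(delivered_status=1,1,0)) AS delivered,
    ANY_VALUE(sales_rate) AS sales_rate
  FROM types
  GROUP BY 1, 2
)
SELECT
  sent_at,
  SUM(submitted) AS submitted,
  SUM(delivered) AS delivered,
  SUM(sales_rate) AS sales_rate_total
FROM latest
GROUP BY 1
ORDER BY 1
-- Result for these three rows:
-- 2016-11-26 | 1 | 0 | 0.5
-- 2016-11-27 | 1 | 0 | 2.0
```

With all nine dummy rows in the events CTE, the same query should reproduce the expected result table from the question.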