Selecting values from a nested column based on a condition applied to another nested column in BigQuery
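How can I use the index of a "special" value in one nested column (for example, the index of the maximum value in that column) to select the value at the same index in another nested column?
For example, consider a table with the following schema: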
| Field name | Type | Mode |
|---|---|---|
| id | STRING | NULLABLE |
| username | STRING | NULLABLE |
| products | RECORD | NULLABLE |
| products.list | RECORD | REPEATED |
| products.list.item | STRING | NULLABLE |
| ordered | RECORD | NULLABLE |
| ordered.list | RECORD | REPEATED |
| ordered.list.item | INTEGER | NULLABLE |
| total_orders | STRING | NULLABLE |
| update_time | TIMESTAMP | NULLABLE |
| update_id | INTEGER | NULLABLE |
The first few rows look like this:
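| Row | id | username | products.list.item | ordered.list.item | total_orders | update_time | update_id |
|---|---|---|---|---|---|---|---|
| 1 | 1234 | a_turing | Apple | 1 | 3 | 2021-08-14 20:03:22.100846 UTC | 121231 |
| | | | Orange | 0 | | | |
| | | | Pear | 2 | | | |
| 2 | 5678 | g_hopper | Apple | 0 | 2 | 2021-08-15 09:36:48.220464 UTC | 121232 |
| | | | Orange | 2 | | | |
| | | | Pear | 0 | | | |
| 3 | 1122 | a_lovelace | Apple | 0 | 1 | 2021-08-15 13:59:03.441506 UTC | 121233 |
| | | | Orange | 1 | | | |
| | | | Pear | 0 | | | |
| 4 | 3344 | v_nabokov | Apple | 1 | 2 | 2021-08-17 17:34:53.415406 UTC | 121234 |
| | | | Orange | 0 | | | |
| | | | Pear | 1 | | | |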
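I want to select the most-ordered product from each id's most recent order, and exclude orders that have no single most-ordered product (for example, if a customer ordered the same number of apples, oranges, and pears).
The query I'm currently using is a chain of CTEs, one per product type, plus an extra column holding the largest quantity of any product in each user's order (max_ordered). I then join the CTEs together on the id column: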
WITH RANKED_ORDERS AS (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY update_time DESC) AS rn
  FROM mycompany.engagement.products_ordered),
LATEST_ORDERS AS (
  SELECT * FROM RANKED_ORDERS WHERE rn = 1),
-- ---------------------- Apples Ordered -----------------------
APPLES_INDEXED AS (
  SELECT id, username, ordered, flattened_products, offset_nk, update_time, rn
  FROM LATEST_ORDERS
  CROSS JOIN UNNEST(LATEST_ORDERS.products.list) AS flattened_products
  WITH OFFSET AS offset_nk
  WHERE flattened_products.item IN ('Apple')
  ORDER BY offset_nk),
APPLES_ORDERED AS (
  SELECT id, username, update_time, ordered.list[OFFSET(offset_nk)].item AS apples_ordered
  FROM APPLES_INDEXED
  ORDER BY
    update_time ASC),
-- ---------------------- Oranges Ordered ----------------------
ORANGES_INDEXED AS (
  SELECT id, username, ordered, flattened_products, offset_nk, update_time, rn
  FROM LATEST_ORDERS
  CROSS JOIN UNNEST(LATEST_ORDERS.products.list) AS flattened_products
  WITH OFFSET AS offset_nk
  WHERE flattened_products.item IN ('Orange')
  ORDER BY offset_nk),
ORANGES_ORDERED AS (
  SELECT id, username, update_time, ordered.list[OFFSET(offset_nk)].item AS oranges_ordered
  FROM ORANGES_INDEXED
  ORDER BY
    update_time ASC),
-- ---------------------- Pears Ordered -----------------------
PEARS_INDEXED AS (
  SELECT id, username, ordered, flattened_products, offset_nk, update_time, rn
  FROM LATEST_ORDERS
  CROSS JOIN UNNEST(LATEST_ORDERS.products.list) AS flattened_products
  WITH OFFSET AS offset_nk
  WHERE flattened_products.item IN ('Pear')
  ORDER BY offset_nk),
PEARS_ORDERED AS (
  SELECT id, username, update_time, ordered.list[OFFSET(offset_nk)].item AS pears_ordered
  FROM PEARS_INDEXED
  ORDER BY
    update_time ASC),
-- --------------- Max Product Ordered per Order --------------
MAX_ORDERED AS (
  SELECT
    id, username, MAX(orders_per_username.item) AS max_ordered, total_orders
  FROM
    LATEST_ORDERS, UNNEST(ordered.list) AS orders_per_username
  GROUP BY id, username, total_orders),
-- -------------------- Orders In Columns ---------------------
ORDERS_IN_COLUMNS AS (
  SELECT APPLES_ORDERED.username, APPLES_ORDERED.update_time, APPLES_ORDERED.apples_ordered,
    ORANGES_ORDERED.oranges_ordered, PEARS_ORDERED.pears_ordered, MAX_ORDERED.max_ordered
  FROM APPLES_ORDERED
  LEFT JOIN ORANGES_ORDERED ON ORANGES_ORDERED.id = APPLES_ORDERED.id
  LEFT JOIN PEARS_ORDERED ON PEARS_ORDERED.id = APPLES_ORDERED.id
  LEFT JOIN MAX_ORDERED ON MAX_ORDERED.id = APPLES_ORDERED.id),
-- ------- Orders with a most ordered product -----------------
NO_CONFLICTS AS (
  SELECT * FROM ORDERS_IN_COLUMNS
  WHERE
    max_ordered > 0 AND
    (
      (apples_ordered NOT IN (oranges_ordered, pears_ordered) AND apples_ordered = max_ordered)
      OR
      (oranges_ordered NOT IN (apples_ordered, pears_ordered) AND oranges_ordered = max_ordered)
      OR
      (pears_ordered NOT IN (apples_ordered, oranges_ordered) AND pears_ordered = max_ordered)
    )
)
SELECT * FROM NO_CONFLICTS
This returns the following table:
| Row | username | update_time | apples_ordered | oranges_ordered | pears_ordered | max_ordered |
|---|---|---|---|---|---|---|
| 1 | a_turing | 2021-08-14 20:03:22.100846 UTC | 1 | 0 | 2 | 2 |
| 2 | g_hopper | 2021-08-15 09:36:48.220464 UTC | 0 | 2 | 0 | 2 |
| 3 | a_lovelace | 2021-08-15 13:59:03.441506 UTC | 0 | 1 | 0 | 1 |
This is what I expect.
However, I can't work out how to simply return a table that looks like this:
| Row | username | update_time | max_product_ordered |
|---|---|---|---|
| 1 | a_turing | 2021-08-14 20:03:22.100846 UTC | Pear |
| 2 | g_hopper | 2021-08-15 09:36:48.220464 UTC | Orange |
| 3 | a_lovelace | 2021-08-15 13:59:03.441506 UTC | Orange |
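I'm also fairly sure that, although this query basically works (I end up post-processing in Python to get to the final step), it is probably extremely inefficient given its heavy use of common table expressions.
Is there a more efficient way to query the BigQuery table than the one I've written, or do I need to restructure the table entirely to get any speed-up? At the moment the query takes about 10 seconds to run against a table of roughly 10,000 rows and 12 columns, and I assume the slowness is due to the multiple CTEs.
I've been banging my head against the wall for the past two weeks trying to improve my query, without much progress. Any help is sincerely appreciated!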
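The Python post-processing step mentioned in the question is not shown. As a rough sketch of what such a last step might look like, assuming each result row arrives as a dict keyed by the column names of the wide result table (the function name `most_ordered_product` and the `COLUMN_TO_PRODUCT` mapping are hypothetical, introduced only for illustration):

```python
# Hypothetical mapping from the wide result columns to product names.
COLUMN_TO_PRODUCT = {
    "apples_ordered": "Apple",
    "oranges_ordered": "Orange",
    "pears_ordered": "Pear",
}


def most_ordered_product(row):
    """Return the product whose count is the unique, positive maximum, else None."""
    counts = {product: row[column] for column, product in COLUMN_TO_PRODUCT.items()}
    top = max(counts.values())
    winners = [product for product, count in counts.items() if count == top]
    # A tie (or an all-zero order) means there is no single most-ordered product.
    if top <= 0 or len(winners) != 1:
        return None
    return winners[0]


# The three rows returned by the CTE query in the question.
rows = [
    {"username": "a_turing", "apples_ordered": 1, "oranges_ordered": 0, "pears_ordered": 2},
    {"username": "g_hopper", "apples_ordered": 0, "oranges_ordered": 2, "pears_ordered": 0},
    {"username": "a_lovelace", "apples_ordered": 0, "oranges_ordered": 1, "pears_ordered": 0},
]
result = {row["username"]: most_ordered_product(row) for row in rows}
print(result)
```

This needs one dictionary entry per product, which is the same scaling problem as having one CTE per product.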
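Consider the following approach:
with latest_orders as (
  -- keep only the most recent row per id
  select * from `mycompany.engagement.products_ordered`
  where true
  qualify 1 = row_number() over(partition by id order by update_time desc)
), qualified_items as (
  select *,
    -- offsets in ordered.list whose quantity ties for the maximum
    array(
      select offset from t.ordered.list with offset
      where true
      qualify 1 = rank() over(order by item desc)
    ) items
  from latest_orders t
)
select id, username, update_time,
  -- product at the single winning offset; .item returns the STRING rather than the STRUCT
  products.list[offset(items[offset(0)])].item as max_product_ordered,
from qualified_items
where array_length(items) = 1  -- drop orders whose maximum is tied
If applied to the sample data in your question, the output is one row per qualifying id: Pear for a_turing, Orange for g_hopper, and Orange for a_lovelace.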
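To make the logic of the query easier to follow, here is a small plain-Python model of what the `rank()` subquery and the `array_length(items) = 1` filter do for a single row. It is only an illustration, under the assumption that `products.list` and `ordered.list` are parallel arrays as in the sample data; the function names `top_offsets` and `max_product_ordered` are made up for this sketch and are not part of BigQuery.

```python
def top_offsets(ordered):
    """Offsets whose value ties for the maximum (the rows where rank() = 1)."""
    if not ordered:
        return []
    top = max(ordered)
    return [offset for offset, value in enumerate(ordered) if value == top]


def max_product_ordered(products, ordered):
    """Product at the single top offset, or None when the maximum is tied."""
    items = top_offsets(ordered)
    # Mirrors `where array_length(items) = 1` in the query.
    if len(items) != 1:
        return None
    return products[items[0]]


# Sample data from the question: one quantity list per user, same product order.
products = ["Apple", "Orange", "Pear"]
sample = {
    "a_turing": [1, 0, 2],
    "g_hopper": [0, 2, 0],
    "a_lovelace": [0, 1, 0],
    "v_nabokov": [1, 0, 1],
}
for username, ordered in sample.items():
    print(username, top_offsets(ordered), max_product_ordered(products, ordered))
```

The offsets list for v_nabokov has two entries (Apple and Pear tie), so the length check rejects that row. Note that this model, like the query it mirrors, does not test that the maximum is greater than zero, which the `max_ordered > 0` condition in the question does; that would only matter for an order whose single listed product has quantity 0.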