Selecting values from a nested column based on a condition applied to another nested column in BigQuery

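How can I take the index of a "special" value in one nested column (for example, the index of the maximum value in that column) and use it to select the value at the same index in another nested column?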

For example, consider a table with the following schema:

Field name Type Mode
id STRING NULLABLE
username STRING NULLABLE
▼ products RECORD NULLABLE
     ▼ list RECORD REPEATED
            item STRING NULLABLE
▼ ordered RECORD NULLABLE
     ▼ list RECORD REPEATED
            item INTEGER NULLABLE
total_orders STRING NULLABLE
update_time TIMESTAMP NULLABLE
update_id INTEGER NULLABLE

The first few rows look like this:

Row  id    username    products.list.item  ordered.list.item  total_orders  update_time                     update_id
1    1234  a_turing    Apple               1                  3             2021-08-14 20:03:22.100846 UTC  121231
                       Orange              0
                       Pear                2
2    5678  g_hopper    Apple               0                  2             2021-08-15 09:36:48.220464 UTC  121232
                       Orange              2
                       Pear                0
3    1122  a_lovelace  Apple               0                  1             2021-08-15 13:59:03.441506 UTC  121233
                       Orange              1
                       Pear                0
4    3344  v_nabokov   Apple               1                  2             2021-08-17 17:34:53.415406 UTC  121234
                       Orange              0
                       Pear                1

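I want to select the most-ordered product in the most recent order for each id, and exclude orders that have no single most-ordered product (for example, if the customer ordered equal quantities of Apple, Orange, and Pear).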

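The query I currently use is a chain of CTEs, one per product type, plus an extra column holding the maximum quantity of any product ordered per user (max_ordered). I then join the CTEs together on the id column: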

WITH RANKED_ORDERS AS( 
SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY update_time DESC) AS rn
FROM mycompany.engagement.products_ordered),

LATEST_ORDERS AS(
SELECT * FROM RANKED_ORDERS WHERE rn = 1),

-- ---------------------- Apples Ordered -----------------------
APPLES_INDEXED AS(
SELECT id, username, ordered, flattened_products, offset_nk, update_time, rn
FROM LATEST_ORDERS
CROSS JOIN UNNEST(LATEST_ORDERS.products.list) AS flattened_products
WITH OFFSET as offset_nk
WHERE flattened_products.item in ('Apple')
ORDER BY offset_nk),

APPLES_ORDERED AS(
SELECT id, username, update_time, ordered.list[OFFSET(offset_nk)].item as apples_ordered 
FROM APPLES_INDEXED 
ORDER BY
update_time ASC),

-- ---------------------- Oranges Ordered ----------------------
ORANGES_INDEXED AS(
SELECT id, username, ordered, flattened_products, offset_nk, update_time, rn
FROM LATEST_ORDERS
CROSS JOIN UNNEST(LATEST_ORDERS.products.list) AS flattened_products
WITH OFFSET as offset_nk
WHERE flattened_products.item in ('Orange')
ORDER BY offset_nk),

ORANGES_ORDERED AS(
SELECT id, username, update_time, ordered.list[OFFSET(offset_nk)].item as oranges_ordered 
FROM ORANGES_INDEXED 
ORDER BY
update_time ASC),

-- ---------------------- Pears Ordered -----------------------
PEARS_INDEXED AS(
SELECT id, username, ordered, flattened_products, offset_nk, update_time, rn
FROM LATEST_ORDERS
CROSS JOIN UNNEST(LATEST_ORDERS.products.list) AS flattened_products
WITH OFFSET as offset_nk
WHERE flattened_products.item in ('Pear')
ORDER BY offset_nk),

PEARS_ORDERED AS(
SELECT id, username, update_time, ordered.list[OFFSET(offset_nk)].item as pears_ordered 
FROM PEARS_INDEXED 
ORDER BY
update_time ASC),

-- --------------- Max Product Ordered per Order --------------
MAX_ORDERED AS(
SELECT
id, username, MAX(orders_per_username.item) as max_ordered, total_orders
FROM
LATEST_ORDERS, UNNEST(ordered.list) as orders_per_username
GROUP BY id, username, total_orders),

-- -------------------- Orders In Columns ---------------------
ORDERS_IN_COLUMNS AS(
SELECT APPLES_ORDERED.username, APPLES_ORDERED.update_time, APPLES_ORDERED.apples_ordered,
ORANGES_ORDERED.oranges_ordered, PEARS_ORDERED.pears_ordered, MAX_ORDERED.max_ordered
FROM APPLES_ORDERED
LEFT JOIN ORANGES_ORDERED ON ORANGES_ORDERED.id = APPLES_ORDERED.id
LEFT JOIN PEARS_ORDERED ON PEARS_ORDERED.id = APPLES_ORDERED.id
LEFT JOIN MAX_ORDERED ON MAX_ORDERED.id = APPLES_ORDERED.id),

-- ------- Orders with a most ordered product -----------------
NO_CONFLICTS AS(
SELECT * FROM ORDERS_IN_COLUMNS
WHERE
max_ordered > 0 AND
(
    (apples_ordered not in (oranges_ordered, pears_ordered) AND apples_ordered = max_ordered)
OR
    (oranges_ordered not in (apples_ordered, pears_ordered) AND oranges_ordered = max_ordered)
OR
    (pears_ordered not in (apples_ordered, oranges_ordered) AND pears_ordered = max_ordered)
)
)

SELECT * FROM NO_CONFLICTS

This returns the following table:

Row username update_time apples_ordered oranges_ordered pears_ordered max_ordered
1 a_turing 2021-08-14 20:03:22.100846 UTC 1 0 2 2
2 g_hopper 2021-08-15 09:36:48.220464 UTC 0 2 0 2
3 a_lovelace 2021-08-15 13:59:03.441506 UTC 0 1 0 1

This is as expected.
However, I can't figure out how to simply return a table that looks like this:

Row username update_time max_product_ordered
1 a_turing 2021-08-14 20:03:22.100846 UTC Pear
2 g_hopper 2021-08-15 09:36:48.220464 UTC Orange
3 a_lovelace 2021-08-15 13:59:03.441506 UTC Orange

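I'm also fairly sure that, although this query mostly works (I end up post-processing in Python to get to the last step), it is probably extremely inefficient given its heavy use of common table expressions.
Is there a more efficient way to query a BigQuery table than the approach I've written, or do I need to restructure the table entirely to get any speedup? Currently the query takes about 10 seconds to run on a table with roughly 10,000 rows and 12 columns, and I assume the slowness is due to the multiple CTEs.
I've been banging my head against the wall for the past two weeks trying to improve this query, without much progress. Any help is sincerely appreciated!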

Consider the following approach:

with latest_orders as (
  select * from `mycompany.engagement.products_ordered`
  where true 
  qualify 1 = row_number() over(partition by id order by update_time desc)
), qualified_items as (
  select *, 
    array(
      select offset from t.ordered.list with offset 
      where true 
      qualify 1 = rank() over(order by item desc) 
    ) items
  from latest_orders t
)
select id, username, update_time,
  products.list[offset(items[offset(0)])] as max_product_ordered,
from qualified_items
where array_length(items) = 1    

If applied to the sample data in your question, the output is:
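As a rough cross-check of the logic outside BigQuery, here is a minimal Python sketch that mirrors the three steps of the query on the four sample rows from the question. It is only an emulation under stated assumptions, not BigQuery code: the names `ROWS`, `latest_per_id`, `max_offsets`, and `most_ordered` are hypothetical, and the nested `products.list` / `ordered.list` columns are assumed to be flattened into two parallel Python lists.

```python
from datetime import datetime

# Sample rows copied from the question. Assumption: the nested
# products.list and ordered.list columns become parallel lists.
ROWS = [
    {"id": "1234", "username": "a_turing",
     "products": ["Apple", "Orange", "Pear"], "ordered": [1, 0, 2],
     "update_time": datetime(2021, 8, 14, 20, 3, 22, 100846)},
    {"id": "5678", "username": "g_hopper",
     "products": ["Apple", "Orange", "Pear"], "ordered": [0, 2, 0],
     "update_time": datetime(2021, 8, 15, 9, 36, 48, 220464)},
    {"id": "1122", "username": "a_lovelace",
     "products": ["Apple", "Orange", "Pear"], "ordered": [0, 1, 0],
     "update_time": datetime(2021, 8, 15, 13, 59, 3, 441506)},
    {"id": "3344", "username": "v_nabokov",
     "products": ["Apple", "Orange", "Pear"], "ordered": [1, 0, 1],
     "update_time": datetime(2021, 8, 17, 17, 34, 53, 415406)},
]


def latest_per_id(rows):
    """Keep the newest row per id (the role of row_number() ... = 1)."""
    latest = {}
    for row in rows:
        kept = latest.get(row["id"])
        if kept is None or row["update_time"] > kept["update_time"]:
            latest[row["id"]] = row
    return list(latest.values())


def max_offsets(quantities):
    """Offsets whose quantity ties for the maximum (the role of rank() = 1)."""
    if not quantities:
        return []
    top = max(quantities)
    return [i for i, q in enumerate(quantities) if q == top]


def most_ordered(rows):
    """Return (username, product) for orders with a single top product."""
    result = []
    for row in latest_per_id(rows):
        offsets = max_offsets(row["ordered"])
        if len(offsets) == 1:  # the role of array_length(items) = 1
            result.append((row["username"], row["products"][offsets[0]]))
    return result


print(most_ordered(ROWS))
# [('a_turing', 'Pear'), ('g_hopper', 'Orange'), ('a_lovelace', 'Orange')]
```

The v_nabokov row drops out because Apple and Pear tie for the maximum, so two offsets are collected instead of one. This matches the three-row table the question asks for.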