Add nested column to BigQuery table, joining on value of another nested column in standard SQL
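I have a fairly complex dataset that is pulled into a BigQuery table by an Airflow DAG that cannot easily be adjusted.
The job loads the data into the table in the following format: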
| Line_item_id | Device |
|--------------|----------------|
| 123 | 202; 5; 100 |
| 124 | 100; 2 |
| 135 | 504; 202; 2 |
Currently, I am using this query (written in standard SQL in the BQ web UI) to split the device IDs into separate nested rows:
SELECT
Line_item_id,
ARRAY(SELECT AS STRUCT(SPLIT(RTRIM(Device,';'),'; '))) as Device,
Output:
| Line_item_id | Device |
|--------------|--------|
| 123 | 202 |
| | 203 |
| | 504 |
| 124 | 102 |
| | 2 |
| 135 | 102 |
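The difficulty I am facing is that I have a separate match table containing the device IDs and their corresponding names. I need to add the device names to the table above, as nested values next to their corresponding IDs.
The match table looks like this (with many more rows):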
| Device_id | Device_name |
|-----------|-------------|
| 202 | Smartphone |
| 203 | AppleTV |
| 504 | Laptop |
The ideal output I am looking for is:
| Line_item_id | Device_id | Device_name |
|--------------|-----------|-------------|
| 123 | 202 | Android |
| | 203 | AppleTV |
| | 504 | Laptop |
| 124 | 102 | iphone |
| | 2 | Unknown |
| 135 | 102 | iphone |
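If anyone knows how to achieve this, I would be grateful for the help.
Edit:
Gordon's solution worked very well. In addition, if anyone wants to re-nest the data afterwards (so that you end up with the same table plus the extra nested rows), this is the query I finally ended up with: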
select t.line_item_id, ARRAY_AGG(STRUCT(d as id, ot.name as name)) as device
from first_table t cross join
unnest(split(Device, '; ')) d join
match_table ot
on ot.id = d
GROUP BY line_item_id
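One thing to note about the query above: the inner join drops any device ID that has no row in the match table, whereas the ideal output in the question shows such IDs with the name Unknown. A minimal sketch of one way to keep them, assuming the match table stores its IDs as strings (as the join on ot.id = d implies). The inline sample rows, the alias pos, and the 'Unknown' fallback are assumptions for illustration, not part of the original query:

```sql
-- Sample data inlined so the statement is self-contained (illustrative only).
WITH first_table AS (
  SELECT 123 AS line_item_id, '202; 5; 100' AS Device UNION ALL
  SELECT 124, '100; 2' UNION ALL
  SELECT 135, '504; 202; 2'
),
match_table AS (
  SELECT '202' AS id, 'Smartphone' AS name UNION ALL
  SELECT '203', 'AppleTV' UNION ALL
  SELECT '504', 'Laptop'
)
SELECT
  t.line_item_id,
  ARRAY_AGG(
    -- IFNULL supplies a placeholder name when the LEFT JOIN finds no match.
    STRUCT(d AS id, IFNULL(ot.name, 'Unknown') AS name)
    ORDER BY pos  -- keep the devices in their original order within the string
  ) AS device
FROM first_table AS t
CROSS JOIN UNNEST(SPLIT(t.Device, '; ')) AS d WITH OFFSET AS pos
LEFT JOIN match_table AS ot
  ON ot.id = d
GROUP BY t.line_item_id;
```

With a LEFT JOIN every line item keeps all of its device IDs, so the nested array always has as many elements as the original semicolon-separated string.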
You can move the parsing logic to the from clause and then join in what you want:
select *
from (select 124 as line_item_id, '203; 100; 6; 2' as device) t cross join
unnest(split(device, '; ')) d join
other_table ot
on ot.device = d;
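For anyone who wants to try this end to end, here is a self-contained sketch of the same pattern using the sample rows from the question. It assumes the match table stores Device_id as INT64, which the question does not state; because SPLIT returns strings, the join key is cast. The CTE names and column names are placeholders:

```sql
-- Sample data inlined so the statement is self-contained (illustrative only).
WITH first_table AS (
  SELECT 123 AS line_item_id, '202; 5; 100' AS device UNION ALL
  SELECT 124, '100; 2' UNION ALL
  SELECT 135, '504; 202; 2'
),
other_table AS (
  SELECT 202 AS device_id, 'Smartphone' AS device_name UNION ALL
  SELECT 203, 'AppleTV' UNION ALL
  SELECT 504, 'Laptop'
)
SELECT
  t.line_item_id,
  ot.device_id,
  ot.device_name
FROM first_table AS t
CROSS JOIN UNNEST(SPLIT(t.device, '; ')) AS d
JOIN other_table AS ot
  -- SPLIT yields STRING values; SAFE_CAST returns NULL instead of failing
  -- on a token that is not a valid integer.
  ON ot.device_id = SAFE_CAST(d AS INT64)
ORDER BY t.line_item_id;
```

This produces one flat row per matched device. Because it is an inner join, line item 124 (devices 100 and 2, neither present in the three-row match table) does not appear in the result at all.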
You need to UNNEST the contents of the device array and then roll it back up after joining to the devices meta table:
select
line_item_id,
array_agg(struct(device_id as device_id, device_name as device_name)) as devices
from (
select
d.line_item_id,
device_id,
n.device_name
from `mydataset.basetable` d, unnest(d.device_ids) as device_id
left join `mydataset.devices_table` n on n.device_id = device_id
)
group by line_item_id
Hope this helps.
The following is for BigQuery Standard SQL. No GROUP BY is needed ...
#standardSQL
SELECT * EXCEPT(Device),
ARRAY(
SELECT AS STRUCT Device_id AS id, Device_name AS name
FROM UNNEST(SPLIT(REPLACE(Device, ' ', ''), ';')) Device_id WITH OFFSET
JOIN `project.dataset.devices`
USING(Device_id)
ORDER BY OFFSET
) Device
FROM `project.dataset.items`
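If applied to the sample data from your question, the result is:
| Line_item_id | Device.id | Device.name |
|--------------|-----------|-------------|
| 123 | 202 | Smartphone |
| | 5 | abc |
| | 100 | xyz |
| 124 | 100 | xyz |
| | 2 | zzz |
| 135 | 504 | Laptop |
| | 202 | Smartphone |
| | 2 | zzz |
FYI: I used the following data for testing: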
WITH `project.dataset.items` AS (
SELECT 123 Line_item_id, '202; 5; 100' Device UNION ALL
SELECT 124, '100; 2' UNION ALL
SELECT 135, '504; 202; 2'
), `project.dataset.devices` AS (
SELECT '202' Device_id, 'Smartphone' Device_name UNION ALL
SELECT '203', 'AppleTV' UNION ALL
SELECT '504', 'Laptop' UNION ALL
SELECT '5', 'abc' UNION ALL
SELECT '100', 'xyz' UNION ALL
SELECT '2', 'zzz'
)
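To make it easier to run, the test data and the query can be combined into a single statement. The sketch below does that with plain CTE names (items, devices) in place of the backtick-quoted ones, which is my own simplification. It also adds an RTRIM to guard against a trailing separator, as the query in the question did; whether the real data contains trailing separators is an assumption:

```sql
#standardSQL
WITH items AS (
  SELECT 123 AS Line_item_id, '202; 5; 100' AS Device UNION ALL
  SELECT 124, '100; 2' UNION ALL
  SELECT 135, '504; 202; 2'
),
devices AS (
  SELECT '202' AS Device_id, 'Smartphone' AS Device_name UNION ALL
  SELECT '203', 'AppleTV' UNION ALL
  SELECT '504', 'Laptop' UNION ALL
  SELECT '5', 'abc' UNION ALL
  SELECT '100', 'xyz' UNION ALL
  SELECT '2', 'zzz'
)
SELECT
  * EXCEPT(Device),
  ARRAY(
    SELECT AS STRUCT Device_id AS id, Device_name AS name
    -- RTRIM strips trailing ';' and ' ' characters, REPLACE removes the
    -- remaining spaces, and SPLIT then cuts the string on ';'.
    FROM UNNEST(SPLIT(REPLACE(RTRIM(Device, '; '), ' ', ''), ';')) AS Device_id
      WITH OFFSET AS pos
    JOIN devices USING (Device_id)
    ORDER BY pos  -- preserve the original order of the IDs
  ) AS Device
FROM items;
```

Against real tables, drop the WITH clause and point the two names at the actual items and devices tables.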