BigQuery Struct Aggregation
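I am working on an ETL job in BigQuery in which I am trying to reconcile data from sources that may conflict. I first used array_agg(distinct my_column ignore nulls) to find out where reconciliation is needed. Next, I need to prioritize the data for each column based on its source.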
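I thought I could use array_agg(struct(data_source, my_column)) and then easily extract the preferred source's data for a given column. However, with this approach the data is not aggregated into a struct but into an array of structs.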
Consider the simplified example below, where I would prefer to take job_title from HR and dietary_pref from Canteen:
with data_set as (
select 'John' as employee, 'Senior Manager' as job_title, 'vegan' as dietary_pref, 'HR' as source
union all
select 'John' as employee, 'Manager' as job_title, 'vegetarian' as dietary_pref, 'Canteen' as source
union all
select 'Mary' as employee, 'Marketing Director' as job_title, 'pescatarian' as dietary_pref, 'HR' as source
union all
select 'Mary' as employee, 'Marketing Manager' as job_title, 'gluten-free' as dietary_pref, 'Canteen' as source
)
select employee,
array_agg(struct(source, job_title)) as job_title,
array_agg(struct(source, dietary_pref)) as dietary_pref,
from data_set
group by employee
The job title data I get for John is:
[{'source':'HR', 'job_title':'Senior Manager'}, {'source': 'Canteen', 'job_title':'Manager'}]
whereas what I am trying to achieve is:
[{'HR' : 'Senior Manager', 'Canteen' : 'Manager'}]
With a struct output, I hoped I could access the preferred source easily using my_struct.my_preferred_source. In this particular case, I would like to call job_title.HR and dietary_pref.Canteen.
So in pseudo-SQL, I imagine I would write:
select employee,
AGGREGATE_JOB_TITLE_AS_STRUCT(source, job_title).HR as job_title,
AGGREGATE_DIETARY_PREF_AS_STRUCT(source, dietary_pref).Canteen as dietary_pref,
from data_set group by employee
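The output would be:
employee  job_title           dietary_pref
John      Senior Manager      vegetarian
Mary      Marketing Director  gluten-free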
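I would appreciate help solving this. Perhaps this is the wrong approach altogether, but given the more complex data set I am working with, I thought it would be the preferred one (even though it failed).
I am open to alternatives. Please advise. Thanks.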
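Note: I edited this post after Mikhail's answer, which solved my problem using a slightly different approach than I expected, and added more detail about my intent to use a single struct per employee.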
Consider the approach below:
select employee,
array_agg(struct(source as job_source, job_title) order by if(source = 'HR', 1, 2) limit 1)[offset(0)].*,
array_agg(struct(source as dietary_source, dietary_pref) order by if(source = 'HR', 2, 1) limit 1)[offset(0)].*
from data_set
group by employee
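If applied to the sample data in your question, the output is:
employee  job_source  job_title           dietary_source  dietary_pref
John      HR          Senior Manager      Canteen         vegetarian
Mary      HR          Marketing Director  Canteen         gluten-free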
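The order by ... limit 1 pattern above keeps only the top-ranked row per employee, so it can probably be stretched to more than two sources and to missing values. Below is a minimal sketch under assumptions that are not in the question: a third source named 'Payroll', a null job_title from HR, and the priority numbers in the case expressions are all hypothetical and only there for illustration.

```sql
-- Hypothetical sample data: 'Payroll' and the null values are assumptions
with data_set as (
  select 'John' as employee, cast(null as string) as job_title, 'vegan' as dietary_pref, 'HR' as source
  union all
  select 'John', 'Manager', 'vegetarian', 'Canteen'
  union all
  select 'John', 'Senior Manager', cast(null as string), 'Payroll'
)
select employee,
  -- rank sources per column; ignore nulls lets a lower-priority source
  -- fill in when the preferred one has no value
  array_agg(job_title ignore nulls
    order by case source when 'HR' then 1 when 'Payroll' then 2 else 3 end
    limit 1)[safe_offset(0)] as job_title,
  array_agg(dietary_pref ignore nulls
    order by case source when 'Canteen' then 1 when 'HR' then 2 else 3 end
    limit 1)[safe_offset(0)] as dietary_pref
from data_set
group by employee
```

With these sample rows, John's job_title would fall through to the Payroll value because the HR value is null, while dietary_pref still comes from Canteen. Whether falling back is desirable depends on your reconciliation rules; dropping ignore nulls would keep the preferred source's null instead.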
Update:
To get the output described in your clarification, use the query below
select employee,
array_agg(job_title order by if(source = 'HR', 1, 2) limit 1)[offset(0)] as job_title,
array_agg(dietary_pref order by if(source = 'HR', 2, 1) limit 1)[offset(0)] as dietary_pref
from data_set
group by employee
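with output:
employee  job_title           dietary_pref
John      Senior Manager      vegetarian
Mary      Marketing Director  gluten-free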
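If a struct keyed by source, as sketched in the question's pseudo-SQL, is still wanted, one possible way to get it is conditional aggregation inside a struct constructor. This is only a sketch, not part of the answer above. It assumes the source names are known in advance (struct field names cannot be derived from data at query time) and that each employee has at most one row per source. The per_employee CTE name is made up for the example.

```sql
with data_set as (
  select 'John' as employee, 'Senior Manager' as job_title, 'vegan' as dietary_pref, 'HR' as source
  union all
  select 'John', 'Manager', 'vegetarian', 'Canteen'
  union all
  select 'Mary', 'Marketing Director', 'pescatarian', 'HR'
  union all
  select 'Mary', 'Marketing Manager', 'gluten-free', 'Canteen'
),
-- per_employee is a hypothetical helper CTE: one struct per column,
-- with one field per hard-coded source name
per_employee as (
  select employee,
    struct(
      max(if(source = 'HR', job_title, null)) as HR,
      max(if(source = 'Canteen', job_title, null)) as Canteen
    ) as job_title,
    struct(
      max(if(source = 'HR', dietary_pref, null)) as HR,
      max(if(source = 'Canteen', dietary_pref, null)) as Canteen
    ) as dietary_pref
  from data_set
  group by employee
)
select employee,
  job_title.HR as job_title,
  dietary_pref.Canteen as dietary_pref
from per_employee
```

The trade-off is that every source has to be spelled out per column, and max silently picks one value if a source appears more than once for an employee. The array_agg ... order by ... limit 1 form avoids both, at the cost of not exposing the non-preferred values.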