BigQuery Struct Aggregation

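I'm working on an ETL job in BigQuery in which I need to reconcile data from sources that may conflict. I first use array_agg(distinct my_column ignore nulls) to find out where reconciliation is needed; next, I need to prioritize the data for each column based on its source.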
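For illustration, that detection step could look something like the sketch below. The sample rows and column names are assumptions made up for this example, not the real dataset; the `having` clause is one possible way to keep only the employees whose sources disagree.

```sql
-- Hypothetical sample rows; names and values are illustrative assumptions.
with data_set as (
    select 'John' as employee, 'Senior Manager' as job_title, 'HR' as source
    union all
    select 'John', 'Manager', 'Canteen'
    union all
    select 'Mary', 'Director', 'HR'
    union all
    select 'Mary', 'Director', 'Canteen'
)
select employee,
       -- collect every distinct non-null value reported for this employee
       array_agg(distinct job_title ignore nulls) as job_titles
from data_set
group by employee
-- keep only employees for whom the sources report more than one value
having count(distinct job_title) > 1
```

With these sample rows only John has conflicting job titles, so only his row needs reconciling.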

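I thought I would use array_agg(struct(data_source, my_column)) and hoped I could then easily extract the preferred source's data for a given column. However, with this approach the data is not aggregated into a single struct; it ends up as an array of structs.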

Consider the simplified example below, where I would prefer to take job_title from HR and dietary_pref from Canteen:

with data_set as (
    select 'John' as employee, 'Senior Manager' as job_title, 'vegan' as dietary_pref, 'HR' as source
    union all
    select 'John' as employee, 'Manager' as job_title, 'vegetarian' as dietary_pref, 'Canteen' as source
    union all
    select 'Mary' as employee, 'Marketing Director' as job_title, 'pescatarian' as dietary_pref, 'HR' as source
    union all
    select 'Mary' as employee, 'Marketing Manager' as job_title, 'gluten-free' as dietary_pref, 'Canteen' as source

)

select employee,
       array_agg(struct(source, job_title)) as job_title,
       array_agg(struct(source, dietary_pref)) as dietary_pref,
from data_set
group by employee

The job title data I get for John is: [{'source':'HR', 'job_title':'Senior Manager'}, {'source': 'Canteen', 'job_title':'Manager'}], whereas what I'm trying to achieve is: [{'HR' : 'Senior Manager', 'Canteen' : 'Manager'}]

With a struct as output, I was hoping to access the preferred source easily using my_struct.my_preferred_source. In this particular case, I would call job_title.HR and dietary_pref.Canteen.

So in pseudo-SQL, I imagine I would write something like:

select employee,
        AGGREGATE_JOB_TITLE_AS_STRUCT(source, job_title).HR  as job_title,
        AGGREGATE_DIETARY_PREF_AS_STRUCT(source, dietary_pref).Canteen as dietary_pref, 
from data_set group by employee

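The output would be:

employee | job_title          | dietary_pref
John     | Senior Manager     | vegetarian
Mary     | Marketing Director | gluten-free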

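I would appreciate help solving this. Maybe this is entirely the wrong approach, but given the more complex dataset I'm actually working with, I thought it would be the preferred one (even though it failed).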

I'm open to alternatives. Please advise. Thanks.

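Note: I edited this post after Mikhail's answer, which solved my problem using a slightly different approach than I expected, and added more detail about my intention to use a single struct per employee.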

Consider the following:

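select employee,
  -- job title: rank the HR row first, keep only that row, then expand the struct into columns
  array_agg(struct(source as job_source, job_title) order by if(source = 'HR', 1, 2) limit 1)[offset(0)].*,
  -- dietary preference: rank the HR row last, so the Canteen row wins
  array_agg(struct(source as dietary_source, dietary_pref) order by if(source = 'HR', 2, 1) limit 1)[offset(0)].*
from data_set
group by employee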

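If applied to the sample data in your question, the output is:

employee | job_source | job_title          | dietary_source | dietary_pref
John     | HR         | Senior Manager     | Canteen        | vegetarian
Mary     | HR         | Marketing Director | Canteen        | gluten-free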
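If a struct keyed by source, as described in the question, is still wanted, one possible sketch is conditional aggregation inside a struct constructor. This is not part of the approach above. It assumes the source names ('HR', 'Canteen') are known up front and that each employee has at most one row per source; the CTE name per_employee is made up for illustration.

```sql
with data_set as (
    select 'John' as employee, 'Senior Manager' as job_title, 'vegan' as dietary_pref, 'HR' as source
    union all
    select 'John', 'Manager', 'vegetarian', 'Canteen'
    union all
    select 'Mary', 'Marketing Director', 'pescatarian', 'HR'
    union all
    select 'Mary', 'Marketing Manager', 'gluten-free', 'Canteen'
),
-- per_employee is a hypothetical name: one row per employee, one struct per attribute
per_employee as (
    select employee,
           struct(
               -- max() ignores nulls, so it returns the value from the matching source
               max(if(source = 'HR', job_title, null)) as HR,
               max(if(source = 'Canteen', job_title, null)) as Canteen
           ) as job_title,
           struct(
               max(if(source = 'HR', dietary_pref, null)) as HR,
               max(if(source = 'Canteen', dietary_pref, null)) as Canteen
           ) as dietary_pref
    from data_set
    group by employee
)
select employee,
       job_title.HR as job_title,            -- preferred source for job title
       dietary_pref.Canteen as dietary_pref  -- preferred source for dietary preference
from per_employee
```

The trade-off is that every source has to be listed by hand as a struct field, because BigQuery struct field names are fixed when the query is written and cannot be derived from the data.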

Update:

Use the query below for the clarified output:

select employee,
  -- same ordering trick, but aggregating the bare value instead of a struct
  array_agg(job_title order by if(source = 'HR', 1, 2) limit 1)[offset(0)] as job_title,
  array_agg(dietary_pref order by if(source = 'HR', 2, 1) limit 1)[offset(0)] as dietary_pref
from data_set
group by employee

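with output:

employee | job_title          | dietary_pref
John     | Senior Manager     | vegetarian
Mary     | Marketing Director | gluten-free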
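One case the sample data does not cover is a preferred source that has a row but a null value. A hedged variation, assuming you would then want to fall back to the next source, adds ignore nulls to the aggregation and uses safe_offset so that an employee with no value at all yields null. The sample rows below are invented to show the fallback.

```sql
-- Hypothetical rows: HR has no job title for John, so Canteen's value should be used.
with data_set as (
    select 'John' as employee, cast(null as string) as job_title, 'vegan' as dietary_pref, 'HR' as source
    union all
    select 'John', 'Manager', 'vegetarian', 'Canteen'
)
select employee,
  -- drop null values before ranking, so the best-ranked non-null value is kept
  array_agg(job_title ignore nulls order by if(source = 'HR', 1, 2) limit 1)[safe_offset(0)] as job_title,
  array_agg(dietary_pref ignore nulls order by if(source = 'Canteen', 1, 2) limit 1)[safe_offset(0)] as dietary_pref
from data_set
group by employee
```

Here John's job_title comes from Canteen because the HR value is null, while dietary_pref still comes from Canteen as the preferred source. With more than two sources, the if(...) expression can be swapped for a case expression that assigns each source its own rank.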