bigquerqy sql link 群体之间的共同点 grid_id PART II

bigquerqy sql link a common grid_id between groups PART II

从第 1 部分获得以下结果。

with t1 as
(
Select 'obrien-t j' lname_forename_long,11 grid_id_ct ,'grid.416153.4' grid_id,2 name_seq ,1 group_seq UNION ALL
Select 'obrien-t j',1,'grid.1002.3',1,1 UNION ALL
Select 'obrien-terence',2,'grid.1008.9',1,2 UNION ALL
Select 'obrien-terence',4,'grid.416153.4',2,2 UNION ALL
Select 'obrien-terence',1,'grid.484852.7',3,2 UNION ALL
Select 'obrien-terence j',14,'grid.1002.3',1,3 UNION ALL
Select 'obrien-terence j',25,'grid.1008.9',2,3 UNION ALL
Select 'obrien-terence j',3,'grid.1019.9',3,3 UNION ALL
Select 'obrien-terence j',9,'grid.1623.6',4,3 UNION ALL
Select 'obrien-terence j',40,'grid.237081.f',5,3 UNION ALL
Select 'obrien-terence j',1,'grid.267362.4',6,3 UNION ALL
Select 'obrien-terence j',2,'grid.414094.c',7,3 UNION ALL
Select 'obrien-terence j',1,'grid.416060.5',8,3 UNION ALL
Select 'obrien-terence j',36,'grid.416153.4',9,3 UNION ALL
Select 'obrien-terence j',4,'grid.453219.8',10,3 UNION ALL
Select 'obrien-terence j',3,'grid.454055.5',11,3 UNION ALL
Select 'obrien-terence j',6,'grid.474069.8',12,3 UNION ALL
Select 'obrien-terence j',13,'grid.481253.9',13,3 UNION ALL
Select 'obrien-terence john',1,'grid.1002.3',1,4 UNION ALL
Select 'obrien-terence john',1,'grid.1008.9',2,4 UNION ALL
Select 'obrien-terence john',1,'grid.1623.6',3,4 UNION ALL
Select 'obrien-terence john',1,'grid.237081.f',4,4 UNION ALL
Select 'obrien-terence john',2,'grid.416153.4',5,4 UNION ALL
Select 'obrien-terrence',2,'grid.416153.4',1,5 UNION ALL
Select 'obrien-terrence j',1,'grid.416153.4',1,6 UNION ALL
Select 'obrien-terry',1,'grid.137628.9',1,7 UNION ALL
Select 'obrien-terry',2,'grid.237081.f',2,7 UNION ALL
Select 'obrien-terry',1,'grid.267362.4',3,7 UNION ALL
Select 'obrien-timothy',1,'grid.496867.2',1,8 UNION ALL
Select 'obrien-timothy',3,'grid.6142.1',2,8 
) 
select *, if(count(*) over win > 0, string_agg('' || group_seq) over win, '') links
from t1
window win as (partition by grid_id)    ;

以上不包括我认为可能需要的计数列。

lname_forename_long grid_id_ct grid_id name_seq group_seq links link_counts
obrien-t j 11 grid.416153.4 2 1 1,2,3,4,5,6 6
obrien-t j 1 grid.1002.3 1 1 1,3,4 3
obrien-terence 4 grid.416153.4 2 2 1,2,3,4,5,6 6
obrien-terence 2 grid.1008.9 1 2 2,3,4 3
obrien-terence 1 grid.484852.7 3 2 2 1
obrien-terence j 36 grid.416153.4 9 3 1,2,3,4,5,6 6
obrien-terence j 14 grid.1002.3 1 3 1,3,4 3
obrien-terence j 25 grid.1008.9 2 3 2,3,4 3
obrien-terence j 40 grid.237081.f 5 3 3,4,7 3
obrien-terence j 9 grid.1623.6 4 3 3,4 2
obrien-terence j 1 grid.267362.4 6 3 3,7 2
obrien-terence j 3 grid.1019.9 3 3 3 1
obrien-terence j 2 grid.414094.c 7 3 3 1
obrien-terence j 1 grid.416060.5 8 3 3 1
obrien-terence j 4 grid.453219.8 10 3 3 1
obrien-terence j 3 grid.454055.5 11 3 3 1
obrien-terence j 6 grid.474069.8 12 3 3 1
obrien-terence j 13 grid.481253.9 13 3 3 1
obrien-terence john 2 grid.416153.4 5 4 1,2,3,4,5,6 6
obrien-terence john 1 grid.1002.3 1 4 1,3,4 3
obrien-terence john 1 grid.1008.9 2 4 2,3,4 3
obrien-terence john 1 grid.237081.f 4 4 3,4,7 3
obrien-terence john 1 grid.1623.6 3 4 3,4 2
obrien-terrence 2 grid.416153.4 1 5 1,2,3,4,5,6 6
obrien-terrence j 1 grid.416153.4 1 6 1,2,3,4,5,6 6
obrien-terry 2 grid.237081.f 2 7 3,4,7 3
obrien-terry 1 grid.267362.4 3 7 3,7 2
obrien-terry 1 grid.137628.9 1 7 7 1
obrien-timothy 3 grid.6142.1 2 8 8 1
obrien-timothy 1 grid.496867.2 1 8 8 1

第二部分取所有最大的名字(link_counts)

lname_forename_long grid_id_ct grid_id name_seq group_seq links link_counts
obrien-t j 11 grid.416153.4 2 1 1,2,3,4,5,6 6
obrien-terence 4 grid.416153.4 2 2 1,2,3,4,5,6 6
obrien-terence j 36 grid.416153.4 9 3 1,2,3,4,5,6 6
obrien-terence john 2 grid.416153.4 5 4 1,2,3,4,5,6 6
obrien-terrence 2 grid.416153.4 1 5 1,2,3,4,5,6 6
obrien-terrence j 1 grid.416153.4 1 6 1,2,3,4,5,6 6

添加不在max(link_counts) = 6中的名字 选择最高 grid_id_ct 的 nmaes 给予。

lname_forename_long grid_id_ct grid_id name_seq group_seq links link_counts
obrien-timothy 3 grid.6142.1 2 8 8 1
obrien-terry 2 grid.237081.f 2 7 3,4,7 3
obrien-terrence j 1 grid.416153.4 1 6 1,2,3,4,5,6 6
obrien-terrence 2 grid.416153.4 1 5 1,2,3,4,5,6 6
obrien-terence john 2 grid.416153.4 5 4 1,2,3,4,5,6 6
obrien-terence j 36 grid.416153.4 9 3 1,2,3,4,5,6 6
obrien-terence 4 grid.416153.4 2 2 1,2,3,4,5,6 6
obrien-t j 11 grid.416153.4 2 1 1,2,3,4,5,6 6

如果任何新名称可以 link 到 link_counts = 6 更新可以相交的 link 列。

lname_forename_long grid_id_ct grid_id name_seq group_seq links link_counts is_intersect_links
obrien-timothy 3 grid.6142.1 2 8 8 1
obrien-terry 2 grid.237081.f 2 7 3,4,7 3 3,4
obrien-terrence j 1 grid.416153.4 1 6 1,2,3,4,5,6 6 3,4
obrien-terrence 2 grid.416153.4 1 5 1,2,3,4,5,6 6 3,4
obrien-terence john 2 grid.416153.4 5 4 1,2,3,4,5,6 6 3,4
obrien-terence j 36 grid.416153.4 9 3 1,2,3,4,5,6 6 3,4
obrien-terence 4 grid.416153.4 2 2 1,2,3,4,5,6 6 3,4
obrien-t j 11 grid.416153.4 2 1 1,2,3,4,5,6 6 3,4

因为我们现在可以 link obrien-terry 到另一个 obrien-t..... 名称更新他的 grid_id 与 obrien-t..... 网格相同.416153.4

lname_forename_long grid_id_ct grid_id name_seq group_seq links link_counts is_intersect_links is_merged
obrien-timothy 3 grid.6142.1 2 8 8 1 '' FALSE
obrien-terry 2 grid.416153.4 2 7 3,4,7 3 3,4 TRUE
obrien-terrence j 1 grid.416153.4 1 6 1,2,3,4,5,6 6 3,4 FALSE
obrien-terrence 2 grid.416153.4 1 5 1,2,3,4,5,6 6 3,4 FALSE
obrien-terence john 2 grid.416153.4 5 4 1,2,3,4,5,6 6 3,4 FALSE
obrien-terence j 36 grid.416153.4 9 3 1,2,3,4,5,6 6 3,4 FALSE
obrien-terence 4 grid.416153.4 2 2 1,2,3,4,5,6 6 3,4 FALSE
obrien-t j 11 grid.416153.4 2 1 1,2,3,4,5,6 6 3,4 FALSE

我还加了is_merged表示更新了一个grid_id。 我已经添加了多个步骤以使其清楚,但它可能是一个或两个步骤。 我已经尝试了多种方法来使用 cartesain joins、intersect distinct 来找到名称之间的公共网格,但它们都不够用。 简单来说,我试图找出我有多少独特的 obriens,基于能够将它们分配给一个共同的 grid_id,这基本上是一个地址。

我不确定所有中间步骤是否过于复杂。我不需要我只需要以 .

结尾的所有元数据列
lname_forename_long grid_id is_merged
obrien-timothy grid.6142.1 FALSE
obrien-terry grid.416153.4 TRUE
obrien-terrence j grid.416153.4 FALSE
obrien-terrence grid.416153.4 FALSE
obrien-terence john grid.416153.4 FALSE
obrien-terence j grid.416153.4 FALSE
obrien-terence grid.416153.4 FALSE
obrien-t j grid.416153.4 FALSE

我为塞缪尔所做的努力。

with t2 as (
with t1 as
(
Select "o'brien-t j" lname,11 grid_ct ,'grid.416153.4' grid_id,2 name_seq ,1 group_seq ,'1,2,3,4,5,6' links UNION ALL
Select "o'brien-terence",2,'grid.1008.9',1,2,'' UNION ALL
Select "o'brien-terence",4,'grid.416153.4',2,2,'' UNION ALL
Select "o'brien-terence",1,'grid.484852.7',3,2,'1,2,3,4,5,6' UNION ALL
Select "o'brien-terence j",14,'grid.1002.3',1,3,'3,7' UNION ALL
Select "o'brien-terence j",25,'grid.1008.9',2,3,'' UNION ALL
Select "o'brien-terence j",3,'grid.1019.9',3,3,'' UNION ALL
Select "o'brien-terence j",9,'grid.1623.6',4,3,'' UNION ALL
Select "o'brien-terence j",40,'grid.237081.f',5,3,'' UNION ALL
Select "o'brien-terence j",1,'grid.267362.4',6,3,'' UNION ALL
Select "o'brien-terence j",2,'grid.414094.c',7,3,'' UNION ALL
Select "o'brien-terence j",1,'grid.416060.5',8,3,'' UNION ALL
Select "o'brien-terence j",36,'grid.416153.4',9,3,'' UNION ALL
Select "o'brien-terence j",4,'grid.453219.8',10,3,'' UNION ALL
Select "o'brien-terence j",3,'grid.454055.5',11,3,'' UNION ALL
Select "o'brien-terence j",6,'grid.474069.8',12,3,'1,2,3,4,5,6' UNION ALL
Select "o'brien-terence j",13,'grid.481253.9',13,3,'3,4' UNION ALL
Select "o'brien-terence john",1,'grid.1002.3',1,4,'' UNION ALL
Select "o'brien-terence john",1,'grid.1008.9',2,4,'' UNION ALL
Select "o'brien-terence john",1,'grid.1623.6',3,4,'' UNION ALL
Select "o'brien-terence john",1,'grid.237081.f',4,4,'3,4' UNION ALL
Select "o'brien-terence john",2,'grid.416153.4',5,4,'1,2,3,4,5,6' UNION ALL
Select "o'brien-terrence",2,'grid.416153.4',1,5,'1,2,3,4,5,6' UNION ALL
Select "o'brien-terrence j",1,'grid.416153.4',1,6,'1,2,3,4,5,6' UNION ALL
Select "o'brien-terry",1,'grid.137628.9',1,7,'' UNION ALL
Select "o'brien-terry",2,'grid.237081.f',2,7,'3,7' UNION ALL
Select "o'brien-terry",1,'grid.267362.4',3,7,'' UNION ALL
Select "o'brien-timothy",1,'grid.496867.2',1,8,'' UNION ALL
Select "o'brien-timothy",3,'grid.6142.1',2,8,''
)
 select distinct a.lname, a.grid_id
 from t1 a, t1 b
 where a.lname <> b.lname
 and a.grid_id = b.grid_id
)
 select  distinct  lname,
 grid_id ,
 DENSE_RANK() OVER
                   (
                   --PARTITION BY a.lname_init1
                   ORDER BY grid_id
                   )  seq_num,
 from t2
)
select
'matched' is_matched,
lname
,grid_id
,seq_num
from t3
group by lname  ,grid_id,seq_num
having seq_num = (select max(seq_num )x from t3)
------------------------------------------
union all
--intersect distinct
------------------------------------------
select
'not_matched' is_matched,
lname
,grid_id
,seq_num
from t3
group by lname  ,grid_id,seq_num
having seq_num != (select max(seq_num )x from t3);

我的结果。我不知道如何将 o'brien-terry 合并到匹配的组中。它还错过了 o'brien-timothy

is_matched lname grid_id seq_num
not_matched o'brien-terence j grid.1002.3 1
not_matched o'brien-terence john grid.1002.3 1
not_matched o'brien-terence grid.1008.9 2
not_matched o'brien-terence j grid.1008.9 2
not_matched o'brien-terence john grid.1008.9 2
not_matched o'brien-terence j grid.1623.6 3
not_matched o'brien-terence john grid.1623.6 3
not_matched o'brien-terence j grid.237081.f 4
not_matched o'brien-terence john grid.237081.f 4
not_matched o'brien-terry grid.237081.f 4
not_matched o'brien-terence j grid.267362.4 5
not_matched o'brien-terry grid.267362.4 5
matched o'brien-t j grid.416153.4 6
matched o'brien-terence grid.416153.4 6
matched o'brien-terence j grid.416153.4 6
matched o'brien-terence john grid.416153.4 6
matched o'brien-terrence grid.416153.4 6
matched o'brien-terrence j grid.416153.4 6

塞缪尔结果。

lname_forename_long grid_id_ct grid_id name_seq group_seq links link_counts is_intersect_links
obrien-t j 1 grid.1002.3 1 1 1,3,4 3 1,3,4
obrien-terence 2 grid.1008.9 1 2 2,3,4 3 2,3,4
obrien-terence j 14 grid.1002.3 1 3 1,3,4 3 1,3,4
obrien-terence john 1 grid.1002.3 1 4 1,3,4 3 1,3,4
obrien-terry 2 grid.237081.f 2 7 3,4,7 3
obrien-timothy 1 grid.496867.2 1 8 8 1

考虑以下方法

with temp as (
  select *, array_length(split(links)) link_counts, 
    array_length(split(links)) < max(array_length(split(links))) over() merge_candidate
  from (
    select *, if(count(*) over win > 1, string_agg('' || group_seq) over win, '') links
    from t1
    window win as (partition by grid_id)
  )
  qualify 1 = row_number() over(partition by group_seq order by array_length(split(links)) desc, grid_id_ct desc)
)
select lname_forename_long, grid_id, merge_candidate as is_merged  
from temp where not merge_candidate
union all
select lname_forename_long, ifnull(merged_grid_id, grid_id), if(merged_grid_id is null, false, true) 
from (
  select any_value(t1).*, 
    any_value(( select t2.grid_id
      from unnest(split(t1.links)) link
      join unnest(split(t2.links)) link
      using(link)
      limit 1
    )) as merged_grid_id
  from (select * from temp where merge_candidate) t1
  cross join (select * from temp where not merge_candidate) t2
  group by to_json_string(t1)
)
order by grid_id desc, lname_forename_long desc    

如果应用于您问题中的示例数据 - 输出为