I want to de-dupe records in BigQuery with max column value on specific column with expression

Question

company  | email | phone | website | address
Amar CO LLC | amar@gmail.com | 123 | NULL | India
Amar CO | amar@gmail.com | NULL | NULL | IND
Stacks CO | stack@gmail.com | 910 | stacks.com | United Kingdom
Stacks CO LLC | stack@gmail.com | NULL | NULL | UK

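I want to drop the company names that end in CO LLC and keep the name Amar CO instead, but take all column values from the Amar CO LLC record, because it has the fewest NULL values (the most populated columns).

In short: de-dupe the records, remove company names ending with or matching LLC (case-insensitive), but of the two records keep the values from the one with the most populated columns.

Expected output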

Amar CO | amar@gmail.com | 123 | NULL | India
Stacks CO | stack@gmail.com | 910 | stacks.com | United Kingdom

Answer 1

You need group by and replace, as follows:

select replace(company,' LLC','') as company, max(email) as email, max(phone) as phone,
       max(website) as website, max(address) as address
  from your_table t
group by replace(company,' LLC','')
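The question asks for case-insensitive matching, while replace is case-sensitive. A minimal sketch of the same group-by idea, assuming the suffix is always a trailing " LLC" in any letter case, is shown below. The your_table CTE only reproduces the sample rows from the question so the query can be run on its own. The regular expression is an assumption about how the names are formatted.

```sql
with your_table as (
  -- sample rows copied from the question
  select 'Amar CO LLC' as company, 'amar@gmail.com' as email, 123 as phone,
         cast(null as string) as website, 'India' as address union all
  select 'Amar CO', 'amar@gmail.com', null, null, 'IND' union all
  select 'Stacks CO', 'stack@gmail.com', 910, 'stacks.com', 'United Kingdom' union all
  select 'Stacks CO LLC', 'stack@gmail.com', null, null, 'UK'
)
select regexp_replace(company, r'(?i)\s+LLC$', '') as company, -- strip a trailing LLC, any case
       max(email)   as email,
       max(phone)   as phone,
       max(website) as website,
       max(address) as address
  from your_table
 group by 1  -- group by the normalized name (first select column)
```

Note that max picks the greatest non-NULL value per column, so the values in one output row may come from different input rows.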

I can see that you need the data from both rows, but the LLC record should take precedence (India, IND --> India). In that case you can use the following:

select t.company, 
       coalesce(tt.email,t.email) as email,       -- prefer the LLC row (tt), fall back to t
       coalesce(tt.phone,t.phone) as phone,
       coalesce(tt.website,t.website) as website,
       coalesce(tt.address,t.address) as address
  from your_table t join your_table tt 
    on concat(t.company,' LLC') = tt.company

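If you want to update the data and then delete the redundant record itself, I would suggest the following delete and update:

delete from your_table where company = 'Amar CO';

update your_table t
set company = replace(t.company,' LLC','') -- or use 'Amar CO'
where t.company = 'Amar CO LLC';

Update:

If you want to give precedence to the record with the fewest NULL values, you can use the following query: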

select t.company,
       -- take the LLC row (tt) when it has fewer NULLs, otherwise the plain row (t)
       case when tt_nulls < t_nulls then ttemail else temail end as email,
       case when tt_nulls < t_nulls then ttphone else tphone end as phone,
       case when tt_nulls < t_nulls then ttwebsite else twebsite end as website,
       case when tt_nulls < t_nulls then ttaddress else taddress end as address
from    
(select t.company, 
        count(case when t.email IS NULL THEN 1 end) over (partition by t.company) 
        + count(case when t.phone IS NULL THEN 1 end) over (partition by t.company) 
        + count(case when t.website IS NULL THEN 1 end) over (partition by t.company) 
        + count(case when t.address IS NULL THEN 1 end) over (partition by t.company)  
        as t_nulls,
        count(case when tt.email IS NULL THEN 1 end) over (partition by t.company) 
        + count(case when tt.phone IS NULL THEN 1 end) over (partition by t.company) 
        + count(case when tt.website IS NULL THEN 1 end) over (partition by t.company) 
        + count(case when tt.address IS NULL THEN 1 end) over (partition by t.company)  
        as tt_nulls,
        t.email as temail, 
        t.phone as tphone,
        t.website as twebsite,
        t.address as taddress,
        tt.email as ttemail, 
        tt.phone as ttphone,
        tt.website as ttwebsite,
        tt.address as ttaddress
   from your_table t join your_table tt 
     on concat(t.company,' LLC') = tt.company) t
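
As an alternative sketch (not part of the original answer), the same "fewest NULLs wins" rule can be written without a self-join by ranking the rows of each normalized company name with row_number and qualify. The your_table CTE only reproduces the sample rows from the question. company_key is a hypothetical helper column, and the regular expression assumes the suffix is always a trailing " LLC".

```sql
with your_table as (
  -- sample rows copied from the question
  select 'Amar CO LLC' as company, 'amar@gmail.com' as email, 123 as phone,
         cast(null as string) as website, 'India' as address union all
  select 'Amar CO', 'amar@gmail.com', null, null, 'IND' union all
  select 'Stacks CO', 'stack@gmail.com', 910, 'stacks.com', 'United Kingdom' union all
  select 'Stacks CO LLC', 'stack@gmail.com', null, null, 'UK'
),
normalized as (
  -- company_key: company name with a trailing LLC (any case) removed
  select regexp_replace(company, r'(?i)\s+LLC$', '') as company_key, *
    from your_table
)
select company_key as company, email, phone, website, address
  from normalized
 where true
qualify row_number() over (
          partition by company_key
          -- number of NULL columns in the row; the smallest count ranks first
          order by if(email is null, 1, 0) + if(phone is null, 1, 0)
                 + if(website is null, 1, 0) + if(address is null, 1, 0)
        ) = 1
```

This keeps one whole row per company and does not fill its remaining NULLs from the other row. On the sample data it should keep the Amar CO LLC row (one NULL) and the Stacks CO row (no NULLs), which matches the expected output in the question.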

Answer 2

to give precedence to the record having minimum null values ...

The query below is for BigQuery Standard SQL (query #1):

#standardSQL
select 
  array_agg(t 
    order by array_length(regexp_extract_all(to_json_string(t), ':null')) 
    limit 1
  )[offset(0)].* 
  replace(regexp_replace(company, r'(?i)CO LLC', 'CO') as company) 
from `project.dataset.table` t
group by company

If applied to the sample data from your question, the output is: [output not preserved]
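
To see how the ordering expression in query #1 ranks rows, the sketch below prints the JSON form of each row and the resulting NULL count. It assumes the sample rows from the question. The sample CTE and the column names row_json and null_count are made up for illustration.

```sql
with sample as (
  -- sample rows copied from the question
  select 'Amar CO LLC' as company, 'amar@gmail.com' as email, 123 as phone,
         cast(null as string) as website, 'India' as address union all
  select 'Amar CO', 'amar@gmail.com', null, null, 'IND' union all
  select 'Stacks CO', 'stack@gmail.com', 910, 'stacks.com', 'United Kingdom' union all
  select 'Stacks CO LLC', 'stack@gmail.com', null, null, 'UK'
)
select company,
       to_json_string(t) as row_json,  -- NULL fields are rendered as "field":null
       array_length(regexp_extract_all(to_json_string(t), ':null')) as null_count
  from sample t
```

One caveat: the pattern counts every occurrence of the text :null in the JSON string, so a string value that itself contains :null would be counted as well.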

If you want to fill every field using values from all of the records, you can use the query below (query #2):

select regexp_replace(company, r'(?i)CO LLC', 'CO') as company,
  max(email) email,
  max(phone) phone,
  max(website) website,
  max(address) address
from `project.dataset.table`
group by company

And finally, if you still want to give precedence to the record with the fewest null values, but fill its remaining nulls with values from the other rows, use the query below (query #3):

select company, 
  ifnull(email, max_email) email,
  ifnull(phone, max_phone) phone,
  ifnull(website, max_website) website,
  ifnull(address, max_address) address
from (
  select array_agg(t 
      order by array_length(regexp_extract_all(to_json_string(t), ':null')) 
      limit 1
    )[offset(0)].* 
    replace(regexp_replace(company, r'(?i)CO LLC', 'CO') as company),
    max(email) max_email, 
    max(phone) max_phone,
    max(website) max_website,
    max(address) max_address
  from `project.dataset.table` t
  group by company 
)

You can check the difference between this and the previous options by applying them to the dummy data below:

with `project.dataset.table` as (
  select 'Amar CO LLC' company, 'amar@gmail.com' email, 123 phone, NULL website, 'India' address union all
  select 'Amar CO', NULL, 222, 'amar.com', NULL union all
  select 'Stacks CO LLC', 'stack@gmail.com', NULL, NULL, 'UK' union all
  select 'Stacks CO', 'stack@gmil.com', 910, 'stacks.com', 'United Kingdom'
)

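The last query (query #3) gives: [output not preserved]

whereas the previous one (query #2) would just give the max values across all rows.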


sql

google-bigquery

bigquery-udf