Find the most frequent value in a column and replace NULLs with that value

Question

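I have a table like the one shown below:

I want to replace the NULL values in tier_1 and tier_2 with the most frequent value of tier_1 and tier_2, respectively, within each name.

The result should look like this:

The same has to be done for the other names, B and C. I tried the following query, which gives the correct counts: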

select tier_1,name,max(tr_cnt) from (
select name,tier_1, count(tier_1) as tr_cnt from `my_table` group by tier_1,name having count(tier_1) >=1
)group by name,tier_1

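The query above works to some extent, but when I add tier_2 to it, I get the error: Scalar subquery produced more than one element

Is there a way to achieve this and replace the NULLs with the most frequent value in that column?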

Answer 1

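Use the COUNT() window function to get the number of occurrences of each name and tier_x combination, and the FIRST_VALUE() window function to pick the tier_x that occurs most often: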

WITH cte AS (
  SELECT *,
         -- occurrences of each value within a name; COUNT(col) skips NULLs, so NULL rows get 0
         COUNT(tier_1) OVER (PARTITION BY name, tier_1) counter1,
         COUNT(tier_2) OVER (PARTITION BY name, tier_2) counter2
  FROM my_table
)
SELECT name,
       -- keep non-NULL values; otherwise take the value with the highest counter for that name
       COALESCE(tier_1, FIRST_VALUE(tier_1) OVER (PARTITION BY name ORDER BY counter1 DESC)) tier_1,
       COALESCE(tier_2, FIRST_VALUE(tier_2) OVER (PARTITION BY name ORDER BY counter2 DESC)) tier_2
FROM cte;
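For anyone who wants to try the query without a MySQL server, here is a minimal sketch that runs the same SQL against an in-memory SQLite database from Python (SQLite needs to be 3.25 or newer for window functions). The rows in `ROWS` and the helper name `fill_with_mode` are made up for illustration; they are not the sample data from the question.

```python
import sqlite3

# Hypothetical sample rows (name, tier_1, tier_2); None becomes SQL NULL.
ROWS = [
    ("A", "x", "p"),
    ("A", "x", "p"),
    ("A", "y", "q"),
    ("A", None, None),
    ("B", "z", None),
    ("B", "z", "r"),
    ("B", None, "r"),
]

# Same query as in the answer, unchanged apart from the missing semicolon.
QUERY = """
WITH cte AS (
  SELECT *,
         COUNT(tier_1) OVER (PARTITION BY name, tier_1) counter1,
         COUNT(tier_2) OVER (PARTITION BY name, tier_2) counter2
  FROM my_table
)
SELECT name,
       COALESCE(tier_1, FIRST_VALUE(tier_1) OVER (PARTITION BY name ORDER BY counter1 DESC)) tier_1,
       COALESCE(tier_2, FIRST_VALUE(tier_2) OVER (PARTITION BY name ORDER BY counter2 DESC)) tier_2
FROM cte
"""


def fill_with_mode(rows):
    """Load rows into an in-memory table, run the query, return sorted result rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE my_table (name TEXT, tier_1 TEXT, tier_2 TEXT)")
    conn.executemany("INSERT INTO my_table VALUES (?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return sorted(result)


if __name__ == "__main__":
    for row in fill_with_mode(ROWS):
        print(row)
```

With these rows, the NULLs for name A are filled with x and p, and those for name B with z and r. The data contains no ties, so the result does not depend on how ties are resolved.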

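Note that in case of a tie you will get an arbitrary one of the tied values as the top tier_x.
If you want to break such ties deterministically, you can add more sort levels to the ORDER BY clause of the FIRST_VALUE() window function: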

WITH cte AS (
  SELECT *,
         COUNT(tier_1) OVER (PARTITION BY name, tier_1) counter1,
         COUNT(tier_2) OVER (PARTITION BY name, tier_2) counter2  
  FROM my_table
)
SELECT name,
       COALESCE(tier_1, FIRST_VALUE(tier_1) OVER (PARTITION BY name ORDER BY counter1 DESC, tier_1)) tier_1,
       COALESCE(tier_2, FIRST_VALUE(tier_2) OVER (PARTITION BY name ORDER BY counter2 DESC, tier_2)) tier_2
FROM cte;
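The effect of the extra sort key can be checked with a small sketch, again using SQLite from Python as a stand-in for MySQL. The rows and the helper name `fill_with_tiebreak` are hypothetical, and only tier_1 is included to keep the example short.

```python
import sqlite3

# Hypothetical rows (name, tier_1): "a" and "b" tie for name "C";
# name "D" has no non-NULL value at all.
ROWS = [("C", "b"), ("C", "a"), ("C", None), ("D", None)]

# tier_1 is added as a second sort key, so ties on counter1 are
# resolved in favour of the smallest tier_1 value.
QUERY = """
WITH cte AS (
  SELECT *,
         COUNT(tier_1) OVER (PARTITION BY name, tier_1) counter1
  FROM my_table
)
SELECT name,
       COALESCE(tier_1, FIRST_VALUE(tier_1) OVER (PARTITION BY name ORDER BY counter1 DESC, tier_1)) tier_1
FROM cte
"""


def fill_with_tiebreak(rows):
    """Run the tie-breaking query on an in-memory table and return sorted rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE my_table (name TEXT, tier_1 TEXT)")
    conn.executemany("INSERT INTO my_table VALUES (?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return sorted(result)


if __name__ == "__main__":
    for row in fill_with_tiebreak(ROWS):
        print(row)
```

Here a and b each occur once for name C, so the extra sort key picks a, the smaller value. A name with no non-NULL values, like D, keeps its NULL because there is nothing to fill it with.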

See the demo.

Answer 2

Consider the option below (BigQuery):

select name, 
  ifnull(tier_1, max_tier_1) tier_1,
  ifnull(tier_2, max_tier_2) tier_2
from your_table t
left join (
  -- per name: each non-null value gets weight 1 (nulls get 0), so the top-1 entry
  -- is the (approximately) most frequent value
  select name, 
    approx_top_sum(tier_1, if(tier_1 is null, 0, 1), 1)[offset(0)].value max_tier_1,
    approx_top_sum(tier_2, if(tier_2 is null, 0, 1), 1)[offset(0)].value max_tier_2
  from your_table 
  group by name
)
using (name)

If applied to the sample data in your question, the output is:

Note the use of the approximate aggregate function APPROX_TOP_SUM:

Approximate aggregate functions are scalable in terms of memory usage and time, but produce approximate results instead of exact results. These functions typically require less memory than exact aggregation functions like COUNT(DISTINCT ...), but also introduce statistical uncertainty. This makes approximate aggregation appropriate for large data streams for which linear memory usage is impractical, as well as for data that is already approximate.
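
If the approximation is a concern, the most frequent value can also be computed exactly with GROUP BY and ROW_NUMBER(). The sketch below is not the query from this answer; it is a plain-SQL variant, run here in SQLite through Python only so that it is easy to test. The rows, the CTE names `counts` and `ranked`, and the helper `fill_exact` are assumptions for illustration, and only tier_1 is handled; tier_2 would follow the same pattern.

```python
import sqlite3

# Hypothetical rows (name, tier_1); None becomes SQL NULL.
ROWS = [("A", "x"), ("A", "x"), ("A", "y"), ("A", None), ("B", None)]

QUERY = """
WITH counts AS (
  -- exact number of occurrences of each non-NULL value per name
  SELECT name, tier_1, COUNT(*) AS cnt
  FROM your_table
  WHERE tier_1 IS NOT NULL
  GROUP BY name, tier_1
),
ranked AS (
  -- rn = 1 marks the most frequent value; ties go to the smallest tier_1
  SELECT name, tier_1,
         ROW_NUMBER() OVER (PARTITION BY name ORDER BY cnt DESC, tier_1) AS rn
  FROM counts
)
SELECT t.name, COALESCE(t.tier_1, r.tier_1) AS tier_1
FROM your_table t
LEFT JOIN ranked r ON r.name = t.name AND r.rn = 1
"""


def fill_exact(rows):
    """Fill NULL tier_1 values with the exact per-name mode; return sorted rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE your_table (name TEXT, tier_1 TEXT)")
    conn.executemany("INSERT INTO your_table VALUES (?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return sorted(result)


if __name__ == "__main__":
    for row in fill_exact(ROWS):
        print(row)
```

The LEFT JOIN keeps names that have no non-NULL value at all (B in this data); their tier_1 simply stays NULL.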

Tags: mysql, sql, common-table-expression, window-functions, google-bigquery