Find most occurring Col3 for every pair of Col1 and Col2

Given a table myTable with 4 columns, say Col1, Col2, Col3, Col4:

A X 5 B
A Y 5 C
A X 7 D
A Y 3 E 
A X 7 F

I need to find the most occurring col3 for every pair of (col1, col2).

So the result for this example would be:

A X 7   D/F  -- D or F
A Y 5/3 C/E  -- It can be 5 and C or 3 and E

So I wrote a query something like this:

select Col1,Col2,Col3 
from myTable M 
group by Col1,Col2,Col3 
having Col3 =
     (select Col3 
      from myTable N 
      where M.Col1=N.col1 
      group by Col3 
      order by Col3 desc limit 1); 

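But the query does not give the desired result.
Also, I don't know how to get Col4, because I don't want to add Col4 to the GROUP BY clause.

For every (Col1, Col2) pair, I want a single Col4 along with the most occurring Col3.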

One approach is to use the row_number() window function on top of an aggregate query:

SELECT col1, col2, col3
FROM   (SELECT col1, col2, col3, 
        ROW_NUMBER () OVER (PARTITION BY col1, col2 ORDER BY cnt DESC) AS rn
        FROM (SELECT   col1, col2, col3, COUNT(*) AS cnt
              FROM     mytable
              GROUP BY col1, col2, col3) t
       ) q
WHERE  rn = 1

In PostgreSQL, a single query level with DISTINCT ON is all you need:

SELECT DISTINCT ON (col1, col2)
       col1, col2, col3, min(col4) AS col4
FROM   tbl
GROUP  BY col1, col2, col3
ORDER  BY col1, col2, count(*) DESC, col3;

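This way you get a single row per (col1, col2) with the most common col3 (the smallest value if several tie for "most common") and the smallest col4 that goes along with that col3.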
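
To cross-check what that query should return, here is a minimal sketch of the same selection rule in plain Python. The function name `pick_most_common` and the `ROWS` list are made up for illustration; the data is the sample table from the question.

```python
from collections import Counter, defaultdict

# Sample rows from the question: (col1, col2, col3, col4)
ROWS = [("A", "X", 5, "B"), ("A", "Y", 5, "C"), ("A", "X", 7, "D"),
        ("A", "Y", 3, "E"), ("A", "X", 7, "F")]

def pick_most_common(rows):
    """One row per (col1, col2): most frequent col3, ties broken by the
    smallest col3, plus the smallest col4 seen with that col3."""
    counts = defaultdict(Counter)  # (col1, col2) -> Counter of col3 values
    min_col4 = {}                  # (col1, col2, col3) -> smallest col4
    for col1, col2, col3, col4 in rows:
        counts[(col1, col2)][col3] += 1
        key = (col1, col2, col3)
        min_col4[key] = min(col4, min_col4.get(key, col4))
    result = []
    for (col1, col2), counter in sorted(counts.items()):
        # Mirrors ORDER BY count(*) DESC, col3: highest count first,
        # then the smallest col3 among equally frequent values.
        col3 = min(counter, key=lambda v: (-counter[v], v))
        result.append((col1, col2, col3, min_col4[(col1, col2, col3)]))
    return result

if __name__ == "__main__":
    for row in pick_most_common(ROWS):
        print(row)
```

For the sample data this yields (A, X, 7, D) and (A, Y, 3, E): the tie between 5 and 3 for (A, Y) is resolved in favor of the smaller col3.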


Similarly, to get all qualifying col3, you can use the window function rank() in a subquery, which is also evaluated after the aggregation:

SELECT col1, col2, col3, col4_list
FROM  (
   SELECT col1, col2, col3, count(*) AS ct, string_agg(col4, '/') AS col4_list
        , rank() OVER (PARTITION BY col1, col2 ORDER BY count(*) DESC) AS rnk
   FROM   tbl
   GROUP  BY col1, col2, col3
   ) sub
WHERE  rnk = 1
ORDER  BY col1, col2, col3;

This works because you can run window functions over aggregate functions.
Cast col4 to text if its data type is not a character type.
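
The difference from the single-row variant is that rank() keeps every col3 that ties for the top count. A rough Python sketch of that behavior on the question's sample data follows; `all_most_common` and `ROWS` are hypothetical names, and the col4 values are sorted before joining only to make the output deterministic (string_agg without ORDER BY promises no particular order).

```python
from collections import defaultdict

# Sample rows from the question: (col1, col2, col3, col4)
ROWS = [("A", "X", 5, "B"), ("A", "Y", 5, "C"), ("A", "X", 7, "D"),
        ("A", "Y", 3, "E"), ("A", "X", 7, "F")]

def all_most_common(rows):
    """Every (col1, col2, col3) whose count ties for the maximum within
    its (col1, col2) pair, with the matching col4 values joined by '/'."""
    groups = defaultdict(list)  # (col1, col2, col3) -> list of col4
    for col1, col2, col3, col4 in rows:
        groups[(col1, col2, col3)].append(col4)
    best = defaultdict(int)     # (col1, col2) -> highest count seen
    for (col1, col2, _), col4s in groups.items():
        best[(col1, col2)] = max(best[(col1, col2)], len(col4s))
    # Keep only the groups with rank 1, i.e. count equal to the maximum.
    return sorted(
        (col1, col2, col3, "/".join(sorted(col4s)))
        for (col1, col2, col3), col4s in groups.items()
        if len(col4s) == best[(col1, col2)]
    )

if __name__ == "__main__":
    for row in all_most_common(ROWS):
        print(row)
```

On the sample data, (A, X) keeps only col3 = 7 with col4 list D/F, while (A, Y) keeps both 3 and 5 because each occurs once.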

Or, all qualifying col3 per (col1, col2) in one list, plus all matching col4 in a second list:

SELECT col1, col2
     , string_agg(col3::text, '/') AS col3_list  -- cast if necessary
     , string_agg(col4_list,  '/') AS col4_list
FROM  (
   SELECT col1, col2, col3, count(*) AS ct, string_agg(col4, '/') AS col4_list
        , rank() OVER (PARTITION BY col1, col2 ORDER BY count(*) DESC) AS rnk
   FROM   tbl
   GROUP  BY col1, col2, col3
   ) sub
WHERE  rnk = 1
GROUP  BY col1, col2
ORDER  BY col1, col2, col3_list;

Related answers with more explanation:

  • Select first row in each GROUP BY group?
  • Best way to get result count before LIMIT was applied
  • Get the distinct sum of a joined table column

Solution for Amazon Redshift

row_number() is available, so this should work:

SELECT col1, col2, col3, col4
FROM  (
   SELECT col1, col2, col3, min(col4) AS col4
        , row_number() OVER (PARTITION BY col1, col2
                             ORDER BY count(*) DESC, col3) AS rn
   FROM   tbl
   GROUP  BY col1, col2, col3
   ) sub
WHERE  rn = 1
ORDER  BY col1, col2;

Or, if window functions over aggregate functions are not allowed, use another subquery:

SELECT col1, col2, col3, col4
FROM  (
   SELECT *, row_number() OVER (PARTITION BY col1, col2
                                ORDER BY ct DESC, col3) AS rn
   FROM (
      SELECT col1, col2, col3, min(col4) AS col4, COUNT(*) AS ct
      FROM   tbl
      GROUP  BY col1, col2, col3
      ) sub1
   ) sub2
WHERE  rn = 1;

This picks the smallest col3 if more than one ties for the maximum count, and the smallest col4 for the respective col3.
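
The nested-subquery form uses only widely supported syntax, so it can be tried outside Postgres or Redshift as well. Below is a small sketch that runs it through Python's built-in sqlite3 module against the question's sample data. Using SQLite as a stand-in is an assumption of this example, not something the answer tested; it needs SQLite 3.25 or later for window functions, and the column types in the CREATE TABLE are guesses.

```python
import sqlite3

# Sample rows from the question: (col1, col2, col3, col4)
ROWS = [("A", "X", 5, "B"), ("A", "Y", 5, "C"), ("A", "X", 7, "D"),
        ("A", "Y", 3, "E"), ("A", "X", 7, "F")]

# The nested-subquery variant from the answer, unchanged.
QUERY = """
SELECT col1, col2, col3, col4
FROM  (
   SELECT *, row_number() OVER (PARTITION BY col1, col2
                                ORDER BY ct DESC, col3) AS rn
   FROM (
      SELECT col1, col2, col3, min(col4) AS col4, COUNT(*) AS ct
      FROM   tbl
      GROUP  BY col1, col2, col3
      ) sub1
   ) sub2
WHERE  rn = 1
"""

def run_query(rows):
    """Load rows into an in-memory table named tbl and run QUERY."""
    con = sqlite3.connect(":memory:")
    try:
        # Column types are assumed; the question does not state them.
        con.execute(
            "CREATE TABLE tbl (col1 TEXT, col2 TEXT, col3 INTEGER, col4 TEXT)")
        con.executemany("INSERT INTO tbl VALUES (?, ?, ?, ?)", rows)
        # The query has no outer ORDER BY, so sort here for stable output.
        return sorted(con.execute(QUERY).fetchall())
    finally:
        con.close()

if __name__ == "__main__":
    for row in run_query(ROWS):
        print(row)
```

With the sample rows this returns (A, X, 7, D) and (A, Y, 3, E), matching the tie-breaking rule described above.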

SQL Fiddle demonstrating all of it in Postgres 9.3.