Find the most frequent value per group in a table column

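I need to find the most common value of object_of_search for each ethnicity. How can I do that? Subqueries and correlated subqueries in the SELECT clause are not allowed. Something like this: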

mode() WITHIN GROUP (ORDER BY stopAndSearches.object_of_search) AS "Most frequent object of search"

But this does not aggregate: I get many rows per ethnicity, one for each object_of_search:

 officer_defined_ethnicity | Sas for ethnicity |   Arrest rate    | Most frequent object of search
---------------------------+-------------------+------------------+--------------------------------
 ethnicity2                |                 3 | 66.6666666666667 | Stolen goods
 ethnicity3                |                 2 |              100 | Fireworks
 ethnicity1                |                 5 |               60 | Firearms
 ethnicity3                |                 2 |              100 | Firearms
 ethnicity1                |                 5 |               60 | Cat
 ethnicity1                |                 5 |               60 | Dog
 ethnicity2                |                 3 | 66.6666666666667 | Firearms
 ethnicity1                |                 5 |               60 | Psychoactive substances
 ethnicity1                |                 5 |               60 | Fireworks

It should look like this:

 officer_defined_ethnicity | Sas for ethnicity |   Arrest rate    | Most frequent object of search
---------------------------+-------------------+------------------+--------------------------------
 ethnicity2                |                 3 | 66.6666666666667 | Stolen goods
 ethnicity3                |                 2 |              100 | Fireworks
 ethnicity1                |                 5 |               60 | Firearms

Table fiddle
Query:

SELECT DISTINCT
    stopAndSearches.officer_defined_ethnicity,
    count(stopAndSearches.sas_id) OVER(PARTITION BY stopAndSearches.officer_defined_ethnicity) AS "Sas for ethnicity",
    sum(case when stopAndSearches.outcome = 'Arrest' then 1 else 0 end)
       OVER (PARTITION BY stopAndSearches.officer_defined_ethnicity)::float /
       count(stopAndSearches.sas_id) OVER(PARTITION BY stopAndSearches.officer_defined_ethnicity)::float * 100 AS "Arrest rate",
    mode() WITHIN GROUP (ORDER BY stopAndSearches.object_of_search) AS "Most frequent object of search"
FROM stopAndSearches
GROUP BY stopAndSearches.sas_id, stopAndSearches.officer_defined_ethnicity;

Table:

CREATE TABLE IF NOT EXISTS stopAndSearches(
    "sas_id" bigserial PRIMARY KEY,
    "officer_defined_ethnicity" VARCHAR(255),
    "object_of_search" VARCHAR(255),
    "outcome" VARCHAR(255)
);

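Updated: Fiddle

This should solve the specific "object per ethnicity" problem.

Note that this does not address ties in the counts. That was not part of the question/request.

Adapt your SQL to include this logic to supply the detail: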

WITH cte AS (
        SELECT officer_defined_ethnicity
             , object_of_search
             , COUNT(*) AS n
             , ROW_NUMBER() OVER (PARTITION BY officer_defined_ethnicity ORDER BY COUNT(*) DESC) AS rn
          FROM stopAndSearches
         GROUP BY officer_defined_ethnicity, object_of_search
     )
SELECT * FROM cte
 WHERE rn = 1
;

Result:

 officer_defined_ethnicity | object_of_search | n | rn
---------------------------+------------------+---+----
 ethnicity1                | Cat              | 1 |  1
 ethnicity2                | Stolen goods     | 2 |  1
 ethnicity3                | Fireworks        | 1 |  1

SELECT DISTINCT ON (1)
       officer_defined_ethnicity, object_of_search, count(*) AS ct
FROM   stop_and_searches
GROUP  BY 1, 2
ORDER  BY 1, 3 DESC, 2;

Or more explicitly:

SELECT DISTINCT ON (officer_defined_ethnicity)
       officer_defined_ethnicity, object_of_search, count(*) AS ct
FROM   stop_and_searches
GROUP  BY officer_defined_ethnicity, object_of_search
ORDER  BY officer_defined_ethnicity, ct DESC, object_of_search;

 officer_defined_ethnicity | object_of_search | ct
---------------------------+------------------+----
 ethnicity1                | Cat              | 1
 ethnicity2                | Stolen goods     | 2
 ethnicity3                | Firearms         | 1

db<>fiddle here

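Because DISTINCT ON is applied after GROUP BY, we only need a single query level.

  1. Aggregate with GROUP BY to get the count for each (officer_defined_ethnicity, object_of_search) pair.
  2. DISTINCT ON picks the row with the highest count for each officer_defined_ethnicity.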
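
The same evaluation order also leaves room for the other columns from the question: window functions run after GROUP BY but before DISTINCT ON, so they can sum the per-pair counts up to per-ethnicity totals without a subquery. A minimal sketch, assuming stop_and_searches has the same columns as the table in the question (outcome in particular) and PostgreSQL 9.4 or later for FILTER; the window name w is only illustrative:

```sql
SELECT DISTINCT ON (officer_defined_ethnicity)
       officer_defined_ethnicity
       -- per-pair counts summed over the whole ethnicity
     , sum(count(*)) OVER w                          AS "Sas for ethnicity"
       -- arrests per ethnicity divided by searches per ethnicity
     , 100.0 * sum(count(*) FILTER (WHERE outcome = 'Arrest')) OVER w
             / sum(count(*)) OVER w                  AS "Arrest rate"
     , object_of_search                              AS "Most frequent object of search"
FROM   stop_and_searches                             -- assumed table name, as in the query above
GROUP  BY officer_defined_ethnicity, object_of_search
WINDOW w AS (PARTITION BY officer_defined_ethnicity)
ORDER  BY officer_defined_ethnicity, count(*) DESC, object_of_search;
```

The window totals are computed over all (ethnicity, object) rows before DISTINCT ON discards all but the top row per ethnicity, so the totals still cover every search.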

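I added object_of_search as the third ORDER BY item to act as a tiebreaker and produce a deterministic result:
in case of a tie, the alphabetically first object_of_search is picked.
Adapt to your needs.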
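
If the need is to see every tied value rather than a single pick, DISTINCT ON no longer fits, because it keeps exactly one row per group. One possible adaptation, sketched here with rank() in a CTE (the names counted and rnk are illustrative, and the table is assumed to be the same stop_and_searches as above):

```sql
WITH counted AS (
   SELECT officer_defined_ethnicity
        , object_of_search
        , count(*) AS ct
          -- rank() gives equal counts the same rank, so all ties get rnk = 1
        , rank() OVER (PARTITION BY officer_defined_ethnicity
                       ORDER BY count(*) DESC) AS rnk
   FROM   stop_and_searches
   GROUP  BY officer_defined_ethnicity, object_of_search
   )
SELECT officer_defined_ethnicity, object_of_search, ct
FROM   counted
WHERE  rnk = 1
ORDER  BY officer_defined_ethnicity, object_of_search;
```

This returns one row per ethnicity when there is a clear winner and several rows when the top count is shared.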

See:

  • Select first row in each GROUP BY group?
  • Best way to get result count before LIMIT was applied

This is simpler and typically faster than a subquery with row_number():

  • Select first row in each GROUP BY group? - Benchmarks
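
For completeness, the mode() aggregate from the question can also produce one row per ethnicity. The original query groups by sas_id, the primary key, so every group holds a single row and nothing collapses. A minimal sketch, assuming it is acceptable to group by officer_defined_ethnicity alone and to replace the window functions with plain aggregates (PostgreSQL 9.4 or later; table name assumed as above):

```sql
SELECT officer_defined_ethnicity
     , count(*)                                      AS "Sas for ethnicity"
       -- share of searches per ethnicity that ended in an arrest
     , 100.0 * count(*) FILTER (WHERE outcome = 'Arrest')
             / count(*)                              AS "Arrest rate"
       -- most frequent value within the group
     , mode() WITHIN GROUP (ORDER BY object_of_search)
                                                     AS "Most frequent object of search"
FROM   stop_and_searches
GROUP  BY officer_defined_ethnicity;
```

Unlike the DISTINCT ON query, mode() picks arbitrarily among equally frequent values and does not return the count itself, so it suits cases where ties do not need an explicit rule.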