Calculating median with Group By in AWS Redshift

I have seen other posts about using the median() window function in Redshift, but how would you use it in a query that has a GROUP BY at the end?

For example, assume this table, course:

Course | Subject | Num_Students
-------------------------------
   1   |  Math   |      4
   2   |  Math   |      6
   3   |  Math   |      10
   4   | Science |      2
   5   | Science |      10
   6   | Science |      12
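
For anyone who wants to reproduce this, the table above could be set up roughly as follows. This is only a sketch: the column types are assumptions, since the schema is not given here.

```sql
-- Assumed schema: column types are guesses, not taken from the original post.
CREATE TABLE course (
    course       INTEGER,
    subject      VARCHAR(20),
    num_students INTEGER
);

-- Sample rows copied from the table above.
INSERT INTO course (course, subject, num_students) VALUES
    (1, 'Math', 4),
    (2, 'Math', 6),
    (3, 'Math', 10),
    (4, 'Science', 2),
    (5, 'Science', 10),
    (6, 'Science', 12);
```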

I want to get the median number of students for each course subject. How would I write a query that gives the following result:

  Subject  | Median
-----------------------
 Math      |     6
 Science   |     10

I have tried:

SELECT
subject, median(num_students) over ()
FROM
course
GROUP BY 1
;

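But it lists every occurrence of each subject, with the same median value across all subjects (this is fake data, so the actual value it returns is not 6; the point is only that the value is the same for every subject):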

  Subject  | Median
-----------------------
 Math      |     6
 Math      |     6
 Math      |     6
 Science   |     6
 Science   |     6
 Science   |     6

You haven't defined a partition in the window. Instead of OVER() you need OVER(PARTITION BY subject).

The following will give you exactly the result you are looking for:

SELECT distinct
subject, median(num_students) over(partition by Subject) 
FROM
course
order by Subject;

Suppose you also want to calculate other aggregates by subject, such as avg(). In that case you need to use a subquery:

WITH subject_numstudents_medianstudents AS (
    SELECT
        subject
        , num_students
        , median(num_students) over (partition BY subject) AS median_students
    FROM
        course
)
SELECT
    subject
    , median_students
    , avg(num_students) as avg_students
FROM subject_numstudents_medianstudents
GROUP BY 1, 2

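You simply need to remove the "over()" part. Without an OVER clause, median() acts as an ordinary aggregate function in Redshift, so GROUP BY applies to it directly.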

SELECT subject, median(num_students) FROM course GROUP BY 1;
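
Since the aggregate form works with GROUP BY, other per-subject aggregates such as avg() can presumably go in the same query, with no subquery needed. A minimal sketch, assuming the course table from the question; the column aliases are only illustrative:

```sql
-- median() without OVER is used here as a plain aggregate, alongside avg().
-- The aliases median_students and avg_students are illustrative names.
SELECT
    subject
    , median(num_students) AS median_students
    , avg(num_students) AS avg_students
FROM course
GROUP BY subject
ORDER BY subject;
```

With the sample data this should give a median of 6 for Math and 10 for Science, matching the result asked for in the question.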