PERCENTILE_CONT() returns same value regardless of input parameter

Question

I want to get the 5th, 50th, and 95th percentiles of a table:

SELECT col1, col2, col3, AVG(col4), STD(col4), 
    PERCENTILE_CONT(0.05) WITHIN GROUP (ORDER BY col4) 
        OVER (PARTITION BY col1, col2, col3) as 5th_percentile, 
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY col4)  
        OVER (PARTITION BY col1, col2, col3) as 50th_percentile, 
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY col4)  
        OVER (PARTITION BY col1, col2, col3) as 95th_percentile
FROM table
GROUP BY col1, col2, col3
LIMIT 100

The result I end up with is 5th_percentile == 50th_percentile == 95th_percentile:

AVG(col4)   STD(col4)   5th_percentile   50th_percentile  95th_percentile
300.000000  0.000000    300.000000       300.000000       300.000000
67.076600   16.968851   82.031792        82.031792        82.031792
66.166136   11.452172   78.348846        78.348846        78.348846
544.262809  68.269014   605.797302       605.797302       605.797302
22.523138   1.820358    24.000000        24.000000        24.000000

What's going on?

Edit: the database is MemSQL.

Answer 1

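PERCENTILE_CONT() can be either an aggregation function or a window function, at least in some databases.

I think what is happening is that the value is being calculated after the aggregation, though I'm not sure why. Honestly, I would have expected the code to produce a syntax error, because col4 is not aggregated. In other words, (ORDER BY MAX(col4)) should work, but (ORDER BY col4) should not, because the percentile is calculated after the aggregation.

But try it without the OVER clause: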

SELECT col1, col2, col3, AVG(col4), STD(col4),
    PERCENTILE_CONT(0.05) WITHIN GROUP (ORDER BY col4) as 5th_percentile,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY col4) as 50th_percentile,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY col4) as 95th_percentile
FROM table
GROUP BY col1, col2, col3
LIMIT 100;

Edit:

It appears that your database does not support PERCENTILE_CONT() as an aggregation function. Not all dialects do, although most do.

The workaround is SELECT DISTINCT:

SELECT DISTINCT col1, col2, col3,
    AVG(col4) OVER (PARTITION BY col1, col2, col3),
    STD(col4) OVER (PARTITION BY col1, col2, col3),
    PERCENTILE_CONT(0.05) WITHIN GROUP (ORDER BY col4)
        OVER (PARTITION BY col1, col2, col3) as 5th_percentile,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY col4)
        OVER (PARTITION BY col1, col2, col3) as 50th_percentile,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY col4)
        OVER (PARTITION BY col1, col2, col3) as 95th_percentile
FROM table
LIMIT 100;

Or use a subquery.

Answer 2

WITH a AS (
SELECT col1, col2, col3, 
        PERCENTILE_CONT(0.05) WITHIN GROUP (ORDER BY col4) 
            OVER (PARTITION BY col1, col2, col3) as 5th_percentile,
        PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY col4) 
            OVER (PARTITION BY col1, col2, col3) as 50th_percentile,
        PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY col4) 
            OVER (PARTITION BY col1, col2, col3) as 95th_percentile
FROM table
)
SELECT DISTINCT col1, col2, col3, 5th_percentile, 50th_percentile, 95th_percentile
FROM a
LIMIT 100

This works. It looks like you can't use GROUP BY together with percentile_cont.

Answer 3

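Window functions run after the GROUP BY clause. GROUP BY produces one row per group, so each window partition contains a single row, and every percentile of a single value is that value. That is why the PERCENTILE_CONT window functions all return the same value.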
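One way to see this for yourself is to count rows before and after grouping. The query below is only a sketch: my_table is a hypothetical stand-in for the real table name, and the column aliases are made up for illustration.

```sql
-- Sketch only: my_table is a placeholder for the real table.
SELECT col1, col2, col3,
       -- Aggregate: number of source rows collapsed into this group.
       COUNT(*) AS rows_in_group,
       -- Window function: evaluated over the grouped output, so each
       -- (col1, col2, col3) partition should contain exactly one row.
       COUNT(*) OVER (PARTITION BY col1, col2, col3) AS rows_in_window
FROM my_table
GROUP BY col1, col2, col3;
```

If rows_in_window comes back as 1 everywhere while rows_in_group is larger, the window functions are only seeing one row per partition, which matches the identical percentiles in the question.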

You want to compute the window functions first and apply the GROUP BY afterwards. You can do this by putting the window functions in an inner subselect and the GROUP BY in the outer select.
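A rough sketch of that shape, assuming the table is called my_table (a placeholder) and using made-up aliases p05, p50 and p95 for the inner columns:

```sql
-- Sketch only: my_table and all aliases are placeholders.
SELECT col1, col2, col3,
       AVG(col4) AS avg_col4,
       STD(col4) AS std_col4,
       -- Each percentile is constant within a group, so MAX() just
       -- picks that single value while satisfying GROUP BY.
       MAX(p05) AS pct_05,
       MAX(p50) AS pct_50,
       MAX(p95) AS pct_95
FROM (
    -- Inner select: the window functions see every ungrouped row.
    SELECT col1, col2, col3, col4,
           PERCENTILE_CONT(0.05) WITHIN GROUP (ORDER BY col4)
               OVER (PARTITION BY col1, col2, col3) AS p05,
           PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY col4)
               OVER (PARTITION BY col1, col2, col3) AS p50,
           PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY col4)
               OVER (PARTITION BY col1, col2, col3) AS p95
    FROM my_table
) AS t
GROUP BY col1, col2, col3
LIMIT 100;
```

Unlike the SELECT DISTINCT variants, this keeps AVG and STD as ordinary aggregates in the outer query. MIN() would work just as well as MAX() here, since the wrapped value does not vary within a group.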

Here is the documentation from postgres that explains how window functions relate to GROUP BY (this is standard ANSI SQL, and MemSQL does the same):

https://www.postgresql.org/docs/current/static/tutorial-window.html

The rows considered by a window function are those of the "virtual table" produced by the query's FROM clause as filtered by its WHERE, GROUP BY, and HAVING clauses if any. For example, a row removed because it does not meet the WHERE condition is not seen by any window function. A query can contain multiple window functions that slice up the data in different ways by means of different OVER clauses, but they all act on the same collection of rows defined by this virtual table.

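Note that in MemSQL, if you use a column that is neither grouped nor aggregated (such as col4 in your query), you get an arbitrary value from the rows in the group, i.e. it behaves like an ANY_VALUE aggregate. In a future version of MemSQL, this query will return an error instead, to help you avoid writing queries with this kind of unexpected behavior.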
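Put differently, the original query behaves roughly as if it had been written like the sketch below. This is an illustration of the semantics described above, not something taken from the MemSQL documentation: my_table and the alias any_col4 are placeholders, and it assumes ANY_VALUE is available as an aggregate in your version.

```sql
-- Sketch only: approximates what the original query effectively computes.
SELECT col1, col2, col3,
       -- One row per partition, so this just returns any_col4.
       PERCENTILE_CONT(0.05) WITHIN GROUP (ORDER BY any_col4)
           OVER (PARTITION BY col1, col2, col3) AS pct_05,
       -- Same single row, so the median is also any_col4.
       PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY any_col4)
           OVER (PARTITION BY col1, col2, col3) AS pct_50
FROM (
    -- GROUP BY collapses each group to one row and keeps an
    -- arbitrary col4 value from it.
    SELECT col1, col2, col3, ANY_VALUE(col4) AS any_col4
    FROM my_table
    GROUP BY col1, col2, col3
) AS g;
```

That would also explain why the repeated value in the question's output (for example 82.031792) does not have to sit near the group's average: it is just one arbitrary col4 value from the group.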


sql

singlestore