Percentile_disc() for non-round values

Question

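I have been struggling to find a solution for this, without luck.

In my query I select count(*) and use percentile_disc(.9) to find the 90th percentile position. The problem is that when the count is 29, 90% of it is 26.1, which is closer to 26 than to 27, yet the 27th item is still returned.

Is there any way to say: if 5 < Nth < 10, reduce the result by one?

Table for reference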

ID    Count    90th
-------------------
1     50       45
2     40       36
3     27       25     <-- Should be 24
4      9        9     <-- Should be  8

10% of 9 is 0.9, which rounds to 1, so one item should be removed, giving the 8th.

--- This is my understanding of the Nth percentile ---

What I have right now:

My table has a lot of entries (100k+ per day), so I want to run this query on a daily basis.

Service_id   start_time      end_time
-------------------------------------
Service1    1499025651614    1499025651648
Service2    1499025655145    1499025655434
Service3    1499025656029    1499025656112
Service2    1499025658755    1499025659135
Service3    1499025726862    1499025728346
Service1    1499025748782    1499025750032
Service3    1499025749277    1499025749900
Service3    1499025757681    1499025758517
Service2    1499025775000    1499025775101
Service1    1499025785556    1499025785633
...

I want a query that selects the minimum, maximum and average for each service:

 select mt.SERVICE_ID as SERVICE_ID,
           count(*) as COUNT,
           round(avg((mt.end_time - mt.start_time) / 1000), 2) as Avg,
           round(min((mt.end_time - mt.start_time) / 1000), 2) AS Min,
           round(max((mt.end_time - mt.start_time) / 1000), 2) AS Max
      from myTable mt
     group by mt.service_id

I want to combine it, using a join, with the 90th percentile discussed earlier:

select service_id, round(percentile_disc(.90) within group(order by elapsed), 2) as perc
from (select mt.service_id, ((mt.end_time - mt.start_time) / 1000) as elapsed
      from myTable mt)
group by service_id

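The issue comes when the count is (let's say) 9. In this case MAX and Perc are the same (because the percentile is not removing anything), but in this particular case I need to remove the last one, so that the result is the timing at the 8th position.

Is there any way to drop one more position in this case?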

Answer 1

PERCENTILE_DISC() does not work the way you think it does.

Oracle Documentation:

Purpose

PERCENTILE_DISC is an inverse distribution function that assumes a discrete distribution model. It takes a percentile value and a sort specification and returns an element from the set. Nulls are ignored in the calculation.

...

For a given percentile value P, PERCENTILE_DISC sorts the values of the expression in the ORDER BY clause and returns the value with the smallest CUME_DIST value (with respect to the same sort specification) that is greater than or equal to P.

Analytic Example

The following example calculates the median discrete percentile of the salary of each employee in the sample table hr.employees:
SELECT last_name, salary, department_id,
   PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY salary DESC)
      OVER (PARTITION BY department_id) "Percentile_Disc",
   CUME_DIST() OVER (PARTITION BY department_id 
      ORDER BY salary DESC) "Cume_Dist"
FROM employees where department_id in (30, 60);

LAST_NAME         SALARY DEPARTMENT_ID Percentile_Disc  Cume_Dist
------------- ---------- ------------- --------------- ----------
Raphaely           11000            30            2900 .166666667
Khoo                3100            30            2900 .333333333
Baida               2900            30            2900         .5
Tobias              2800            30            2900 .666666667
Himuro              2600            30            2900 .833333333
Colmenares          2500            30            2900          1
Hunold              9000            60            4800         .2
Ernst               6000            60            4800         .4
Austin              4800            60            4800         .8
Pataballa           4800            60            4800         .8
Lorentz             4200            60            4800          1
The median value for Department 30 is 2900, which is the value whose corresponding percentile (Cume_Dist) is the smallest value greater than or equal to 0.5. The median value for Department 60 is 4800, which is the value whose corresponding percentile is the smallest value greater than or equal to 0.5.

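In the example given in the documentation, if the percentile were set to 0.9 (instead of 0.5), you can see that CUME_DIST jumps from 0.8 to 1 (for department 60), so PERCENTILE_DISC(0.9) ... would give 4200, since that is the value with the smallest CUME_DIST greater than or equal to 0.9. In that case, to get the penultimate value you would need a percentile of 0.8.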
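To make the rule concrete, here is a minimal Python sketch of the documented behaviour (sort, compute CUME_DIST, return the first value at or above P). The function name `percentile_disc` and its `descending` flag are my own illustration, not an Oracle API, and the data is department 60 from the documentation example.

```python
def percentile_disc(values, p, descending=False):
    """Return the first value (in sort order) whose CUME_DIST is >= p."""
    # Nulls are ignored, as the documentation quoted above states.
    ordered = sorted((v for v in values if v is not None), reverse=descending)
    n = len(ordered)
    for i, value in enumerate(ordered):
        # Tied rows share the CUME_DIST of the last row in the tie,
        # so keep going until the tie ends.
        if i + 1 < n and ordered[i + 1] == value:
            continue
        if (i + 1) / n >= p:
            return value
    raise ValueError("no rows, or p is greater than 1")


# Department 60 salaries, sorted DESC as in the documentation example.
dept_60 = [9000, 6000, 4800, 4800, 4200]
median = percentile_disc(dept_60, 0.5, descending=True)
p90 = percentile_disc(dept_60, 0.9, descending=True)
p80 = percentile_disc(dept_60, 0.8, descending=True)
print(median, p90, p80)  # 4800 4200 4800
```

The two 4800 rows share a CUME_DIST of 0.8, which is why nothing between 0.8 and 1 can be hit in this group.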

The issue comes when the count is (let's say) 9. In this case MAX and Perc are the same (because the percentile is not removing anything), but in this particular case I need to remove the last one, so that the result is the timing at the 8th position.

For 9 items, the CUME_DIST value of each row is:

ROW_NUMBER CUME_DIST
---------- ---------
         1      .111
         2      .222
         3      .333
         4      .444
         5      .556
         6      .667
         7      .778
         8      .889
         9     1.000

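If you use PERCENTILE_DISC( 0.9 ) then it looks for the value with the lowest CUME_DIST that is greater than or equal to 0.9. There is only one such value, 1.000, and it is also the maximum.

If you want a different value, you need to use a lower percentile.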
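The same reasoning reproduces the "90th" column of the table in the question. This is a small Python sketch, assuming every row in the group has a distinct value so that row k has CUME_DIST k / count; `disc_row` is a hypothetical helper name.

```python
def disc_row(count, p):
    """1-based position PERCENTILE_DISC(p) lands on for `count` distinct rows."""
    for row in range(1, count + 1):
        if row / count >= p:  # row / count is the CUME_DIST of this row
            return row
    raise ValueError("count must be positive and p at most 1")


for count in (50, 40, 27, 9):
    print(count, disc_row(count, 0.9))
# 50 45
# 40 36
# 27 25
# 9 9
```

With 9 rows, any percentile above 8/9 (about 0.889) lands on the last row, so 0.9 can only return the maximum.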

Update:

You could try something like this:

select service_id, 
       elapsed as perc
from (
  select service_id,
         (end_time - start_time) / 1000 as elapsed,
         ROW_NUMBER() OVER ( PARTITION BY service_id ORDER BY (end_time - start_time) )
           AS rn,
         COUNT(*) OVER ( PARTITION BY service_id ) AS ct
  from   myTable
)
WHERE rn = ROUND( 0.9 * ct );

Change the last line to use ROUND, FLOOR or CEIL depending on your business logic. If I have worked out the logic correctly, CEIL will give the same answer as using PERCENTILE_DISC.
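To see how the three choices differ before touching the query, here is a rough Python sketch of the row each one would pick. `target_row` and `MODES` are names I made up for illustration. I use `Decimal` with `ROUND_HALF_UP` because, as far as I know, Oracle's ROUND on a NUMBER rounds halves away from zero, whereas Python's built-in `round` rounds halves to even.

```python
from decimal import Decimal, ROUND_CEILING, ROUND_FLOOR, ROUND_HALF_UP

# Assumed mapping from the SQL function name to a decimal rounding mode.
MODES = {"ROUND": ROUND_HALF_UP, "FLOOR": ROUND_FLOOR, "CEIL": ROUND_CEILING}


def target_row(ct, mode, p=Decimal("0.9")):
    """Row number selected by `rn = MODE( p * ct )` for a group of ct rows."""
    return int((p * ct).quantize(Decimal("1"), rounding=MODES[mode]))


for ct in (9, 27, 5):
    print(ct, [target_row(ct, m) for m in ("ROUND", "FLOOR", "CEIL")])
# 9 [8, 8, 9]
# 27 [24, 24, 25]
# 5 [5, 4, 5]
```

For counts 9 and 27, ROUND gives the 8th and 24th rows that the question asks for, while CEIL gives 9 and 25, the values the question reports for PERCENTILE_DISC.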

What I need is: if the count is 7, remove the last record and return the 6th value (10% of 7 is 0.7, rounded to 1); if the count is 21, remove the last 2 records and return the 19th value (10% of 21 is 2.1, rounded to 2); and so on.

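Using rn = ROUND( 0.9 * ct ):

If the count is 7 then 0.9 * 7 = 6.3, so ROUND( 6.3 ) gives the 6th row.
If the count is 21 then 0.9 * 21 = 18.9, so ROUND( 18.9 ) gives the 19th row.
If the count is 3 then 0.9 * 3 = 2.7, so ROUND( 2.7 ) gives the 3rd row (the maximum).

It is not clear what you want returned for small sets. If you never want the maximum row (unless there is only one row), then something like: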

WHERE rn = GREATEST( 1, LEAST( ct - 1, ROUND( 0.9 * ct ) ) )
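If it helps to check the small-set behaviour, this Python sketch mirrors the GREATEST / LEAST clamp. `clamped_row` is a hypothetical name, and `ROUND_HALF_UP` is my assumption for how Oracle's ROUND treats halves.

```python
from decimal import Decimal, ROUND_HALF_UP


def clamped_row(ct, p=Decimal("0.9")):
    """Mirror of GREATEST( 1, LEAST( ct - 1, ROUND( p * ct ) ) )."""
    rounded = int((p * ct).quantize(Decimal("1"), rounding=ROUND_HALF_UP))
    # Never go past the second-to-last row, but never below row 1.
    return max(1, min(ct - 1, rounded))


for ct in (1, 2, 3, 9, 21):
    print(ct, clamped_row(ct))
# 1 1
# 2 1
# 3 2
# 9 8
# 21 19
```

The only group that still returns its maximum is the single-row one, where row 1 is all there is.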

oracle

percentile