Greenplum 中的滚动(移动)中位数

Rolling (moving) median in Greenplum

我想为 Greenplum 中的列计算 滚动中位数,即如下所示:

|  x | rolling_median_x |
| -- + ---------------- |
|  4 |                4 |
|  1 |              2.5 |
|  3 |                3 |
|  2 |              2.5 |
|  1 |                2 |
|  6 |              2.5 |
|  9 |                3 |

x 是一个整数,对于每一行,rolling_median_x 显示当前行和前一行的 x 的中值。例如。第三行 rolling_median_x = median(4, 1, 3) = 3.

到目前为止我发现的事情:

事实上,我无法找到关于哪些函数可以用作 Greenplum 中的框架 window 函数的正确文档...

我正在使用 Greenplum 4.3.4.0(基于 Postgres 8.2.15),不幸的是无法更新。

一句话 - 引用自维基百科:ORDER BY

ORDER BY is the only way to sort the rows in the result set. Without this clause, the relational database system may return the rows in any order. If an ordering is required, the ORDER BY must be provided in the SELECT statement sent by the application. Although some database systems allow the specification of an ORDER BY clause in subselects or view definitions, the presence there has no effect. A view is a logical relational table, and the relational model mandates that a table is a set of rows, implying no sort order whatsoever.


由于您需要计算当前 和前面行 的中位数,因此您必须在 table 中有一个额外的行来定义 行的顺序,可用于确定哪些行在给定行之前,哪些行在给定行之后。
让像这样说一些 id 列:

| id | x | rolling_median_x |
|----|---|------------------|
|  1 | 4 |                4 |
|  2 | 1 |              2.5 |
|  3 | 3 |                3 |
|  4 | 2 |              2.5 |
|  5 | 1 |                2 |
|  6 | 6 |              2.5 |
|  7 | 9 |                3 |

如果您不能使用解析函数,请尝试纯 SQL.
This article 展示了使用 SQL.
计算中位数的各种方法 我认为 Henderson 的中位数最适合我们的需要:

SELECT CASE COUNT(*) % 2
       WHEN 0        -- even sized table
       THEN (P1.part_wgt + MIN(CASE WHEN P2.part_wgt > P1.part_wgt
                                  THEN P2.part_wgt
                                  ELSE NULL END))/2.0
       ELSE P1.part_wgt --odd sized table
       END AS median 
  FROM Parts AS P1, Parts AS P2
 GROUP BY P1.part_wgt
HAVING COUNT(CASE WHEN P1.part_wgt >= P2.part_wgt
                  THEN 1
                  ELSE NULL END)
       = (COUNT(*) + 1) / 2;

只是运行这个查询把每一行作为一个依赖子查询,大体思路是这样的:

SELECT t.*, (
        SELECT .... Henderson's query FROM table x
        WHERE x.id <= t.id
        ......
       ) As our_median
FROM table t

您可以在 this demo

中找到示例实现
SELECT t.*, (
    SELECT CASE COUNT(*) % 2
           WHEN 0        -- even sized table
           THEN (P1.x + MIN(CASE WHEN P2.x > P1.x
                                      THEN P2.x
                                      ELSE NULL END))/2.0
           ELSE P1.x --odd sized table
           END AS median 
      FROM Table333 AS P1, Table333 AS P2
      WHERE p1.id <= t.id AND p2.id <= t.id
     GROUP BY P1.x
    HAVING COUNT(CASE WHEN P1.x >= P2.x
                      THEN 1
                      ELSE NULL END)
           = (COUNT(*) + 1) / 2
    ) as Our_median
FROM Table333 t;

| id | x | rolling_median_x | our_median |
|----|---|------------------|------------|
|  1 | 4 |                4 |          4 |
|  2 | 1 |              2.5 |        2.5 |
|  3 | 3 |                3 |          3 |
|  4 | 2 |              2.5 |        2.5 |
|  5 | 1 |                2 |          2 |
|  6 | 6 |              2.5 |        2.5 |
|  7 | 9 |                3 |          3 |

这个查询可能会很慢 - 这是您使用旧版 Postgre 必须付出的代价SQL

I'm using psql 8.2.15 and updating is not an option unfortunately.

哎呀

如果是滚动均值,事情就简单了,但是滚动中位数会很慢,因为需要排序。避免这种情况的方法是将值插入到堆或 btree 中,这样可以在不对每个新值进行排序的情况下获得滚动中值。但这需要自定义代码。

我会使用 plpython 来实现这个:

Rolling median algorithm in C