Return data aggregated by one column id and data-aligned to another column id

I have a large table containing several million rows, like this:

CREATE TABLE mytable (
    row_id bigint,
    col_id bigint,
    value double precision,
    timestamp timestamp
);

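Given:

  1. list_row = a list of row_ids (can be ordered if needed)
  2. list_col = a list of col_ids (again, can be ordered if needed)
  3. Both lists may be very large (maybe 10s of thousands)
  4. The table above may contain several million rows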

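How do I (efficiently) return a result where:

  1. The columns are all the col_ids present in list_col, appearing in the same order as the col_ids appear in list_col
  2. The rows are all the row_ids present in list_row (they do not need to appear in the same order)
  3. Each field contains the value for the given row_id and col_id
  4. We are only interested in the most recently recorded value for any row_id:col_id pair, i.e. filter with MAX(timestamp) or something similar
  5. In the result, if there is no recorded value for a given row_id:col_id coordinate, that field should be null.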

A visual example to illustrate. The initial table:

+--------+--------+-------+-----------+
| row_id | col_id | value | timestamp |
+========+========+=======+===========+
|   10   |   20   |  100  | 2016-0... |
|   10   |   21   |  200  | 2015-0... |
|   11   |   20   |  300  | 2016-1... |
|   11   |   22   |  400  | 2016-0... |
+--------+--------+-------+-----------+

becomes:

                  col_id →
            +-----------------+
            | 20  | 21  | 22  |
            +=====+=====+=====+
row_id (10) | 100 | 200 |     |
   ↓   (11) | 300 |     | 400 |
            +-----+-----+-----+
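To pin down the intended semantics, here is a minimal reference sketch in plain Python. It is not meant as the efficient solution, only as a statement of what the result should contain. The function name pivot_latest is hypothetical, the timestamps are invented (the example above truncates them), and the extra older record for 10:20 is added purely to show that the most recent value wins.

```python
from datetime import datetime

def pivot_latest(records, list_row, list_col):
    """Return {row_id: [value or None, ...]} with values ordered like list_col."""
    col_pos = {col_id: i for i, col_id in enumerate(list_col)}
    wanted_rows = set(list_row)
    latest = {}  # (row_id, col_id) -> (timestamp, value)
    for row_id, col_id, value, ts in records:
        if row_id not in wanted_rows or col_id not in col_pos:
            continue
        key = (row_id, col_id)
        # Keep only the most recently recorded value per coordinate.
        if key not in latest or ts > latest[key][0]:
            latest[key] = (ts, value)
    # Coordinates without any record stay None (null).
    result = {row_id: [None] * len(list_col) for row_id in list_row}
    for (row_id, col_id), (_, value) in latest.items():
        result[row_id][col_pos[col_id]] = value
    return result

# Invented timestamps; the first record is a hypothetical older duplicate.
records = [
    (10, 20, 50.0, datetime(2014, 1, 1)),
    (10, 20, 100.0, datetime(2016, 1, 1)),
    (10, 21, 200.0, datetime(2015, 1, 1)),
    (11, 20, 300.0, datetime(2016, 1, 1)),
    (11, 22, 400.0, datetime(2016, 1, 1)),
]
pivoted = pivot_latest(records, list_row=[10, 11], list_col=[20, 21, 22])
```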

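I suspect the right answer is to start by creating a temporary table with the target col_ids as columns and then do some kind of join. I can't figure out how to do that efficiently. Is it possible to do this without needing a temporary table for each row_id?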

crosstab() would work for regular queries:

  • PostgreSQL Crosstab Query

But it does not fit your case, because:

  1. Both lists may be very large (maybe 10s of thousands)

That is too many columns for Postgres. The manual:

There is a limit on how many columns a table can contain. Depending on the column types, it is between 250 and 1600. However, defining a table with anywhere near this many columns is highly unusual and often a questionable design.

I suggest returning arrays instead. Something like this (works in any modern Postgres version):

SELECT row_id
     , array_agg(col_id) AS cols
     , array_agg(value)  AS vals
FROM  (
   SELECT DISTINCT ON (row_id, col_id)  --  most recent values for row_id:col_id pair 
          row_id, col_id, value
   FROM   mytable
   WHERE  row_id IN (<long list>)
   AND    col_id IN (<long list>)
   ORDER  BY row_id, col_id, timestamp DESC
   ) sub
GROUP   BY 1;
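The query above returns one (row_id, cols, vals) tuple per row_id, with the arrays in no particular column order. If the values still need to be aligned to list_col, that step can be done on the client. A minimal sketch in Python, assuming the driver hands the arrays back as Python lists; the helper name align_to_list_col and the sample rows are hypothetical:

```python
def align_to_list_col(cols, vals, list_col):
    """Reorder vals to follow list_col; missing col_ids become None."""
    by_col = dict(zip(cols, vals))
    return [by_col.get(col_id) for col_id in list_col]

# Sample result rows shaped like the query output: (row_id, cols, vals).
rows = [
    (10, [20, 21], [100.0, 200.0]),
    (11, [22, 20], [400.0, 300.0]),
]
list_col = [20, 21, 22]

aligned = {
    row_id: align_to_list_col(cols, vals, list_col)
    for row_id, cols, vals in rows
}
```

Note that in this sketch a stored NULL value and a missing record both end up as None, which matches the requirement in the question.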

About DISTINCT ON:

  • Select first row in each GROUP BY group?
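For readers unfamiliar with DISTINCT ON, the subquery sorts by (row_id, col_id, timestamp DESC) and keeps the first row of each (row_id, col_id) group. A rough Python equivalent, offered only as an illustration of the semantics: the function name and sample data are hypothetical, and it assumes no NULL timestamps (in Postgres, NULLs sort first under DESC, so such rows would otherwise win).

```python
from datetime import datetime
from itertools import groupby
from operator import itemgetter

def distinct_on_latest(records):
    """Emulate DISTINCT ON (row_id, col_id) ... ORDER BY row_id, col_id, timestamp DESC."""
    # Sort by timestamp descending first, then stable-sort by the group key,
    # so each group keeps its rows in newest-first order.
    ordered = sorted(records, key=itemgetter(3), reverse=True)
    ordered.sort(key=itemgetter(0, 1))
    # The first row of each group is the most recent one.
    return [next(group)[:3] for _, group in groupby(ordered, key=itemgetter(0, 1))]

# Tuples are (row_id, col_id, value, timestamp); timestamps are invented.
sample = [
    (10, 20, 50.0, datetime(2015, 1, 1)),
    (11, 20, 300.0, datetime(2016, 1, 1)),
    (10, 20, 100.0, datetime(2016, 1, 1)),
]
latest_rows = distinct_on_latest(sample)
```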

A few alternative ways to return the data:

SELECT json_agg(json_build_object('col_id', col_id
                                , 'value' , value)) AS col_values1  -- requires pg 9.4+
     , json_agg(json_build_object(col_id, value))   AS col_values2  -- requires pg 9.4+
     , array_agg(ARRAY[col_id, value])              AS col_values3  -- requires pg 9.5+
     , array_agg(hstore(col_id::text, value::text)) AS col_values4  -- requires pg 8.3+
FROM  ...  -- same as above

The last one requires the additional module hstore.
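One thing to keep in mind with the col_values2 variant: JSON object keys are always strings, so the col_id comes back as text and has to be converted on the client. A small sketch in Python, assuming the value arrives as a JSON string (some drivers may decode it already, in which case the json.loads call is unnecessary); the helper name and sample payload are hypothetical:

```python
import json

def merge_single_key_objects(col_values2_json):
    """Turn '[{"20": 100}, {"21": 200}]' into {20: 100, 21: 200}."""
    merged = {}
    for obj in json.loads(col_values2_json):
        for key, value in obj.items():
            merged[int(key)] = value  # JSON keys are strings; restore the integer col_id
    return merged

raw = '[{"20": 100}, {"21": 200}]'
values_by_col = merge_single_key_objects(raw)
```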