Return data aggregated by one column id and data-aligned to another column id
I have a large table containing millions of rows, like this:
CREATE TABLE mytable (
row_id bigint,
col_id bigint,
value double precision,
timestamp timestamp
);
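Given:
- list_row = a list of row_ids (can be sorted if needed)
- list_col = a list of col_ids (again, can be sorted if needed)
- Both lists may be very large (maybe 10s of thousands)
- The table above may have many millions of entries
How can I (efficiently) return a result set where:
- The columns are all the col_ids present in list_col, appearing in the same order as they do in list_col
- The rows are all the row_ids present in list_row (they do not need to appear in the same order)
- Each field contains the value for the given row_id and col_id.
- We are only interested in the most recently recorded value for any row_id:col_id pair, i.e. using MAX(timestamp) or something similar as a filter
- In the result, if there is no recorded value for a given row_id:col_id coordinate, that field should be null.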
A visual example to illustrate. The initial table:
+--------+--------+-------+-----------+
| row_id | col_id | value | timestamp |
+========+========+=======+===========+
| 10 | 20 | 100 | 2016-0... |
| 10 | 21 | 200 | 2015-0... |
| 11 | 20 | 300 | 2016-1... |
| 11 | 22 | 400 | 2016-0... |
+--------+--------+-------+-----------+
becomes:
col_id →
+-----------------+
| 20 | 21 | 22 |
+=====+=====+=====+
row_id (10) | 100 | 200 | |
↓ (11) | 300 | | 400 |
+-----+-----+-----+
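I suspect the right answer is to first create a temporary table with the target col_ids as columns, and then do some kind of join. I cannot work out how to do this efficiently. Is it possible to do this without needing a temporary table for each row_id?
crosstab() works for typical queries:
- PostgreSQL Crosstab Query
But it is not suitable for your case, because: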
- Both lists may be very large (maybe 10s of thousands)
That is too many columns for Postgres. The manual:
There is a limit on how many columns a table can contain. Depending on
the column types, it is between 250 and 1600. However, defining a
table with anywhere near this many columns is highly unusual and often
a questionable design.
I suggest returning arrays instead. Something like this (works in any modern Postgres version):
SELECT row_id
, array_agg(col_id) AS cols
, array_agg(value) AS vals
FROM (
SELECT DISTINCT ON (row_id, col_id) -- most recent values for row_id:col_id pair
row_id, col_id, value
FROM mytable
WHERE row_id IN (<long list>)
AND col_id IN (<long list>)
ORDER BY row_id, col_id, timestamp DESC
) sub
GROUP BY 1;
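The query above returns one row per row_id with two parallel arrays, which are not yet aligned to list_col. A minimal client-side sketch in Python of how that alignment could be done, assuming the database driver hands back each result row as a (row_id, cols, vals) tuple with the arrays as Python lists. The name align_rows and the sample data are hypothetical and only illustrate the idea:

```python
from typing import Dict, Iterable, List, Optional, Sequence, Tuple

ResultRow = Tuple[int, Sequence[int], Sequence[float]]


def align_rows(
    result: Iterable[ResultRow],
    list_col: Sequence[int],
) -> Dict[int, List[Optional[float]]]:
    """Align each (row_id, cols, vals) row to the order of list_col.

    Hypothetical helper: positions without a recorded value stay None.
    """
    # Map each requested col_id to its position in the output row.
    position = {col_id: i for i, col_id in enumerate(list_col)}
    aligned: Dict[int, List[Optional[float]]] = {}
    for row_id, cols, vals in result:
        out: List[Optional[float]] = [None] * len(list_col)
        for col_id, value in zip(cols, vals):
            idx = position.get(col_id)
            if idx is not None:  # ignore col_ids that were not requested
                out[idx] = value
        aligned[row_id] = out
    return aligned


# Rows shaped like the output of the query above for the sample table.
sample_result = [
    (10, [20, 21], [100.0, 200.0]),
    (11, [20, 22], [300.0, 400.0]),
]
matrix = align_rows(sample_result, list_col=[20, 21, 22])
```

A row_id from list_row that has no matching rows at all does not appear in the query result, so the client would have to add an all-None row for it if one is needed.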
About DISTINCT ON:
- Select first row in each GROUP BY group?
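To make concrete which row survives DISTINCT ON (row_id, col_id) combined with ORDER BY row_id, col_id, timestamp DESC, here is a small Python sketch of the same rule: keep only the newest record per pair. The timestamps and the extra older record for the pair (10, 20) are made up for illustration, and latest_per_pair is a hypothetical name:

```python
from datetime import datetime
from typing import Dict, Iterable, Tuple

Record = Tuple[int, int, float, datetime]  # (row_id, col_id, value, timestamp)


def latest_per_pair(records: Iterable[Record]) -> Dict[Tuple[int, int], float]:
    """Return the most recently recorded value for each (row_id, col_id)."""
    newest: Dict[Tuple[int, int], Tuple[datetime, float]] = {}
    for row_id, col_id, value, ts in records:
        key = (row_id, col_id)
        # Keep this record only if it is newer than the one seen so far.
        if key not in newest or ts > newest[key][0]:
            newest[key] = (ts, value)
    return {key: value for key, (_, value) in newest.items()}


# Made-up data: two records for the pair (10, 20); the newer one holds 100.
records = [
    (10, 20, 50.0, datetime(2015, 1, 1)),
    (10, 20, 100.0, datetime(2016, 1, 1)),
    (10, 21, 200.0, datetime(2015, 1, 1)),
]
latest = latest_per_pair(records)
```

In practice the database should do this filtering; the sketch only shows the selection rule.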
Some alternative ways to return the data:
SELECT json_agg(json_build_object('col_id', col_id
, 'value' , value)) AS col_values1 -- requires pg 9.4+
, json_agg(json_build_object(col_id, value)) AS col_values2 -- requires pg 9.4+
, array_agg(ARRAY[col_id, value]) AS col_values3 -- requires pg 9.5+
, array_agg(hstore(col_id::text, value::text)) AS col_values4 -- requires pg 8.3+
FROM ... -- same as above
The last one requires the additional module hstore.
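On the client, the col_values2 form is a JSON array of single-key objects. A short Python sketch of how it might be merged into one mapping per row, assuming the value arrives as JSON text (some drivers decode JSON themselves, in which case the json.loads step is unnecessary). The function name and the sample string are hypothetical:

```python
import json
from typing import Dict


def merge_col_values(json_text: str) -> Dict[int, float]:
    """Merge a JSON array of single-key objects into one {col_id: value} dict."""
    merged: Dict[int, float] = {}
    for obj in json.loads(json_text):
        for key, value in obj.items():
            # JSON object keys are always strings, so convert back to int.
            merged[int(key)] = value
    return merged


# Assumed shape of col_values2 for row_id 10 in the sample table.
by_col = merge_col_values('[{"20": 100}, {"21": 200}]')
```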