Optimize query on big table executing generate_series()
The following query takes more than 7 minutes in PostgreSQL 11.1:
SELECT
'2019-01-19' as date,
'2019-01-19'::date - generate_series(first_observed, last_observed, interval '1 day')::date as days_to_date,
ROUND(AVG(price)) as price,
area_id
FROM
table_example
GROUP BY
days_to_date, area_id;
table_example has about 15 million rows.
Is there any way to optimize this? I have already added the following indexes:
CREATE INDEX ON table_example (first_observed, last_observed);
CREATE INDEX ON table_example (area_id);
This is the output of EXPLAIN (ANALYZE, BUFFERS):
GroupAggregate (cost=3235559683.68..3377398628.68 rows=1418000 width=72) (actual time=334933.966..440096.869 rows=21688 loops=1)
Group Key: (('2019-01-19'::date - ((generate_series((first_observed)::timestamp with time zone, (last_observed)::timestamp with time zone, '1 day'::interval)))::date)), area_id
Buffers: local read=118167 dirtied=118167 written=117143, temp read=1634631 written=1635058
-> Sort (cost=3235559683.68..3271009671.18 rows=14179995000 width=40) (actual time=334923.933..391690.184 rows=380203171 loops=1)
Sort Key: (('2019-01-19'::date - ((generate_series((first_observed)::timestamp with time zone, (last_observed)::timestamp with time zone, '1 day'::interval)))::date)), area_id
Sort Method: external merge Disk: 9187584kB
Buffers: local read=118167 dirtied=118167 written=117143, temp read=1634631 written=1635058
-> Result (cost=0.00..390387079.39 rows=14179995000 width=40) (actual time=214.798..171717.941 rows=380203171 loops=1)
Buffers: local read=118167 dirtied=118167 written=117143
-> ProjectSet (cost=0.00..71337191.89 rows=14179995000 width=44) (actual time=214.796..102823.749 rows=380203171 loops=1)
Buffers: local read=118167 dirtied=118167 written=117143
-> Seq Scan on table_example (cost=0.00..259966.95 rows=14179995 width=44) (actual time=0.031..2449.511 rows=14179995 loops=1)
Buffers: local read=118167 dirtied=118167 written=117143
Planning Time: 0.409 ms
JIT:
Functions: 18
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 5.034 ms, Inlining 13.010 ms, Optimization 121.440 ms, Emission 79.996 ms, Total 219.480 ms
Execution Time: 441133.410 ms
This is what table_example looks like:
column name     data type
house_pk        integer
date_in         date
first_observed  date
last_observed   date
price           numeric
area_id         integer
There are 60 different area_ids.
The query is running on a machine with 24 cores and 128 GB of RAM. The settings may not be optimal, though.
While processing the whole table, indexes are typically useless (with the possible exception of an index-only scan if table rows are much wider than the index).
And while processing the whole table, I don't see much room for performance optimization of the query itself. One minor thing:
SELECT d.the_date
     , generate_series(d.the_date - last_observed
                     , d.the_date - first_observed) AS days_to_date
, round(avg(price)) AS price
, area_id
FROM table_example
, (SELECT date '2019-01-19') AS d(the_date)
GROUP BY days_to_date, area_id;
Assuming first_observed and last_observed are date NOT NULL and always < date '2019-01-19'. Else you need to cast / do more.
This way, you have only two subtractions and then generate_series() works with integers (fastest).
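As a sanity check on that equivalence, here is a small Python sketch (with arbitrarily chosen sample values) showing that subtracting the dates first and generating a plain integer series yields the same set of offsets as generating one date per day and subtracting afterwards:

```python
from datetime import date, timedelta

the_date = date(2019, 1, 19)        # the reference date from the query
first_observed = date(2019, 1, 10)  # sample row values
last_observed = date(2019, 1, 15)

# Original form: generate one date per day, then subtract each from the_date
original = [(the_date - (first_observed + timedelta(days=i))).days
            for i in range((last_observed - first_observed).days + 1)]

# Rewritten form: subtract the dates first, then generate plain integers
optimized = list(range((the_date - last_observed).days,
                       (the_date - first_observed).days + 1))

print(sorted(original) == sorted(optimized))  # True
```

The two forms produce the same days_to_date values; the rewritten one just avoids per-row interval arithmetic.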
The added mini-subquery is just for convenience, to provide the date only once. In a prepared statement or function you can use a parameter and don't need this:
, (SELECT date '2019-01-19') AS d(the_date)
Apart from that, if EXPLAIN (ANALYZE, BUFFERS) mentions "Disk" (example: Sort Method: external merge Disk: 3240kB), then a (temporarily) higher setting for work_mem should help. See:
- Configuration parameter work_mem in PostgreSQL on Linux
- Optimize simple query using ORDER BY date and text
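In this case the sort spilled roughly 9 GB to disk (Sort Method: external merge Disk: 9187584kB), so the setting would have to be sized accordingly. A hedged sketch, assuming the 128 GB machine can spare that much for a single session:

```sql
SET work_mem = '12GB';  -- session-local; in-memory sorts need somewhat more than the on-disk spill size
-- ... run the query ...
RESET work_mem;
```

Set it per session (or per transaction with SET LOCAL) rather than globally, since work_mem applies per sort/hash node and per connection.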
If you cannot afford more RAM and the aggregate and/or sort steps still spill to disk, it may help to divide and conquer with a query using a LATERAL join, like:
SELECT d.the_date, f.*, a.area_id
FROM area a
, (SELECT date '2019-01-19') AS d(the_date)
, LATERAL (
SELECT generate_series(d.the_date - last_observed
, d.the_date - first_observed) AS days_to_date
, round(avg(price)) AS price
FROM table_example
WHERE area_id = a.area_id
GROUP BY 1
) f;
Assuming a table area, obviously.
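If no such area table exists, a hedged variation derives the 60 distinct IDs on the fly instead (a dedicated area table, or an emulated loose index scan over the index on area_id, would make that step cheaper):

```sql
SELECT d.the_date, f.*, a.area_id
FROM  (SELECT DISTINCT area_id FROM table_example) a
    , (SELECT date '2019-01-19') AS d(the_date)
    , LATERAL (
   SELECT generate_series(d.the_date - last_observed
                        , d.the_date - first_observed) AS days_to_date
        , round(avg(price)) AS price
   FROM   table_example
   WHERE  area_id = a.area_id
   GROUP  BY 1
   ) f;
```

Each LATERAL invocation then only has to sort and aggregate one area's rows at a time, keeping the per-step working set small.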