Speed Up Query with Multiple Inner and Outer Joins
I'm having trouble with a slow PostgreSQL query. I have already made the standard postgresql.conf changes and verified that the referenced columns are indexed. Beyond that, I'm not sure what the next step would be. The query below takes just under 3 minutes to run. Any help is appreciated.
select distinct
exp.assay_id as ASSAY_KEY,
rest.result_type_id as RESULT_TYPE_ID,
rest.name as RESULT_TYPE,
rest.unit as REST_UNIT,
dtrest.name as REST_DATA_TYPE,
cont.condition_type_id as COND_TYPE_ID,
cont.name as COND_TYPE,
cont.unit as COND_UNIT,
dtcont.name as COND_DATA_TYPE,
expcon.unit as EXP_COND_UNIT
from
public.experiment exp
inner join public.experiment_result expr on expr.experiment_id = exp.experiment_id
inner join public.result_type rest on rest.result_type_id = expr.result_type_id
left outer join public.experiment_condition expcon on expcon.experiment_id = expr.experiment_id
left outer join public.condition_type cont on cont.condition_type_id = expcon.condition_type_id
left outer join public.data_type dtcont on dtcont.data_type_id = cont.data_type_id
left outer join public.data_type dtrest on dtrest.data_type_ID = rest.data_type_ID
where
exp.assay_id in (255)
EXPLAIN ANALYZE output:
Unique (cost=51405438.73..52671302.26 rows=50634541 width=1109) (actual time=123349.423..164779.863 rows=3 loops=1)
  ->  Sort (cost=51405438.73..51532025.09 rows=50634541 width=1109) (actual time=123349.421..157973.215 rows=29521242 loops=1)
        Sort Key: rest.result_type_id, rest.name, rest.unit, dtrest.name, cont.condition_type_id, cont.name, cont.unit, dtcont.name, expcon.unit
        Sort Method: external merge Disk: 3081440kB
        ->  Hash Left Join (cost=56379.88..1743073.05 rows=50634541 width=1109) (actual time=1307.931..26398.626 rows=29521242 loops=1)
              Hash Cond: (rest.data_type_id = dtrest.data_type_id)
              ->  Hash Left Join (cost=56378.68..1547566.26 rows=50634541 width=799) (actual time=1307.894..21181.787 rows=29521242 loops=1)
                    Hash Cond: (expr.experiment_id = expcon.experiment_id)
                    ->  Hash Join (cost=5096.61..572059.62 rows=15984826 width=47) (actual time=1002.697..11046.550 rows=9840414 loops=1)
                          Hash Cond: (expr.result_type_id = rest.result_type_id)
                          ->  Hash Join (cost=5091.86..528637.07 rows=15984826 width=24) (actual time=44.062..7969.272 rows=9840414 loops=1)
                                Hash Cond: (expr.experiment_id = exp.experiment_id)
                                ->  Seq Scan on experiment_result expr (cost=0.00..462557.70 rows=23232570 width=16) (actual time=0.080..4357.646 rows=23232570 loops=1)
                                ->  Hash (cost=3986.11..3986.11 rows=88460 width=16) (actual time=43.743..43.744 rows=88135 loops=1)
                                      Buckets: 131072 Batches: 1 Memory Usage: 5156kB
                                      ->  Seq Scan on experiment exp (cost=0.00..3986.11 rows=88460 width=16) (actual time=0.016..24.426 rows=88135 loops=1)
                                            Filter: (assay_id = 255)
                                            Rows Removed by Filter: 40434
                          ->  Hash (cost=3.22..3.22 rows=122 width=31) (actual time=958.617..958.618 rows=128 loops=1)
                                Buckets: 1024 Batches: 1 Memory Usage: 17kB
                                ->  Seq Scan on result_type rest (cost=0.00..3.22 rows=122 width=31) (actual time=958.542..958.575 rows=128 loops=1)
                    ->  Hash (cost=9509.53..9509.53 rows=382603 width=768) (actual time=294.654..294.658 rows=382553 loops=1)
                          Buckets: 16384 Batches: 32 Memory Usage: 1077kB
                          ->  Hash Left Join (cost=2.67..9509.53 rows=382603 width=768) (actual time=0.074..176.040 rows=382553 loops=1)
                                Hash Cond: (cont.data_type_id = dtcont.data_type_id)
                                ->  Hash Left Join (cost=1.47..8301.31 rows=382603 width=458) (actual time=0.048..117.994 rows=382553 loops=1)
                                      Hash Cond: (expcon.condition_type_id = cont.condition_type_id)
                                      ->  Seq Scan on experiment_condition expcon (cost=0.00..7102.03 rows=382603 width=74) (actual time=0.016..48.704 rows=382553 loops=1)
                                      ->  Hash (cost=1.21..1.21 rows=21 width=392) (actual time=0.021..0.022 rows=24 loops=1)
                                            Buckets: 1024 Batches: 1 Memory Usage: 10kB
                                            ->  Seq Scan on condition_type cont (cost=0.00..1.21 rows=21 width=392) (actual time=0.012..0.014 rows=24 loops=1)
                                ->  Hash (cost=1.09..1.09 rows=9 width=326) (actual time=0.015..0.016 rows=9 loops=1)
                                      Buckets: 1024 Batches: 1 Memory Usage: 9kB
                                      ->  Seq Scan on data_type dtcont (cost=0.00..1.09 rows=9 width=326) (actual time=0.008..0.010 rows=9 loops=1)
              ->  Hash (cost=1.09..1.09 rows=9 width=326) (actual time=0.018..0.019 rows=9 loops=1)
                    Buckets: 1024 Batches: 1 Memory Usage: 9kB
                    ->  Seq Scan on data_type dtrest (cost=0.00..1.09 rows=9 width=326) (actual time=0.012..0.014 rows=9 loops=1)
Planning Time: 5.997 ms
JIT:
  Functions: 55
  Options: Inlining true, Optimization true, Expressions true, Deforming true
  Timing: Generation 19.084 ms, Inlining 20.283 ms, Optimization 604.666 ms, Emission 332.835 ms, Total 976.868 ms
Execution Time: 165268.155 ms
The query has to process 30 million rows coming out of the joins, because your condition exp.assay_id in (255) is not very selective.

It just so happens that most of those result rows are identical, so only three distinct rows remain after the DISTINCT.

So there is no way to make this query lightning fast: it has to look at 30 million rows just to determine that only three of them are distinct.

But the bulk of the execution time (132 of the 165 seconds) is spent on the sort, so it should be possible to make the query considerably faster.

Some ideas to try:

- Increase work_mem as far as you can; that will make the sort faster (see the first sketch after this list).

- PostgreSQL chooses an explicit sort because it doesn't know that so many of the rows are identical; otherwise it would pick the much faster hash aggregate. Perhaps we can exploit that:

  - Run the query with SET enable_sort = off; and see whether that makes PostgreSQL choose a hash aggregate (see the second sketch after this list).

  - Upgrade to PostgreSQL v13, which has become smarter about hash aggregates and more willing to use them.
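
A minimal sketch of the work_mem experiment, assuming your role is allowed to change the setting and the machine has RAM to spare. The '2GB' value is an illustration, not a recommendation: the plan reports "Sort Method: external merge Disk: 3081440kB", i.e. the sort spills roughly 3 GB to disk, and the closer work_mem gets to that, the less of the sort has to go through disk.

-- Session-local; nothing is written to postgresql.conf.
-- '2GB' is an assumed value; size it to what the machine can actually spare.
SET work_mem = '2GB';

-- Re-run the query under EXPLAIN (ANALYZE) and check whether the Sort node
-- now reports an in-memory method such as "quicksort" instead of
-- "external merge".

-- Return to the server default when done.
RESET work_mem;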
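
Likewise, a sketch of the enable_sort experiment; the setting is session-local, so it is safe to try and then undo:

-- Discourage the planner from using explicit Sort nodes in this session,
-- nudging it toward a HashAggregate for the DISTINCT.
SET enable_sort = off;

-- Re-run EXPLAIN (ANALYZE) on the query: if the top of the plan now shows
-- HashAggregate instead of Unique over Sort, most of the 132 seconds spent
-- sorting should disappear.

-- Re-enable sorting afterwards; leaving it off would hurt other queries.
RESET enable_sort;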