Speed Up Query with Multiple Inner and Outer Joins

I'm having trouble with a slow PostgreSQL query. I have already made the standard postgresql.conf changes and verified that the referenced columns are indexed. Beyond that, I'm not sure what to try next. The query below takes a little under 3 minutes to run. Any help is appreciated.

select distinct
     exp.assay_id as ASSAY_KEY,
     rest.result_type_id as RESULT_TYPE_ID,
     rest.name as RESULT_TYPE,
     rest.unit as REST_UNIT,
     dtrest.name as REST_DATA_TYPE,
     cont.condition_type_id as COND_TYPE_ID,
     cont.name as COND_TYPE,
     cont.unit as COND_UNIT,
     dtcont.name as COND_DATA_TYPE,
     expcon.unit as EXP_COND_UNIT
from
     public.experiment exp
     inner join public.experiment_result expr on expr.experiment_id = exp.experiment_id
     inner join public.result_type rest on rest.result_type_id = expr.result_type_id
     left outer join public.experiment_condition expcon on expcon.experiment_id = expr.experiment_id
     left outer join public.condition_type cont on cont.condition_type_id = expcon.condition_type_id
     left outer join public.data_type dtcont on dtcont.data_type_id = cont.data_type_id
     left outer join public.data_type dtrest on dtrest.data_type_id = rest.data_type_id
where
     exp.assay_id in (255)

EXPLAIN ANALYZE output:

Unique  (cost=51405438.73..52671302.26 rows=50634541 width=1109) (actual time=123349.423..164779.863 rows=3 loops=1)
  ->  Sort  (cost=51405438.73..51532025.09 rows=50634541 width=1109) (actual time=123349.421..157973.215 rows=29521242 loops=1)
        Sort Key: rest.result_type_id, rest.name, rest.unit, dtrest.name, cont.condition_type_id, cont.name, cont.unit, dtcont.name, expcon.unit
        Sort Method: external merge  Disk: 3081440kB
        ->  Hash Left Join  (cost=56379.88..1743073.05 rows=50634541 width=1109) (actual time=1307.931..26398.626 rows=29521242 loops=1)
              Hash Cond: (rest.data_type_id = dtrest.data_type_id)
              ->  Hash Left Join  (cost=56378.68..1547566.26 rows=50634541 width=799) (actual time=1307.894..21181.787 rows=29521242 loops=1)
                    Hash Cond: (expr.experiment_id = expcon.experiment_id)
                    ->  Hash Join  (cost=5096.61..572059.62 rows=15984826 width=47) (actual time=1002.697..11046.550 rows=9840414 loops=1)
                          Hash Cond: (expr.result_type_id = rest.result_type_id)
                          ->  Hash Join  (cost=5091.86..528637.07 rows=15984826 width=24) (actual time=44.062..7969.272 rows=9840414 loops=1)
                                Hash Cond: (expr.experiment_id = exp.experiment_id)
                                ->  Seq Scan on experiment_result expr  (cost=0.00..462557.70 rows=23232570 width=16) (actual time=0.080..4357.646 rows=23232570 loops=1)
                                ->  Hash  (cost=3986.11..3986.11 rows=88460 width=16) (actual time=43.743..43.744 rows=88135 loops=1)
                                      Buckets: 131072  Batches: 1  Memory Usage: 5156kB
                                      ->  Seq Scan on experiment exp  (cost=0.00..3986.11 rows=88460 width=16) (actual time=0.016..24.426 rows=88135 loops=1)
                                            Filter: (assay_id = 255)
                                            Rows Removed by Filter: 40434
                          ->  Hash  (cost=3.22..3.22 rows=122 width=31) (actual time=958.617..958.618 rows=128 loops=1)
                                Buckets: 1024  Batches: 1  Memory Usage: 17kB
                                ->  Seq Scan on result_type rest  (cost=0.00..3.22 rows=122 width=31) (actual time=958.542..958.575 rows=128 loops=1)
                    ->  Hash  (cost=9509.53..9509.53 rows=382603 width=768) (actual time=294.654..294.658 rows=382553 loops=1)
                          Buckets: 16384  Batches: 32  Memory Usage: 1077kB
                          ->  Hash Left Join  (cost=2.67..9509.53 rows=382603 width=768) (actual time=0.074..176.040 rows=382553 loops=1)
                                Hash Cond: (cont.data_type_id = dtcont.data_type_id)
                                ->  Hash Left Join  (cost=1.47..8301.31 rows=382603 width=458) (actual time=0.048..117.994 rows=382553 loops=1)
                                      Hash Cond: (expcon.condition_type_id = cont.condition_type_id)
                                      ->  Seq Scan on experiment_condition expcon  (cost=0.00..7102.03 rows=382603 width=74) (actual time=0.016..48.704 rows=382553 loops=1)
                                      ->  Hash  (cost=1.21..1.21 rows=21 width=392) (actual time=0.021..0.022 rows=24 loops=1)
                                            Buckets: 1024  Batches: 1  Memory Usage: 10kB
                                            ->  Seq Scan on condition_type cont  (cost=0.00..1.21 rows=21 width=392) (actual time=0.012..0.014 rows=24 loops=1)
                                ->  Hash  (cost=1.09..1.09 rows=9 width=326) (actual time=0.015..0.016 rows=9 loops=1)
                                      Buckets: 1024  Batches: 1  Memory Usage: 9kB
                                      ->  Seq Scan on data_type dtcont  (cost=0.00..1.09 rows=9 width=326) (actual time=0.008..0.010 rows=9 loops=1)
              ->  Hash  (cost=1.09..1.09 rows=9 width=326) (actual time=0.018..0.019 rows=9 loops=1)
                    Buckets: 1024  Batches: 1  Memory Usage: 9kB
                    ->  Seq Scan on data_type dtrest  (cost=0.00..1.09 rows=9 width=326) (actual time=0.012..0.014 rows=9 loops=1)
Planning Time: 5.997 ms
JIT:
  Functions: 55
  Options: Inlining true, Optimization true, Expressions true, Deforming true
  Timing: Generation 19.084 ms, Inlining 20.283 ms, Optimization 604.666 ms, Emission 332.835 ms, Total 976.868 ms
Execution Time: 165268.155 ms

The joins produce almost 30 million rows for the query to process, because your condition exp.assay_id in (255) is not very selective: the plan shows it keeps 88,135 of the 128,569 rows in experiment.

It just so happens that most of those result rows are identical, so only three distinct rows are left after DISTINCT.

So there is no way to make this query lightning fast: it has to look at almost 30 million rows just to determine that only three of them are distinct.

However, most of the execution time (132 of the 165 seconds) is spent sorting: the Sort node's actual time runs to about 158 seconds, while the Hash Left Join feeding it finishes after about 26 seconds, and the sort spills roughly 3 GB to disk as an external merge. So it should be possible to make the query faster.

Some ideas to try:

  • Increase work_mem as much as you can; that will make the sort faster (for example, the session-level sketch below).
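
A minimal sketch of that, assuming the machine can spare a few gigabytes for this one session. The plan shows the external merge spilling about 3 GB to disk, so a value in that range keeps the sort in memory; the 4GB below is a hypothetical figure you should size against your actual RAM:

SET work_mem = '4GB';    -- session-level only; hypothetical value sized to the ~3 GB spill

-- ... rerun the query from above ...

RESET work_mem;          -- restore the configured default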

PostgreSQL chooses an explicit sort because it doesn't know that so many of the rows are identical; otherwise it would choose the much faster hash aggregate. Perhaps we can exploit that:

  • Try running the query with SET enable_sort = off; and see whether that makes PostgreSQL choose a hash aggregate (a sketch follows this list).

  • Upgrade to PostgreSQL v13, which has become smarter about hash aggregates and more willing to use them.
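
A minimal sketch of the enable_sort experiment, scoped to the current session:

SET enable_sort = off;   -- penalize explicit sorts so the planner considers hashing

EXPLAIN (ANALYZE)        -- look for a HashAggregate instead of Sort + Unique
select distinct ... ;    -- the query from above

RESET enable_sort;       -- restore the default afterwards

If you do end up on v13 or later, a related knob (my addition, not something the plan above calls for) is hash_mem_multiplier, introduced in v13, which lets hash-based nodes use a multiple of work_mem without raising work_mem itself:

SET hash_mem_multiplier = 8;   -- v13+; hypothetical value, allows hash nodes up to 8 x work_mem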