Improving performance of ORDER BY on jsonb cross join with inner join group by

I have two tables in PostgreSQL 12: datasets, where a dataset has many cfiles, and cfiles, where each cfile belongs to one dataset.

SELECT * FROM datasets;
 id |   name   
----+----------
  1 | dataset1
  2 | dataset2

SELECT * FROM cfiles;
 id | dataset_id |                property_values (jsonb)        
----+------------+-----------------------------------------------
  1 |          1 | {"Sample Names": ["SampA", "SampB", "SampC"]}
  2 |          1 | {"Sample Names": ["SampA", "SampB", "SampD"]}
  3 |          1 | {"Sample Names": ["SampE"]}
  4 |          2 | {"Sample Names": ["SampA, SampF"]}
  5 |          2 | {"Sample Names": ["SampG"]}
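
For reference, here is a minimal sketch of the assumed schema, reconstructed from the sample data above and the query plans further down (the real tables carry more columns, so treat this only as an approximation):

CREATE TABLE datasets (
  id   serial PRIMARY KEY,
  name text NOT NULL
);

CREATE TABLE cfiles (
  id              serial PRIMARY KEY,
  dataset_id      integer NOT NULL REFERENCES datasets(id),
  property_values jsonb   NOT NULL
);

-- the plans below reference an index named index_cfiles_dataset_id_path;
-- assumed here to be roughly equivalent to an index on dataset_id
CREATE INDEX index_cfiles_dataset_id_path ON cfiles (dataset_id);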

I am trying to get this result:

 id |   name   |           sample_names            
----+----------+-----------------------------------
  1 | dataset1 | SampA; SampB; SampC; SampD; SampE
  2 | dataset2 | SampA, SampF; SampG

Based on an SO question and its very good answer, I have the following query:

SELECT datasets.id, datasets.name,
string_agg(DISTINCT sn.sample_names, '; ' ORDER BY sn.sample_names) as sample_names
FROM cfiles
CROSS JOIN jsonb_array_elements_text(cfiles.property_values -> 'Sample Names') as sn(sample_names)
JOIN datasets on cfiles.dataset_id=datasets.id
GROUP BY datasets.id, datasets.name
-- Problematic line:
-- ORDER BY datasets.name
LIMIT 20;

This works very well, until I need to order the results.

With 45K cfiles rows: ~12ms without the ORDER BY, ~58881ms with the ORDER BY.
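
The plans further down include VERBOSE output, so presumably the timings were gathered with something along these lines (a sketch; only the ORDER BY line differs between the two runs):

EXPLAIN (ANALYZE, VERBOSE)
SELECT datasets.id, datasets.name,
       string_agg(DISTINCT sn.sample_names, '; ' ORDER BY sn.sample_names) AS sample_names
FROM cfiles
CROSS JOIN jsonb_array_elements_text(cfiles.property_values -> 'Sample Names') AS sn(sample_names)
JOIN datasets ON cfiles.dataset_id = datasets.id
GROUP BY datasets.id, datasets.name
ORDER BY datasets.name   -- comment this out for the fast ~12ms plan
LIMIT 20;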

Below is my original query (from the SO question above). It is far less elegant and relies on string manipulation, but at ~5150ms it is about 10x faster than the cross join version:

SELECT datasets.id,
datasets.name,
ARRAY_TO_STRING(
  ARRAY(
    SELECT DISTINCT * FROM unnest(
      STRING_TO_ARRAY(
        STRING_AGG(
          DISTINCT REPLACE(
            REPLACE(
              REPLACE(
                REPLACE(
                  cfiles.property_values ->> 'Sample Names', '",' || chr(32) || '"', ';'
                ), '[' , ''
              ), '"' , ''
            ), ']' , ''
          ), ';'
        ), ';'
      )
    ) ORDER BY 1 ASC
  ), '; '
) as sample_names
FROM datasets
JOIN cfiles ON cfiles.dataset_id=datasets.id
GROUP BY datasets.id, datasets.name
ORDER BY datasets.name
LIMIT 20;

Is there any way to improve the performance of the cross join query above, including the ORDER BY, so that it is faster than the string-manipulation alternative?

Query plan for the cross join query without ORDER BY:

                                                                                                                               QUERY PLAN                                                                                                                                
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.69..4351.18 rows=20 width=106) (actual time=0.409..11.706 rows=20 loops=1)
   Output: datasets.id, datasets.name, (string_agg(DISTINCT sn.sample_names, '; '::text ORDER BY sn.sample_names))
   ->  GroupAggregate  (cost=0.69..132907.88 rows=611 width=106) (actual time=0.407..11.694 rows=20 loops=1)
         Output: datasets.id, datasets.name, string_agg(DISTINCT sn.sample_names, '; '::text ORDER BY sn.sample_names)
         Group Key: datasets.id
         ->  Nested Loop  (cost=0.69..109992.24 rows=4581600 width=106) (actual time=0.065..10.742 rows=207 loops=1)
               Output: datasets.id, datasets.name, sn.sample_names
               ->  Merge Join  (cost=0.69..18360.24 rows=45816 width=527) (actual time=0.042..5.155 rows=1697 loops=1)
                     Output: cfiles.property_values, datasets.id, datasets.name
                     Inner Unique: true
                     Merge Cond: (cfiles.dataset_id = datasets.id)
                     ->  Index Scan using index_cfiles_dataset_id_path on public.cfiles  (cost=0.41..17682.45 rows=45816 width=461) (actual time=0.016..2.665 rows=1697 loops=1)
                           Output: cfiles.id, cfiles.tid, cfiles.uuid, cfiles.dataset_id, cfiles.path, cfiles.name, cfiles.checksum, cfiles.size, cfiles.last_modified, cfiles.content_type, cfiles.locked, cfiles.property_values, cfiles.created_at, cfiles.updated_at
                     ->  Index Scan using datasets_pkey on public.datasets  (cost=0.28..103.56 rows=611 width=74) (actual time=0.016..0.066 rows=27 loops=1)
                           Output: datasets.id, datasets.tid, datasets.bucket_path_id, datasets.path, datasets.name, datasets.last_modified, datasets.file_count, datasets.size, datasets.content_types, datasets.locked, datasets.created_at, datasets.updated_at
               ->  Function Scan on pg_catalog.jsonb_array_elements_text sn  (cost=0.01..1.00 rows=100 width=32) (actual time=0.002..0.002 rows=0 loops=1697)
                     Output: sn.sample_names
                     Function Call: jsonb_array_elements_text((cfiles.property_values -> 'Sample Names'::text))
 Planning Time: 0.926 ms
 Execution Time: 11.845 ms
(20 rows)

Query plan for the cross join query with ORDER BY:

                                                                               QUERY PLAN                                                                               
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=130727.27..130727.32 rows=20 width=106) (actual time=60970.131..60970.140 rows=20 loops=1)
   Output: datasets.id, datasets.name, (string_agg(DISTINCT sn.sample_names, '; '::text ORDER BY sn.sample_names))
   ->  Sort  (cost=130727.27..130728.79 rows=611 width=106) (actual time=60970.128..60970.132 rows=20 loops=1)
         Output: datasets.id, datasets.name, (string_agg(DISTINCT sn.sample_names, '; '::text ORDER BY sn.sample_names))
         Sort Key: datasets.name
         Sort Method: top-N heapsort  Memory: 27kB
         ->  GroupAggregate  (cost=10585.66..130711.01 rows=611 width=106) (actual time=112.152..60965.350 rows=598 loops=1)
               Output: datasets.id, datasets.name, string_agg(DISTINCT sn.sample_names, '; '::text ORDER BY sn.sample_names)
               Group Key: datasets.id
               ->  Nested Loop  (cost=10585.66..107795.37 rows=4581600 width=106) (actual time=111.856..4959.284 rows=3289130 loops=1)
                     Output: datasets.id, datasets.name, sn.sample_names
                     ->  Gather Merge  (cost=10585.66..16163.37 rows=45816 width=527) (actual time=111.828..207.605 rows=45816 loops=1)
                           Output: cfiles.property_values, datasets.id, datasets.name
                           Workers Planned: 2
                           Workers Launched: 2
                           ->  Merge Join  (cost=9585.63..9875.04 rows=19090 width=527) (actual time=100.410..132.173 rows=15272 loops=3)
                                 Output: cfiles.property_values, datasets.id, datasets.name
                                 Inner Unique: true
                                 Merge Cond: (cfiles.dataset_id = datasets.id)
                                 Worker 0: actual time=94.756..119.875 rows=12140 loops=1
                                 Worker 1: actual time=95.064..120.437 rows=12454 loops=1
                                 ->  Sort  (cost=9529.25..9576.97 rows=19090 width=461) (actual time=99.259..114.951 rows=15272 loops=3)
                                       Output: cfiles.property_values, cfiles.dataset_id
                                       Sort Key: cfiles.dataset_id
                                       Sort Method: external merge  Disk: 10192kB
                                       Worker 0:  Sort Method: external merge  Disk: 5568kB
                                       Worker 1:  Sort Method: external merge  Disk: 5592kB
                                       Worker 0: actual time=93.461..105.574 rows=12140 loops=1
                                       Worker 1: actual time=93.784..105.796 rows=12454 loops=1
                                       ->  Parallel Seq Scan on public.cfiles  (cost=0.00..4188.90 rows=19090 width=461) (actual time=0.028..21.442 rows=15272 loops=3)
                                             Output: cfiles.property_values, cfiles.dataset_id
                                             Worker 0: actual time=0.036..22.118 rows=12140 loops=1
                                             Worker 1: actual time=0.035..22.162 rows=12454 loops=1
                                 ->  Sort  (cost=56.38..57.91 rows=611 width=74) (actual time=1.133..1.334 rows=603 loops=3)
                                       Output: datasets.id, datasets.name
                                       Sort Key: datasets.id
                                       Sort Method: quicksort  Memory: 110kB
                                       Worker 0:  Sort Method: quicksort  Memory: 110kB
                                       Worker 1:  Sort Method: quicksort  Memory: 110kB
                                       Worker 0: actual time=1.272..1.471 rows=611 loops=1
                                       Worker 1: actual time=1.259..1.474 rows=611 loops=1
                                       ->  Seq Scan on public.datasets  (cost=0.00..28.11 rows=611 width=74) (actual time=0.100..0.584 rows=611 loops=3)
                                             Output: datasets.id, datasets.name
                                             Worker 0: actual time=0.155..0.719 rows=611 loops=1
                                             Worker 1: actual time=0.121..0.667 rows=611 loops=1
                     ->  Function Scan on pg_catalog.jsonb_array_elements_text sn  (cost=0.01..1.00 rows=100 width=32) (actual time=0.051..0.067 rows=72 loops=45816)
                           Output: sn.sample_names
                           Function Call: jsonb_array_elements_text((cfiles.property_values -> 'Sample Names'::text))
 Planning Time: 0.894 ms
 Execution Time: 60972.185 ms
(50 rows)

Update 2: the query from @bobflux below gets it down to ~9ms! Query plan:

                                                                               QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=25228.68..25228.73 rows=20 width=72) (actual time=8.166..8.177 rows=14 loops=1)
   Output: ds_1.dataset_id, ds.dataset_name, (string_agg(DISTINCT (jsonb_array_elements_text((cfiles.property_values -> 'Sample Names'::text))), '; '::text ORDER BY (jsonb_array_elements_text((cfiles.property_values -> 'Sample Names'::text)))))
   Sort Key: ds.dataset_name
   Sort Method: quicksort  Memory: 26kB
   CTE ds
     ->  Limit  (cost=0.69..16.76 rows=20 width=74) (actual time=0.059..0.313 rows=20 loops=1)
           Output: datasets.id, datasets.name
           ->  Nested Loop Semi Join  (cost=0.69..488.56 rows=607 width=74) (actual time=0.057..0.302 rows=20 loops=1)
                 Output: datasets.id, datasets.name
                 ->  Index Only Scan using datasets_name_id on public.datasets  (cost=0.28..137.44 rows=611 width=74) (actual time=0.028..0.062 rows=20 loops=1)
                       Output: datasets.name, datasets.id
                       Heap Fetches: 20
                 ->  Index Only Scan using index_cfiles_dataset_id_path on public.cfiles cfiles_1  (cost=0.41..5.79 rows=75 width=8) (actual time=0.010..0.010 rows=1 loops=20)
                       Output: cfiles_1.dataset_id, cfiles_1.path
                       Index Cond: (cfiles_1.dataset_id = datasets.id)
                       Heap Fetches: 0
   ->  Hash Join  (cost=24073.53..25211.48 rows=20 width=72) (actual time=7.261..8.025 rows=14 loops=1)
         Output: ds_1.dataset_id, ds.dataset_name, (string_agg(DISTINCT (jsonb_array_elements_text((cfiles.property_values -> 'Sample Names'::text))), '; '::text ORDER BY (jsonb_array_elements_text((cfiles.property_values -> 'Sample Names'::text)))))
         Hash Cond: (ds_1.dataset_id = ds.dataset_id)
         ->  GroupAggregate  (cost=24072.88..25207.88 rows=200 width=40) (actual time=6.862..7.602 rows=14 loops=1)
               Output: ds_1.dataset_id, string_agg(DISTINCT (jsonb_array_elements_text((cfiles.property_values -> 'Sample Names'::text))), '; '::text ORDER BY (jsonb_array_elements_text((cfiles.property_values -> 'Sample Names'::text))))
               Group Key: ds_1.dataset_id
               ->  Sort  (cost=24072.88..24450.38 rows=151000 width=40) (actual time=6.688..6.744 rows=259 loops=1)
                     Output: ds_1.dataset_id, (jsonb_array_elements_text((cfiles.property_values -> 'Sample Names'::text)))
                     Sort Key: ds_1.dataset_id
                     Sort Method: quicksort  Memory: 44kB
                     ->  ProjectSet  (cost=0.41..5443.72 rows=151000 width=40) (actual time=4.419..6.469 rows=259 loops=1)
                           Output: ds_1.dataset_id, jsonb_array_elements_text((cfiles.property_values -> 'Sample Names'::text))
                           ->  Nested Loop  (cost=0.41..4673.62 rows=1510 width=459) (actual time=0.028..4.285 rows=1749 loops=1)
                                 Output: cfiles.property_values, ds_1.dataset_id
                                 ->  CTE Scan on ds ds_1  (cost=0.00..0.40 rows=20 width=8) (actual time=0.001..0.012 rows=20 loops=1)
                                       Output: ds_1.dataset_id, ds_1.dataset_name
                                 ->  Index Scan using index_cfiles_dataset_id_path on public.cfiles  (cost=0.41..232.91 rows=75 width=459) (actual time=0.012..0.129 rows=87 loops=20)
                                       Output: cfiles.id, cfiles.tid, cfiles.uuid, cfiles.dataset_id, cfiles.path, cfiles.name, cfiles.checksum, cfiles.size, cfiles.last_modified, cfiles.content_type, cfiles.locked, cfiles.property_values, cfiles.created_at, cfiles.updated_at
                                       Index Cond: (cfiles.dataset_id = ds_1.dataset_id)
         ->  Hash  (cost=0.40..0.40 rows=20 width=40) (actual time=0.382..0.383 rows=20 loops=1)
               Output: ds.dataset_name, ds.dataset_id
               Buckets: 1024  Batches: 1  Memory Usage: 10kB
               ->  CTE Scan on ds  (cost=0.00..0.40 rows=20 width=40) (actual time=0.067..0.356 rows=20 loops=1)
                     Output: ds.dataset_name, ds.dataset_id
 Planning Time: 1.781 ms
 Execution Time: 8.386 ms
(42 rows)

DISTINCT in aggregate functions is not one of PostgreSQL's strong points.

Perhaps this will perform better:

SELECT id, name,
       string_agg(sample_names, '; ' ORDER BY sample_names) AS sample_names
FROM (SELECT DISTINCT datasets.id, datasets.name, sn.sample_names
      FROM cfiles
         CROSS JOIN jsonb_array_elements_text(
                       cfiles.property_values -> 'Sample Names'
                    ) AS sn(sample_names)
         JOIN datasets on cfiles.dataset_id = datasets.id
     ) AS q
GROUP BY id, name
ORDER BY name
LIMIT 20;

Let's create test data on PostgreSQL 13 with 600 datasets and 45k cfiles.

BEGIN;

CREATE TABLE cfiles (
 id SERIAL PRIMARY KEY, 
 dataset_id INTEGER NOT NULL,
 property_values jsonb NOT NULL);

INSERT INTO cfiles (dataset_id,property_values)
 SELECT 1+(random()*600)::INTEGER  AS did, 
   ('{"Sample Names": ["'||array_to_string(array_agg(DISTINCT prop),'","')||'"]}')::jsonb prop 
   FROM (
     SELECT 1+(random()*45000)::INTEGER AS cid,
     'Samp'||(power(random(),2)*30)::INTEGER AS prop 
     FROM generate_series(1,45000*4)) foo 
   GROUP BY cid;

COMMIT;
CREATE TABLE datasets ( id INTEGER PRIMARY KEY, name TEXT NOT NULL );
INSERT INTO datasets SELECT n, 'dataset'||n FROM (SELECT DISTINCT dataset_id n FROM cfiles) foo;
CREATE INDEX cfiles_dataset ON cfiles(dataset_id);
VACUUM ANALYZE cfiles;
VACUUM ANALYZE datasets;
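
A quick sanity check on the generated data (a sketch; exact counts vary per run because the generator uses random()):

SELECT count(*) FROM datasets;   -- roughly 600 rows
SELECT count(*) FROM cfiles;     -- roughly 45000 rows
SELECT avg(jsonb_array_length(property_values -> 'Sample Names'))
FROM cfiles;                     -- a few sample names per row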

Your original query is much faster here, but that is probably because postgres 13 is smarter.

 Sort  (cost=114127.87..114129.37 rows=601 width=46) (actual time=658.943..659.012 rows=601 loops=1)
   Sort Key: datasets.name
   Sort Method: quicksort  Memory: 334kB
   ->  GroupAggregate  (cost=0.57..114100.13 rows=601 width=46) (actual time=13.954..655.916 rows=601 loops=1)
         Group Key: datasets.id
         ->  Nested Loop  (cost=0.57..92009.62 rows=4416600 width=46) (actual time=13.373..360.991 rows=163540 loops=1)
               ->  Merge Join  (cost=0.56..3677.61 rows=44166 width=78) (actual time=13.350..113.567 rows=44166 loops=1)
                     Merge Cond: (cfiles.dataset_id = datasets.id)
                     ->  Index Scan using cfiles_dataset on cfiles  (cost=0.29..3078.75 rows=44166 width=68) (actual time=0.015..69.098 rows=44166 loops=1)
                     ->  Index Scan using datasets_pkey on datasets  (cost=0.28..45.29 rows=601 width=14) (actual time=0.024..0.580 rows=601 loops=1)
               ->  Function Scan on jsonb_array_elements_text sn  (cost=0.01..1.00 rows=100 width=32) (actual time=0.003..0.004 rows=4 loops=44166)
 Execution Time: 661.978 ms

This query reads the big table (cfiles) first, and the aggregation produces far fewer rows. So it is faster to join to datasets after the number of rows to join has been reduced, not before. Let's move the join. I also got rid of the unnecessary CROSS JOIN: when there is a set-returning function in the SELECT, postgres does what you want for free.

SELECT dataset_id, d.name, sample_names FROM (
 SELECT dataset_id, string_agg(sn, '; ') as sample_names FROM (
  SELECT DISTINCT dataset_id,
   jsonb_array_elements_text(cfiles.property_values -> 'Sample Names') AS sn
   FROM cfiles
   ) f GROUP BY dataset_id
  )g JOIN datasets d ON (d.id=g.dataset_id)
 ORDER BY d.name;
                                                                   QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=536207.44..536207.94 rows=200 width=46) (actual time=264.435..264.502 rows=601 loops=1)
   Sort Key: d.name
   Sort Method: quicksort  Memory: 334kB
   ->  Hash Join  (cost=536188.20..536199.79 rows=200 width=46) (actual time=261.404..261.784 rows=601 loops=1)
         Hash Cond: (d.id = cfiles.dataset_id)
         ->  Seq Scan on datasets d  (cost=0.00..10.01 rows=601 width=14) (actual time=0.025..0.124 rows=601 loops=1)
         ->  Hash  (cost=536185.70..536185.70 rows=200 width=36) (actual time=261.361..261.363 rows=601 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 170kB
               ->  HashAggregate  (cost=536181.20..536183.70 rows=200 width=36) (actual time=260.805..261.054 rows=601 loops=1)
                     Group Key: cfiles.dataset_id
                     Batches: 1  Memory Usage: 1081kB
                     ->  HashAggregate  (cost=409982.82..507586.70 rows=1906300 width=36) (actual time=244.419..253.094 rows=18547 loops=1)
                           Group Key: cfiles.dataset_id, jsonb_array_elements_text((cfiles.property_values -> 'Sample Names'::text))
                           Planned Partitions: 4  Batches: 1  Memory Usage: 13329kB
                           ->  ProjectSet  (cost=0.00..23530.32 rows=4416600 width=36) (actual time=0.030..159.741 rows=163540 loops=1)
                                 ->  Seq Scan on cfiles  (cost=0.00..1005.66 rows=44166 width=68) (actual time=0.006..9.588 rows=44166 loops=1)
 Planning Time: 0.247 ms
 Execution Time: 269.362 ms

That's better. But I see a LIMIT in your query, which means you are probably doing something like pagination. In that case, computing the whole query over the whole cfiles table and then throwing away most of the results because of the LIMIT is only necessary if the result of that big query can change whether a row from datasets appears in the final result or not. And that is the case here: a row in datasets with no corresponding cfiles rows will not appear in the final result, which means the contents of cfiles influence the pagination. Well, we can always cheat: to know whether a row from datasets must be included, all that is needed is that one row from cfiles exists with that id...

So, to know which rows from datasets will end up in the final result, we can use either of these two queries:

SELECT id FROM datasets WHERE EXISTS( SELECT * FROM cfiles WHERE cfiles.dataset_id = datasets.id )
ORDER BY name LIMIT 20;

SELECT dataset_id FROM 
  (SELECT id AS dataset_id, name AS dataset_name FROM datasets ORDER BY dataset_name) f1
  WHERE EXISTS( SELECT * FROM cfiles WHERE cfiles.dataset_id = f1.dataset_id )
  ORDER BY dataset_name
  LIMIT 20;

These take around 2-3ms. We can also cheat:

CREATE INDEX datasets_name_id ON datasets(name,id);

That gets it down to around 300 microseconds. So now we have the list of dataset_id values that will actually be used (rather than thrown away), and we can use it to run the big slow aggregation only on the rows that will actually be in the final result, which should save a large amount of unnecessary work...

WITH ds AS (SELECT id AS dataset_id, name AS dataset_name
 FROM datasets WHERE EXISTS( SELECT * FROM cfiles WHERE cfiles.dataset_id = datasets.id )
 ORDER BY name LIMIT 20)

SELECT dataset_id, dataset_name, sample_names FROM (
 SELECT dataset_id, string_agg(DISTINCT sn, '; ' ORDER BY sn) as sample_names FROM (
  SELECT dataset_id, 
   jsonb_array_elements_text(cfiles.property_values -> 'Sample Names') AS sn 
   FROM ds JOIN cfiles USING (dataset_id)
  ) g GROUP BY dataset_id
  ) h JOIN ds USING (dataset_id)
 ORDER BY dataset_name;

This takes about 30ms, and I also ordered by sample_name, which I had forgotten earlier. It should work for your case. The important point is that the query time no longer depends on the size of the cfiles table, since it only processes the rows that are actually needed.
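
If this really is pagination, subsequent pages only need a different slice inside the cheap CTE; a sketch, assuming plain OFFSET-based pagination:

WITH ds AS (SELECT id AS dataset_id, name AS dataset_name
 FROM datasets WHERE EXISTS( SELECT * FROM cfiles WHERE cfiles.dataset_id = datasets.id )
 ORDER BY name LIMIT 20 OFFSET 20)   -- page 2; only this CTE changes per page

SELECT dataset_id, dataset_name, sample_names FROM (
 SELECT dataset_id, string_agg(DISTINCT sn, '; ' ORDER BY sn) as sample_names FROM (
  SELECT dataset_id,
   jsonb_array_elements_text(cfiles.property_values -> 'Sample Names') AS sn
   FROM ds JOIN cfiles USING (dataset_id)
  ) g GROUP BY dataset_id
  ) h JOIN ds USING (dataset_id)
 ORDER BY dataset_name;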

Please post your results ;)