Optimizing GROUP BY of jsonb array values in Postgres

(Updates below, including results with better hardware settings.)

I have the following simplified schema in Postgres 12, where a dataset has many cfiles and each cfile has property_values stored as jsonb:

SELECT * FROM cfiles;
 id | dataset_id |                property_values (jsonb)                
----+------------+-----------------------------------------------
  1 |          1 | {"Sample Names": ["SampA", "SampB"], (...other properties)}
  2 |          1 | {"Sample Names": ["SampA", "SampC"], (...other properties)}
  3 |          1 | {"Sample Names": ["SampD"],          (...other properties)}

Some of the "Sample Names" arrays are large and can contain up to 100 short strings (~1K characters in total).
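The grouping below relies on `jsonb_array_elements_text`, which unnests a jsonb array into one row of text per element; a standalone illustration:

```sql
-- Each array element becomes its own row, so values can be grouped across rows.
SELECT jsonb_array_elements_text('["SampA", "SampB"]'::jsonb) AS sample_name;
-- sample_name
-- -----------
-- SampA
-- SampB
```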

I am using the query below to group by each value in the "Sample Names" jsonb arrays. The query gives the results I want, but it takes about 15 seconds on ~45K rows (EXPLAIN query plan at the bottom), running on an AWS t2.micro with 1 vCPU and 1 GB of RAM.

SELECT
jsonb_array_elements_text(property_values -> 'Sample Names') as sample_names,
max(cfiles.last_modified) as last_modified,
string_agg(DISTINCT(users.email), ', ') as user_email,
string_agg(DISTINCT(groups.name), ', ') as group_name
FROM cfiles
JOIN datasets ON cfiles.dataset_id=datasets.id
LEFT JOIN user_permissions ON (user_permissions.cfile_id=cfiles.id OR user_permissions.dataset_id=datasets.id)
LEFT JOIN users on users.id=user_permissions.user_id
LEFT JOIN group_permissions ON (group_permissions.cfile_id=cfiles.id OR group_permissions.dataset_id=datasets.id)
LEFT JOIN groups ON groups.id=group_permissions.group_id
LEFT JOIN user_groups ON groups.id=user_groups.group_id
WHERE
cfiles.tid=5
-- Query needs to support sample name filtering eg:
-- AND property_values ->> 'Sample Names' like '%test%'
-- As well as filtering by columns from the other joined tables
GROUP BY sample_names
-- Below is required for pagination
ORDER BY "sample_names" desc NULLS LAST

All of the id columns and join columns are indexed.

I have tried paring the query down further, but found it tricky: it is hard to correlate each incremental reduction with an improvement, and my end result needs to include everything.

Just cfiles on its own, without any joins, still takes about 12 seconds:

SELECT
jsonb_array_elements_text(property_values -> 'Sample Names') as sample_names,
max(cfiles.last_modified) as last_modified
FROM cfiles
WHERE
cfiles.tid=5
GROUP BY sample_names
ORDER BY "sample_names" desc NULLS LAST
LIMIT 20
OFFSET 0

I tried the more CTE-like pattern below, but it was slower and did not produce correct results:

WITH cf AS (
  SELECT
  cfiles.id as id,
  cfiles.dataset_id as dataset_id,
  jsonb_array_elements_text(property_values -> 'Sample Names') as sample_names,
  cfiles.last_modified as last_modified
  FROM cfiles
  WHERE
  cfiles.tid=5
  GROUP BY sample_names, id, dataset_id, last_modified
  ORDER BY "sample_names" desc NULLS LAST
  LIMIT 20
  OFFSET 0
)

SELECT cf.sample_names as sample_names,
max(cf.last_modified) as last_modified,
string_agg(DISTINCT(users.email), ', ') as user_email,
string_agg(DISTINCT(groups.name), ', ') as group_name
FROM cf
JOIN datasets ON cf.dataset_id=datasets.id
LEFT JOIN user_permissions ON (user_permissions.cfile_id=cf.id OR user_permissions.dataset_id=datasets.id)
LEFT JOIN users on users.id=user_permissions.user_id
LEFT JOIN group_permissions ON (group_permissions.cfile_id=cf.id OR group_permissions.dataset_id=datasets.id)
LEFT JOIN groups ON groups.id=group_permissions.group_id
LEFT JOIN user_groups ON groups.id=user_groups.group_id
GROUP BY sample_names
ORDER BY "sample_names" desc NULLS LAST

Is there any other way to improve the performance of this query, ideally down to a couple of seconds? I am happy to rearrange it, or to use temporary tables and indexes, whichever is most effective.

Would increasing the RAM help?

Update: verbose EXPLAIN ANALYZE:

                                                          QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
  Limit  (cost=1046905.99..1046907.17 rows=20 width=104) (actual time=15394.929..15409.256 rows=20 loops=1)
   Output: (jsonb_array_elements_text((cfiles.property_values -> 'Sample Names'::text))), (max(cfiles.last_modified)), (string_agg(DISTINCT (users.email)::text, ', '::text)), (string_agg(DISTINCT (groups.name)::text, ', '::text))
   ->  GroupAggregate  (cost=1046905.99..1130738.74 rows=1426200 width=104) (actual time=15394.927..15409.228 rows=20 loops=1)
         Output: (jsonb_array_elements_text((cfiles.property_values -> 'Sample Names'::text))), max(cfiles.last_modified), string_agg(DISTINCT (users.email)::text, ', '::text), string_agg(DISTINCT (groups.name)::text, ', '::text)
         Group Key: (jsonb_array_elements_text((cfiles.property_values -> 'Sample Names'::text)))
         ->  Sort  (cost=1046905.99..1057967.74 rows=4424700 width=104) (actual time=15394.877..15400.483 rows=11067 loops=1)
               Output: (jsonb_array_elements_text((cfiles.property_values -> 'Sample Names'::text))), cfiles.last_modified, users.email, groups.name
               Sort Key: (jsonb_array_elements_text((cfiles.property_values -> 'Sample Names'::text))) DESC NULLS LAST
               Sort Method: external merge  Disk: 163288kB
               ->  ProjectSet  (cost=41.40..74530.20 rows=4424700 width=104) (actual time=0.682..2933.628 rows=3399832 loops=1)
                     Output: jsonb_array_elements_text((cfiles.property_values -> 'Sample Names'::text)), cfiles.last_modified, users.email, groups.name
                     ->  Nested Loop Left Join  (cost=41.40..51964.23 rows=44247 width=526) (actual time=0.587..1442.326 rows=46031 loops=1)
                           Output: cfiles.property_values, cfiles.last_modified, users.email, groups.name
                           Join Filter: ((user_permissions.cfile_id = cfiles.id) OR (user_permissions.dataset_id = datasets.id))
                           Rows Removed by Join Filter: 2391425
                           ->  Nested Loop Left Join  (cost=38.55..11694.81 rows=44247 width=510) (actual time=0.473..357.751 rows=46016 loops=1)
                                 Output: cfiles.property_values, cfiles.last_modified, cfiles.id, datasets.id, groups.name
                                 Join Filter: ((group_permissions.cfile_id = cfiles.id) OR (group_permissions.dataset_id = datasets.id))
                                 Rows Removed by Join Filter: 616311
                                 ->  Hash Join  (cost=35.81..4721.54 rows=44247 width=478) (actual time=0.388..50.189 rows=44255 loops=1)
                                       Output: cfiles.property_values, cfiles.last_modified, cfiles.id, datasets.id
                                       Inner Unique: true
                                       Hash Cond: (cfiles.dataset_id = datasets.id)
                                       ->  Seq Scan on public.cfiles  (cost=0.00..4568.77 rows=44247 width=478) (actual time=0.012..20.676 rows=44255 loops=1)
                                             Output: cfiles.id, cfiles.tid, cfiles.uuid, cfiles.dataset_id, cfiles.path, cfiles.name, cfiles.checksum, cfiles.size, cfiles.last_modified, cfiles.content_type, cfiles.locked, cfiles.property_values, cfiles.created_at, cfiles.updated_at
                                             Filter: (cfiles.tid = 5)
                                             Rows Removed by Filter: 1567
                                       ->  Hash  (cost=28.14..28.14 rows=614 width=8) (actual time=0.363..0.363 rows=614 loops=1)
                                             Output: datasets.id
                                             Buckets: 1024  Batches: 1  Memory Usage: 32kB
                                             ->  Seq Scan on public.datasets  (cost=0.00..28.14 rows=614 width=8) (actual time=0.004..0.194 rows=614 loops=1)
                                                   Output: datasets.id
                                 ->  Materialize  (cost=2.74..4.39 rows=9 width=48) (actual time=0.000..0.003 rows=14 loops=44255)
                                       Output: group_permissions.cfile_id, group_permissions.dataset_id, groups.name
                                       ->  Hash Right Join  (cost=2.74..4.35 rows=9 width=48) (actual time=0.049..0.071 rows=14 loops=1)
                                             Output: group_permissions.cfile_id, group_permissions.dataset_id, groups.name
                                             Hash Cond: (user_groups.group_id = groups.id)
                                             ->  Seq Scan on public.user_groups  (cost=0.00..1.38 rows=38 width=8) (actual time=0.003..0.011 rows=38 loops=1)
                                                   Output: user_groups.id, user_groups.tid, user_groups.user_id, user_groups.group_id, user_groups.created_at, user_groups.updated_at
                                             ->  Hash  (cost=2.62..2.62 rows=9 width=56) (actual time=0.039..0.039 rows=9 loops=1)
                                                   Output: group_permissions.cfile_id, group_permissions.dataset_id, groups.name, groups.id
                                                   Buckets: 1024  Batches: 1  Memory Usage: 9kB
                                                   ->  Hash Right Join  (cost=1.20..2.62 rows=9 width=56) (actual time=0.021..0.035 rows=9 loops=1)
                                                         Output: group_permissions.cfile_id, group_permissions.dataset_id, groups.name, groups.id
                                                         Hash Cond: (groups.id = group_permissions.group_id)
                                                         ->  Seq Scan on public.groups  (cost=0.00..1.24 rows=24 width=40) (actual time=0.003..0.008 rows=24 loops=1)
                                                               Output: groups.id, groups.tid, groups.name, groups.description, groups.default_uview, groups.created_at, groups.updated_at
                                                         ->  Hash  (cost=1.09..1.09 rows=9 width=24) (actual time=0.010..0.010 rows=9 loops=1)
                                                               Output: group_permissions.cfile_id, group_permissions.dataset_id, group_permissions.group_id
                                                               Buckets: 1024  Batches: 1  Memory Usage: 9kB
                                                               ->  Seq Scan on public.group_permissions  (cost=0.00..1.09 rows=9 width=24) (actual time=0.003..0.006 rows=9 loops=1)
                                                                     Output: group_permissions.cfile_id, group_permissions.dataset_id, group_permissions.group_id
                           ->  Materialize  (cost=2.85..4.78 rows=52 width=48) (actual time=0.000..0.010 rows=52 loops=46016)
                                 Output: user_permissions.cfile_id, user_permissions.dataset_id, users.email
                                 ->  Hash Left Join  (cost=2.85..4.52 rows=52 width=48) (actual time=0.037..0.076 rows=52 loops=1)
                                       Output: user_permissions.cfile_id, user_permissions.dataset_id, users.email
                                       Inner Unique: true
                                       Hash Cond: (user_permissions.user_id = users.id)
                                       ->  Seq Scan on public.user_permissions  (cost=0.00..1.52 rows=52 width=24) (actual time=0.003..0.014 rows=52 loops=1)
                                             Output: user_permissions.id, user_permissions.tid, user_permissions.user_id, user_permissions.dataset_id, user_permissions.cfile_id, user_permissions.read, user_permissions.share, user_permissions.write_meta, user_permissions.manage_files, user_permissions.delete_files, user_permissions.task_id, user_permissions.notified_at, user_permissions.first_downloaded_at, user_permissions.created_at, user_permissions.updated_at
                                       ->  Hash  (cost=2.38..2.38 rows=38 width=40) (actual time=0.029..0.030 rows=38 loops=1)
                                             Output: users.email, users.id
                                             Buckets: 1024  Batches: 1  Memory Usage: 11kB
                                             ->  Seq Scan on public.users  (cost=0.00..2.38 rows=38 width=40) (actual time=0.004..0.016 rows=38 loops=1)
                                                   Output: users.email, users.id
 Planning Time: 0.918 ms
 Execution Time: 15442.689 ms
(67 rows)

Time: 15575.174 ms (00:15.575)

Update 2

After increasing work_mem to 256MB, the cfiles query without any joins dropped from 12 seconds to 2 seconds, but the full query still takes 11 seconds; new plan below.

 Limit  (cost=1197580.62..1197582.26 rows=20 width=104) (actual time=11049.784..11057.060 rows=20 loops=1)
   Output: (jsonb_array_elements_text((cfiles.property_values -> 'Sample Names'::text))), (max(cfiles.last_modified)), (string_agg(DISTINCT (users.email)::text, ', '::text)), (string_agg(DISTINCT (groups.name)::text, ', '::text))
   ->  GroupAggregate  (cost=1197580.62..1313691.62 rows=1423800 width=104) (actual time=11049.783..11057.056 rows=20 loops=1)
         Output: (jsonb_array_elements_text((cfiles.property_values -> 'Sample Names'::text))), max(cfiles.last_modified), string_agg(DISTINCT (users.email)::text, ', '::text), string_agg(DISTINCT (groups.name)::text, ', '::text)
         Group Key: (jsonb_array_elements_text((cfiles.property_values -> 'Sample Names'::text)))
         ->  Sort  (cost=1197580.62..1215107.62 rows=7010800 width=80) (actual time=11049.741..11051.064 rows=11067 loops=1)
               Output: (jsonb_array_elements_text((cfiles.property_values -> 'Sample Names'::text))), cfiles.last_modified, users.email, groups.name
               Sort Key: (jsonb_array_elements_text((cfiles.property_values -> 'Sample Names'::text))) DESC NULLS LAST
               Sort Method: external merge  Disk: 163248kB
               ->  ProjectSet  (cost=39.41..88894.93 rows=7010800 width=80) (actual time=0.314..1309.381 rows=3399832 loops=1)
                     Output: jsonb_array_elements_text((cfiles.property_values -> 'Sample Names'::text)), cfiles.last_modified, users.email, groups.name
                     ->  Hash Left Join  (cost=39.41..53139.85 rows=70108 width=502) (actual time=0.230..413.324 rows=46031 loops=1)
                           Output: cfiles.property_values, cfiles.last_modified, users.email, groups.name
                           Hash Cond: (groups.id = user_groups.group_id)
                           ->  Nested Loop Left Join  (cost=37.55..51994.13 rows=44279 width=510) (actual time=0.218..405.437 rows=44535 loops=1)
                                 Output: cfiles.property_values, cfiles.last_modified, users.email, groups.name, groups.id
                                 Join Filter: ((user_permissions.cfile_id = cfiles.id) OR (user_permissions.dataset_id = datasets.id))
                                 Rows Removed by Join Filter: 2313897
                                 ->  Nested Loop Left Join  (cost=34.70..11695.59 rows=44279 width=500) (actual time=0.166..99.664 rows=44520 loops=1)
                                       Output: cfiles.property_values, cfiles.last_modified, cfiles.id, datasets.id, groups.name, groups.id
                                       Join Filter: ((group_permissions.cfile_id = cfiles.id) OR (group_permissions.dataset_id = datasets.id))
                                       Rows Removed by Join Filter: 396532
                                       ->  Hash Join  (cost=33.16..4718.97 rows=44279 width=478) (actual time=0.141..30.449 rows=44255 loops=1)
                                             Output: cfiles.property_values, cfiles.last_modified, cfiles.id, datasets.id
                                             Inner Unique: true
                                             Hash Cond: (cfiles.dataset_id = datasets.id)
                                             ->  Seq Scan on public.cfiles  (cost=0.00..4568.77 rows=44279 width=478) (actual time=0.016..12.724 rows=44255 loops=1)
                                                   Output: cfiles.id, cfiles.tid, cfiles.uuid, cfiles.dataset_id, cfiles.path, cfiles.name, cfiles.checksum, cfiles.size, cfiles.last_modified, cfiles.content_type, cfiles.locked, cfiles.property_values, cfiles.created_at, cfiles.updated_at
                                                   Filter: (cfiles.tid = 5)
                                                   Rows Removed by Filter: 1567
                                             ->  Hash  (cost=25.48..25.48 rows=614 width=8) (actual time=0.119..0.120 rows=614 loops=1)
                                                   Output: datasets.id
                                                   Buckets: 1024  Batches: 1  Memory Usage: 32kB
                                                   ->  Index Only Scan using datasets_pkey on public.datasets  (cost=0.28..25.48 rows=614 width=8) (actual time=0.010..0.057 rows=614 loops=1)
                                                         Output: datasets.id
                                                         Heap Fetches: 0
                                       ->  Materialize  (cost=1.54..2.70 rows=9 width=38) (actual time=0.000..0.001 rows=9 loops=44255)
                                             Output: group_permissions.cfile_id, group_permissions.dataset_id, groups.name, groups.id
                                             ->  Hash Left Join  (cost=1.54..2.65 rows=9 width=38) (actual time=0.015..0.021 rows=9 loops=1)
                                                   Output: group_permissions.cfile_id, group_permissions.dataset_id, groups.name, groups.id
                                                   Inner Unique: true
                                                   Hash Cond: (group_permissions.group_id = groups.id)
                                                   ->  Seq Scan on public.group_permissions  (cost=0.00..1.09 rows=9 width=24) (actual time=0.002..0.003 rows=9 loops=1)
                                                         Output: group_permissions.id, group_permissions.tid, group_permissions.group_id, group_permissions.dataset_id, group_permissions.cfile_id, group_permissions.read, group_permissions.share, group_permissions.write_meta, group_permissions.manage_files, group_permissions.delete_files, group_permissions.task_id, group_permissions.notified_at, group_permissions.first_downloaded_at, group_permissions.created_at, group_permissions.updated_at
                                                   ->  Hash  (cost=1.24..1.24 rows=24 width=22) (actual time=0.009..0.010 rows=24 loops=1)
                                                         Output: groups.name, groups.id
                                                         Buckets: 1024  Batches: 1  Memory Usage: 10kB
                                                         ->  Seq Scan on public.groups  (cost=0.00..1.24 rows=24 width=22) (actual time=0.003..0.006 rows=24 loops=1)
                                                               Output: groups.name, groups.id
                                 ->  Materialize  (cost=2.85..4.78 rows=52 width=42) (actual time=0.000..0.003 rows=52 loops=44520)
                                       Output: user_permissions.cfile_id, user_permissions.dataset_id, users.email
                                       ->  Hash Left Join  (cost=2.85..4.52 rows=52 width=42) (actual time=0.021..0.036 rows=52 loops=1)
                                             Output: user_permissions.cfile_id, user_permissions.dataset_id, users.email
                                             Inner Unique: true
                                             Hash Cond: (user_permissions.user_id = users.id)
                                             ->  Seq Scan on public.user_permissions  (cost=0.00..1.52 rows=52 width=24) (actual time=0.002..0.005 rows=52 loops=1)
                                                   Output: user_permissions.id, user_permissions.tid, user_permissions.user_id, user_permissions.dataset_id, user_permissions.cfile_id, user_permissions.read, user_permissions.share, user_permissions.write_meta, user_permissions.manage_files, user_permissions.delete_files, user_permissions.task_id, user_permissions.notified_at, user_permissions.first_downloaded_at, user_permissions.created_at, user_permissions.updated_at
                                             ->  Hash  (cost=2.38..2.38 rows=38 width=34) (actual time=0.016..0.017 rows=38 loops=1)
                                                   Output: users.email, users.id
                                                   Buckets: 1024  Batches: 1  Memory Usage: 11kB
                                                   ->  Seq Scan on public.users  (cost=0.00..2.38 rows=38 width=34) (actual time=0.002..0.010 rows=38 loops=1)
                                                         Output: users.email, users.id
                           ->  Hash  (cost=1.38..1.38 rows=38 width=8) (actual time=0.009..0.010 rows=38 loops=1)
                                 Output: user_groups.group_id
                                 Buckets: 1024  Batches: 1  Memory Usage: 10kB
                                 ->  Seq Scan on public.user_groups  (cost=0.00..1.38 rows=38 width=8) (actual time=0.002..0.005 rows=38 loops=1)
                                       Output: user_groups.group_id
 Planning Time: 0.990 ms
 Execution Time: 11081.013 ms
(69 rows)

Time: 11143.367 ms (00:11.143)
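For reference, the work_mem change from Update 2 can be applied per session, or persisted across restarts:

```sql
SET work_mem = '256MB';               -- current session only

-- Or persist it (requires superuser; takes effect after a config reload):
ALTER SYSTEM SET work_mem = '256MB';
SELECT pg_reload_conf();
```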

I think you will have a hard time doing much better with your current schema. Can you normalize the data, so that you have a table with one row per (tid, "Sample Names", id) combination, or one row per unique "Sample Names" value, or one per (tid, "Sample Names")?
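A minimal sketch of that normalization, assuming a hypothetical side table `cfile_sample_names` that would have to be kept in sync with `cfiles` (all names here are illustrative, not from the original post):

```sql
-- Hypothetical lookup table: one row per (cfile, sample name) pair.
CREATE TABLE cfile_sample_names (
    cfile_id    bigint NOT NULL REFERENCES cfiles (id),
    tid         bigint NOT NULL,
    sample_name text   NOT NULL,
    PRIMARY KEY (cfile_id, sample_name)
);

-- One-off backfill from the existing jsonb column.
INSERT INTO cfile_sample_names (cfile_id, tid, sample_name)
SELECT c.id, c.tid, s.name
FROM cfiles c
CROSS JOIN LATERAL
    jsonb_array_elements_text(c.property_values -> 'Sample Names') AS s(name)
ON CONFLICT DO NOTHING;

-- Lets GROUP BY / ORDER BY sample_name per tenant read an index
-- instead of unnesting jsonb and sorting on disk.
CREATE INDEX ON cfile_sample_names (tid, sample_name DESC);
```

The aggregate query could then join `cfile_sample_names` instead of calling `jsonb_array_elements_text`, so the sort feeding the GroupAggregate can come from the index rather than an external merge on disk.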

I don't think there is a general answer for "as well as filtering by columns from the other joined tables", though. The answer will depend on how selective each filter is and whether it is indexable.
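For example, the commented-out `like '%test%'` filter cannot use a plain btree index, but it becomes indexable with a trigram index, assuming the pg_trgm extension is available and the names live in a plain text column (here the hypothetical normalized table `cfile_sample_names(sample_name text)` from above, used purely as an illustration):

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- GIN trigram index: supports LIKE / ILIKE '%test%' style patterns.
CREATE INDEX ON cfile_sample_names USING gin (sample_name gin_trgm_ops);
```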