Optimizing GROUP BY of jsonb array values in Postgres
Update for better hardware
I have the following simplified schema in Postgres 12, where a dataset has many cfiles, and each cfile has property_values stored as jsonb:
SELECT * FROM cfiles;
id | dataset_id | property_values (jsonb)
----+------------+-----------------------------------------------
1 | 1 | {"Sample Names": ["SampA", "SampB"], (...other properties)}
2 | 1 | {"Sample Names": ["SampA", "SampC"], (...other properties)}
3 | 1 | {"Sample Names": ["SampD"], (...other properties)}
Some of the "Sample Names" arrays are large and can contain up to 100 short strings (~1K characters in total).
I'm using the query below to group by each value in the "Sample Names" jsonb arrays. The query gives the results I want, but takes about 15 seconds on ~45K rows (explain plan at the bottom), running on an AWS t2.micro with 1 vCPU and 1GB RAM:
SELECT
jsonb_array_elements_text(property_values -> 'Sample Names') as sample_names,
max(cfiles.last_modified) as last_modified,
string_agg(DISTINCT(users.email), ', ') as user_email,
string_agg(DISTINCT(groups.name), ', ') as group_name
FROM cfiles
JOIN datasets ON cfiles.dataset_id=datasets.id
LEFT JOIN user_permissions ON (user_permissions.cfile_id=cfiles.id OR user_permissions.dataset_id=datasets.id)
LEFT JOIN users on users.id=user_permissions.user_id
LEFT JOIN group_permissions ON (group_permissions.cfile_id=cfiles.id OR group_permissions.dataset_id=datasets.id)
LEFT JOIN groups ON groups.id=group_permissions.group_id
LEFT JOIN user_groups ON groups.id=user_groups.group_id
WHERE
cfiles.tid=5
-- Query needs to support sample name filtering eg:
-- AND property_values ->> 'Sample Names' like '%test%'
-- As well as filtering by columns from the other joined tables
GROUP BY sample_names
-- Below is required for pagination
ORDER BY "sample_names" desc NULLS LAST
All ID and join columns are indexed.
I've tried paring the query down further, but found it tricky: it's hard to correlate each incremental reduction with an improvement, and my end result needs to include everything anyway.
Just cfiles on its own, without any of the joins, still takes about 12 seconds:
SELECT
jsonb_array_elements_text(property_values -> 'Sample Names') as sample_names,
max(cfiles.last_modified) as last_modified
FROM cfiles
WHERE
cfiles.tid=5
GROUP BY sample_names
ORDER BY "sample_names" desc NULLS LAST
LIMIT 20
OFFSET 0
I tried a more CTE-style pattern below, but it was slower and did not produce the correct results:
WITH cf AS (
SELECT
cfiles.id as id,
cfiles.dataset_id as dataset_id,
jsonb_array_elements_text(property_values -> 'Sample Names') as sample_names,
cfiles.last_modified as last_modified
FROM cfiles
WHERE
cfiles.tid=5
GROUP BY sample_names, id, dataset_id, last_modified
ORDER BY "sample_names" desc NULLS LAST
LIMIT 20
OFFSET 0
)
SELECT cf.sample_names as sample_names,
max(cf.last_modified) as last_modified,
string_agg(DISTINCT(users.email), ', ') as user_email,
string_agg(DISTINCT(groups.name), ', ') as group_name
FROM cf
JOIN datasets ON cf.dataset_id=datasets.id
LEFT JOIN user_permissions ON (user_permissions.cfile_id=cf.id OR user_permissions.dataset_id=datasets.id)
LEFT JOIN users on users.id=user_permissions.user_id
LEFT JOIN group_permissions ON (group_permissions.cfile_id=cf.id OR group_permissions.dataset_id=datasets.id)
LEFT JOIN groups ON groups.id=group_permissions.group_id
LEFT JOIN user_groups ON groups.id=user_groups.group_id
GROUP BY sample_names
ORDER BY "sample_names" desc NULLS LAST
Is there any other way to get this query down to, ideally, a couple of seconds? I can rearrange it, or use temporary tables and indexes, whichever is most effective.
Would increasing the RAM help?
Update, explain analyze verbose:
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=1046905.99..1046907.17 rows=20 width=104) (actual time=15394.929..15409.256 rows=20 loops=1)
Output: (jsonb_array_elements_text((cfiles.property_values -> 'Sample Names'::text))), (max(cfiles.last_modified)), (string_agg(DISTINCT (users.email)::text, ', '::text)), (string_agg(DISTINCT (groups.name)::text, ', '::text))
-> GroupAggregate (cost=1046905.99..1130738.74 rows=1426200 width=104) (actual time=15394.927..15409.228 rows=20 loops=1)
Output: (jsonb_array_elements_text((cfiles.property_values -> 'Sample Names'::text))), max(cfiles.last_modified), string_agg(DISTINCT (users.email)::text, ', '::text), string_agg(DISTINCT (groups.name)::text, ', '::text)
Group Key: (jsonb_array_elements_text((cfiles.property_values -> 'Sample Names'::text)))
-> Sort (cost=1046905.99..1057967.74 rows=4424700 width=104) (actual time=15394.877..15400.483 rows=11067 loops=1)
Output: (jsonb_array_elements_text((cfiles.property_values -> 'Sample Names'::text))), cfiles.last_modified, users.email, groups.name
Sort Key: (jsonb_array_elements_text((cfiles.property_values -> 'Sample Names'::text))) DESC NULLS LAST
Sort Method: external merge Disk: 163288kB
-> ProjectSet (cost=41.40..74530.20 rows=4424700 width=104) (actual time=0.682..2933.628 rows=3399832 loops=1)
Output: jsonb_array_elements_text((cfiles.property_values -> 'Sample Names'::text)), cfiles.last_modified, users.email, groups.name
-> Nested Loop Left Join (cost=41.40..51964.23 rows=44247 width=526) (actual time=0.587..1442.326 rows=46031 loops=1)
Output: cfiles.property_values, cfiles.last_modified, users.email, groups.name
Join Filter: ((user_permissions.cfile_id = cfiles.id) OR (user_permissions.dataset_id = datasets.id))
Rows Removed by Join Filter: 2391425
-> Nested Loop Left Join (cost=38.55..11694.81 rows=44247 width=510) (actual time=0.473..357.751 rows=46016 loops=1)
Output: cfiles.property_values, cfiles.last_modified, cfiles.id, datasets.id, groups.name
Join Filter: ((group_permissions.cfile_id = cfiles.id) OR (group_permissions.dataset_id = datasets.id))
Rows Removed by Join Filter: 616311
-> Hash Join (cost=35.81..4721.54 rows=44247 width=478) (actual time=0.388..50.189 rows=44255 loops=1)
Output: cfiles.property_values, cfiles.last_modified, cfiles.id, datasets.id
Inner Unique: true
Hash Cond: (cfiles.dataset_id = datasets.id)
-> Seq Scan on public.cfiles (cost=0.00..4568.77 rows=44247 width=478) (actual time=0.012..20.676 rows=44255 loops=1)
Output: cfiles.id, cfiles.tid, cfiles.uuid, cfiles.dataset_id, cfiles.path, cfiles.name, cfiles.checksum, cfiles.size, cfiles.last_modified, cfiles.content_type, cfiles.locked, cfiles.property_values, cfiles.created_at, cfiles.updated_at
Filter: (cfiles.tid = 5)
Rows Removed by Filter: 1567
-> Hash (cost=28.14..28.14 rows=614 width=8) (actual time=0.363..0.363 rows=614 loops=1)
Output: datasets.id
Buckets: 1024 Batches: 1 Memory Usage: 32kB
-> Seq Scan on public.datasets (cost=0.00..28.14 rows=614 width=8) (actual time=0.004..0.194 rows=614 loops=1)
Output: datasets.id
-> Materialize (cost=2.74..4.39 rows=9 width=48) (actual time=0.000..0.003 rows=14 loops=44255)
Output: group_permissions.cfile_id, group_permissions.dataset_id, groups.name
-> Hash Right Join (cost=2.74..4.35 rows=9 width=48) (actual time=0.049..0.071 rows=14 loops=1)
Output: group_permissions.cfile_id, group_permissions.dataset_id, groups.name
Hash Cond: (user_groups.group_id = groups.id)
-> Seq Scan on public.user_groups (cost=0.00..1.38 rows=38 width=8) (actual time=0.003..0.011 rows=38 loops=1)
Output: user_groups.id, user_groups.tid, user_groups.user_id, user_groups.group_id, user_groups.created_at, user_groups.updated_at
-> Hash (cost=2.62..2.62 rows=9 width=56) (actual time=0.039..0.039 rows=9 loops=1)
Output: group_permissions.cfile_id, group_permissions.dataset_id, groups.name, groups.id
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Hash Right Join (cost=1.20..2.62 rows=9 width=56) (actual time=0.021..0.035 rows=9 loops=1)
Output: group_permissions.cfile_id, group_permissions.dataset_id, groups.name, groups.id
Hash Cond: (groups.id = group_permissions.group_id)
-> Seq Scan on public.groups (cost=0.00..1.24 rows=24 width=40) (actual time=0.003..0.008 rows=24 loops=1)
Output: groups.id, groups.tid, groups.name, groups.description, groups.default_uview, groups.created_at, groups.updated_at
-> Hash (cost=1.09..1.09 rows=9 width=24) (actual time=0.010..0.010 rows=9 loops=1)
Output: group_permissions.cfile_id, group_permissions.dataset_id, group_permissions.group_id
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Seq Scan on public.group_permissions (cost=0.00..1.09 rows=9 width=24) (actual time=0.003..0.006 rows=9 loops=1)
Output: group_permissions.cfile_id, group_permissions.dataset_id, group_permissions.group_id
-> Materialize (cost=2.85..4.78 rows=52 width=48) (actual time=0.000..0.010 rows=52 loops=46016)
Output: user_permissions.cfile_id, user_permissions.dataset_id, users.email
-> Hash Left Join (cost=2.85..4.52 rows=52 width=48) (actual time=0.037..0.076 rows=52 loops=1)
Output: user_permissions.cfile_id, user_permissions.dataset_id, users.email
Inner Unique: true
Hash Cond: (user_permissions.user_id = users.id)
-> Seq Scan on public.user_permissions (cost=0.00..1.52 rows=52 width=24) (actual time=0.003..0.014 rows=52 loops=1)
Output: user_permissions.id, user_permissions.tid, user_permissions.user_id, user_permissions.dataset_id, user_permissions.cfile_id, user_permissions.read, user_permissions.share, user_permissions.write_meta, user_permissions.manage_files, user_permissions.delete_files, user_permissions.task_id, user_permissions.notified_at, user_permissions.first_downloaded_at, user_permissions.created_at, user_permissions.updated_at
-> Hash (cost=2.38..2.38 rows=38 width=40) (actual time=0.029..0.030 rows=38 loops=1)
Output: users.email, users.id
Buckets: 1024 Batches: 1 Memory Usage: 11kB
-> Seq Scan on public.users (cost=0.00..2.38 rows=38 width=40) (actual time=0.004..0.016 rows=38 loops=1)
Output: users.email, users.id
Planning Time: 0.918 ms
Execution Time: 15442.689 ms
(67 rows)
Time: 15575.174 ms (00:15.575)
Update 2
After increasing work_mem to 256MB, the cfiles query without any joins dropped from 12 seconds to 2 seconds, but the full query still takes 11 seconds. New plan below:
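For reference, a work_mem bump like the one described above can be tried per-session before making it permanent (the database name below is a placeholder):

```sql
-- Raise the per-sort/per-hash memory budget for this session only.
-- With the default 4MB, the big sort in the plan spills to disk
-- ("Sort Method: external merge  Disk: 163288kB").
SET work_mem = '256MB';

-- To persist it for one database instead of server-wide:
-- ALTER DATABASE mydb SET work_mem = '256MB';
```

Note that work_mem is allocated per sort/hash node per backend, so a value this large is only safe with few concurrent connections, which matters on a 1GB instance.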
Limit (cost=1197580.62..1197582.26 rows=20 width=104) (actual time=11049.784..11057.060 rows=20 loops=1)
Output: (jsonb_array_elements_text((cfiles.property_values -> 'Sample Names'::text))), (max(cfiles.last_modified)), (string_agg(DISTINCT (users.email)::text, ', '::text)), (string_agg(DISTINCT (groups.name)::text, ', '::text))
-> GroupAggregate (cost=1197580.62..1313691.62 rows=1423800 width=104) (actual time=11049.783..11057.056 rows=20 loops=1)
Output: (jsonb_array_elements_text((cfiles.property_values -> 'Sample Names'::text))), max(cfiles.last_modified), string_agg(DISTINCT (users.email)::text, ', '::text), string_agg(DISTINCT (groups.name)::text, ', '::text)
Group Key: (jsonb_array_elements_text((cfiles.property_values -> 'Sample Names'::text)))
-> Sort (cost=1197580.62..1215107.62 rows=7010800 width=80) (actual time=11049.741..11051.064 rows=11067 loops=1)
Output: (jsonb_array_elements_text((cfiles.property_values -> 'Sample Names'::text))), cfiles.last_modified, users.email, groups.name
Sort Key: (jsonb_array_elements_text((cfiles.property_values -> 'Sample Names'::text))) DESC NULLS LAST
Sort Method: external merge Disk: 163248kB
-> ProjectSet (cost=39.41..88894.93 rows=7010800 width=80) (actual time=0.314..1309.381 rows=3399832 loops=1)
Output: jsonb_array_elements_text((cfiles.property_values -> 'Sample Names'::text)), cfiles.last_modified, users.email, groups.name
-> Hash Left Join (cost=39.41..53139.85 rows=70108 width=502) (actual time=0.230..413.324 rows=46031 loops=1)
Output: cfiles.property_values, cfiles.last_modified, users.email, groups.name
Hash Cond: (groups.id = user_groups.group_id)
-> Nested Loop Left Join (cost=37.55..51994.13 rows=44279 width=510) (actual time=0.218..405.437 rows=44535 loops=1)
Output: cfiles.property_values, cfiles.last_modified, users.email, groups.name, groups.id
Join Filter: ((user_permissions.cfile_id = cfiles.id) OR (user_permissions.dataset_id = datasets.id))
Rows Removed by Join Filter: 2313897
-> Nested Loop Left Join (cost=34.70..11695.59 rows=44279 width=500) (actual time=0.166..99.664 rows=44520 loops=1)
Output: cfiles.property_values, cfiles.last_modified, cfiles.id, datasets.id, groups.name, groups.id
Join Filter: ((group_permissions.cfile_id = cfiles.id) OR (group_permissions.dataset_id = datasets.id))
Rows Removed by Join Filter: 396532
-> Hash Join (cost=33.16..4718.97 rows=44279 width=478) (actual time=0.141..30.449 rows=44255 loops=1)
Output: cfiles.property_values, cfiles.last_modified, cfiles.id, datasets.id
Inner Unique: true
Hash Cond: (cfiles.dataset_id = datasets.id)
-> Seq Scan on public.cfiles (cost=0.00..4568.77 rows=44279 width=478) (actual time=0.016..12.724 rows=44255 loops=1)
Output: cfiles.id, cfiles.tid, cfiles.uuid, cfiles.dataset_id, cfiles.path, cfiles.name, cfiles.checksum, cfiles.size, cfiles.last_modified, cfiles.content_type, cfiles.locked, cfiles.property_values, cfiles.created_at, cfiles.updated_at
Filter: (cfiles.tid = 5)
Rows Removed by Filter: 1567
-> Hash (cost=25.48..25.48 rows=614 width=8) (actual time=0.119..0.120 rows=614 loops=1)
Output: datasets.id
Buckets: 1024 Batches: 1 Memory Usage: 32kB
-> Index Only Scan using datasets_pkey on public.datasets (cost=0.28..25.48 rows=614 width=8) (actual time=0.010..0.057 rows=614 loops=1)
Output: datasets.id
Heap Fetches: 0
-> Materialize (cost=1.54..2.70 rows=9 width=38) (actual time=0.000..0.001 rows=9 loops=44255)
Output: group_permissions.cfile_id, group_permissions.dataset_id, groups.name, groups.id
-> Hash Left Join (cost=1.54..2.65 rows=9 width=38) (actual time=0.015..0.021 rows=9 loops=1)
Output: group_permissions.cfile_id, group_permissions.dataset_id, groups.name, groups.id
Inner Unique: true
Hash Cond: (group_permissions.group_id = groups.id)
-> Seq Scan on public.group_permissions (cost=0.00..1.09 rows=9 width=24) (actual time=0.002..0.003 rows=9 loops=1)
Output: group_permissions.id, group_permissions.tid, group_permissions.group_id, group_permissions.dataset_id, group_permissions.cfile_id, group_permissions.read, group_permissions.share, group_permissions.write_meta, group_permissions.manage_files, group_permissions.delete_files, group_permissions.task_id, group_permissions.notified_at, group_permissions.first_downloaded_at, group_permissions.created_at, group_permissions.updated_at
-> Hash (cost=1.24..1.24 rows=24 width=22) (actual time=0.009..0.010 rows=24 loops=1)
Output: groups.name, groups.id
Buckets: 1024 Batches: 1 Memory Usage: 10kB
-> Seq Scan on public.groups (cost=0.00..1.24 rows=24 width=22) (actual time=0.003..0.006 rows=24 loops=1)
Output: groups.name, groups.id
-> Materialize (cost=2.85..4.78 rows=52 width=42) (actual time=0.000..0.003 rows=52 loops=44520)
Output: user_permissions.cfile_id, user_permissions.dataset_id, users.email
-> Hash Left Join (cost=2.85..4.52 rows=52 width=42) (actual time=0.021..0.036 rows=52 loops=1)
Output: user_permissions.cfile_id, user_permissions.dataset_id, users.email
Inner Unique: true
Hash Cond: (user_permissions.user_id = users.id)
-> Seq Scan on public.user_permissions (cost=0.00..1.52 rows=52 width=24) (actual time=0.002..0.005 rows=52 loops=1)
Output: user_permissions.id, user_permissions.tid, user_permissions.user_id, user_permissions.dataset_id, user_permissions.cfile_id, user_permissions.read, user_permissions.share, user_permissions.write_meta, user_permissions.manage_files, user_permissions.delete_files, user_permissions.task_id, user_permissions.notified_at, user_permissions.first_downloaded_at, user_permissions.created_at, user_permissions.updated_at
-> Hash (cost=2.38..2.38 rows=38 width=34) (actual time=0.016..0.017 rows=38 loops=1)
Output: users.email, users.id
Buckets: 1024 Batches: 1 Memory Usage: 11kB
-> Seq Scan on public.users (cost=0.00..2.38 rows=38 width=34) (actual time=0.002..0.010 rows=38 loops=1)
Output: users.email, users.id
-> Hash (cost=1.38..1.38 rows=38 width=8) (actual time=0.009..0.010 rows=38 loops=1)
Output: user_groups.group_id
Buckets: 1024 Batches: 1 Memory Usage: 10kB
-> Seq Scan on public.user_groups (cost=0.00..1.38 rows=38 width=8) (actual time=0.002..0.005 rows=38 loops=1)
Output: user_groups.group_id
Planning Time: 0.990 ms
Execution Time: 11081.013 ms
(69 rows)
Time: 11143.367 ms (00:11.143)
I think you will have a hard time doing much better than this with your current schema. Could you normalize the data, so that you have a table with one row per (tid, "Sample Names", id) combination, or one row per unique "Sample Names" value, or per (tid, "Sample Names")?
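A minimal sketch of that normalization, assuming a hypothetical side table (all names here are illustrative) backfilled from the existing jsonb and kept in sync by the application or triggers:

```sql
-- One row per (tid, sample name, cfile) combination.
CREATE TABLE cfile_sample_names (
    tid         bigint NOT NULL,
    sample_name text   NOT NULL,
    cfile_id    bigint NOT NULL REFERENCES cfiles (id) ON DELETE CASCADE,
    PRIMARY KEY (tid, sample_name, cfile_id)
);

-- One-time backfill from the existing jsonb arrays.
INSERT INTO cfile_sample_names (tid, sample_name, cfile_id)
SELECT cf.tid, names.value, cf.id
FROM cfiles cf
CROSS JOIN LATERAL jsonb_array_elements_text(
         cf.property_values -> 'Sample Names') AS names(value)
ON CONFLICT DO NOTHING;

-- The grouping query can then paginate over an indexed column
-- instead of re-expanding every jsonb array on each request:
SELECT csn.sample_name,
       max(cf.last_modified) AS last_modified
FROM cfile_sample_names csn
JOIN cfiles cf ON cf.id = csn.cfile_id
WHERE csn.tid = 5
GROUP BY csn.sample_name
ORDER BY csn.sample_name DESC NULLS LAST
LIMIT 20 OFFSET 0;
```

The primary key doubles as the index for the tid filter and the sample_name ordering; the permission joins and aggregates could then be applied to just the 20 names per page rather than to every expanded array element.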
I don't think there is a general answer to "as well as filtering by columns from the other joined tables", though. That answer will depend on the selectivity of the filter and whether it is indexable.