PostgreSQL partition pruning not working when using unnest() in a subquery
PostgreSQL (13.4) is not able to come up with a query plan that uses execution-time partition pruning when unnest() is used in a subquery.
Given these tables:
CREATE TABLE users (
    user_id uuid,
    channel_id uuid,
    CONSTRAINT user_pk PRIMARY KEY(user_id, channel_id)
)
PARTITION BY hash(user_id);

CREATE TABLE users_0 PARTITION OF users FOR VALUES WITH (MODULUS 2, REMAINDER 0);
CREATE TABLE users_1 PARTITION OF users FOR VALUES WITH (MODULUS 2, REMAINDER 1);

CREATE TABLE channels (
    channel_id uuid,
    user_ids uuid[],
    CONSTRAINT channel_pk PRIMARY KEY(channel_id)
) PARTITION BY hash(channel_id);

CREATE TABLE channels_0 PARTITION OF channels FOR VALUES WITH (MODULUS 2, REMAINDER 0);
CREATE TABLE channels_1 PARTITION OF channels FOR VALUES WITH (MODULUS 2, REMAINDER 1);
Insert some data:
INSERT INTO users(user_id, channel_id) VALUES('0861180b-c972-42fe-9fb3-3b55e652f893', '45205876-7270-4e06-ab8d-b5f669298422');
INSERT INTO channels(channel_id, user_ids) VALUES('45205876-7270-4e06-ab8d-b5f669298422', '{0861180b-c972-42fe-9fb3-3b55e652f893}');

INSERT INTO users
SELECT
    gen_random_uuid() as user_id,
    gen_random_uuid() as channel_id
FROM generate_series(1, 100);

INSERT INTO channels
SELECT
    (SELECT max(channel_id::text) FROM (SELECT channel_id FROM users ORDER BY random()*generate_series LIMIT 1) c)::uuid as channel_id,
    (SELECT array_agg(DISTINCT user_id::text) FROM (SELECT user_id FROM users ORDER BY random()*generate_series LIMIT 1) u)::uuid[] as user_ids
FROM (SELECT * FROM generate_series(1, 100)) g
ON CONFLICT DO NOTHING;
The following query:
EXPLAIN ANALYZE
SELECT * FROM users
WHERE user_id IN (
SELECT unnest(user_ids) FROM channels WHERE channel_id = '45205876-7270-4e06-ab8d-b5f669298422'
)
AND channel_id = '45205876-7270-4e06-ab8d-b5f669298422'
returns a query plan that scans all partitions:
Hash Semi Join  (cost=8.45..37.28 rows=8 width=32) (actual time=0.208..0.387 rows=1 loops=1)
  Hash Cond: (users.user_id = (unnest(channels.user_ids)))
  ->  Append  (cost=0.00..28.71 rows=8 width=32) (actual time=0.037..0.134 rows=1 loops=1)
        ->  Seq Scan on users_0 users_1  (cost=0.00..27.00 rows=7 width=32) (actual time=0.021..0.041 rows=1 loops=1)
              Filter: (channel_id = '45205876-7270-4e06-ab8d-b5f669298422'::uuid)
              Rows Removed by Filter: 45
        ->  Seq Scan on users_1 users_2  (cost=0.00..1.68 rows=1 width=32) (actual time=0.018..0.027 rows=0 loops=1)
              Filter: (channel_id = '45205876-7270-4e06-ab8d-b5f669298422'::uuid)
              Rows Removed by Filter: 54
  ->  Hash  (cost=8.33..8.33 rows=10 width=16) (actual time=0.131..0.172 rows=1 loops=1)
        Buckets: 1024  Batches: 1  Memory Usage: 9kB
        ->  ProjectSet  (cost=0.15..8.23 rows=10 width=16) (actual time=0.060..0.114 rows=1 loops=1)
              ->  Index Scan using channels_0_pkey on channels_0 channels  (cost=0.15..8.17 rows=1 width=32) (actual time=0.040..0.059 rows=1 loops=1)
                    Index Cond: (channel_id = '45205876-7270-4e06-ab8d-b5f669298422'::uuid)
Planning Time: 0.363 ms
Execution Time: 0.515 ms
I expected PostgreSQL to run the subquery and use the returned user_ids to determine which partitions the data could be located in. Instead, PostgreSQL scans the data in all partitions. I tried using one row per user_id in the channels table, and that works fine:
EXPLAIN ANALYZE
SELECT * FROM users
WHERE user_id IN (
SELECT user_id FROM channels WHERE channel_id = '45205876-7270-4e06-ab8d-b5f669298422'
)
AND channel_id = '45205876-7270-4e06-ab8d-b5f669298422'
PostgreSQL then does not execute the plan steps for the partitions that cannot hold any matching data.
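For reference, the query above assumes the channels table was reshaped to hold one row per user rather than a user_ids array; a minimal sketch of that hypothetical layout (not part of the setup shown earlier) could look like this:

-- Sketch only: channels with one row per (channel, user) instead of a user_ids array,
-- so the subquery can select user_id directly.
CREATE TABLE channels (
    channel_id uuid,
    user_id uuid,
    CONSTRAINT channel_pk PRIMARY KEY(channel_id, user_id)
) PARTITION BY hash(channel_id);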
It looks like unnest() prevents execution-time partition pruning from being used. Why is that?
Solution:
I can confirm jjanes' solution. After adding 100k rows to the tables, the query that uses unnest() is pruned to the relevant partitions at execution time.
Your example only shows what the planner happens to use for this one query, not everything it is capable of doing.
Your tables are ridiculously small. Put another 10,000 rows into the users table, so that the index actually matters, and see what that does:
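A minimal sketch of how those extra rows could be generated, assuming the same gen_random_uuid()/generate_series pattern as the setup above (the exact values don't matter):

INSERT INTO users
SELECT
    gen_random_uuid() as user_id,    -- random keys spread across both hash partitions
    gen_random_uuid() as channel_id
FROM generate_series(1, 10000);
ANALYZE users;  -- refresh statistics so the planner sees the new row counts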
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop  (cost=1.85..151.99 rows=2 width=32) (actual time=0.040..0.042 rows=1 loops=1)
  ->  HashAggregate  (cost=1.57..1.67 rows=10 width=16) (actual time=0.022..0.023 rows=1 loops=1)
        Group Key: unnest(channels.user_ids)
        Batches: 1  Memory Usage: 24kB
        ->  ProjectSet  (cost=0.00..1.44 rows=10 width=16) (actual time=0.016..0.019 rows=1 loops=1)
              ->  Seq Scan on channels_0 channels  (cost=0.00..1.39 rows=1 width=37) (actual time=0.013..0.016 rows=1 loops=1)
                    Filter: (channel_id = '45205876-7270-4e06-ab8d-b5f669298422'::uuid)
                    Rows Removed by Filter: 30
  ->  Append  (cost=0.28..15.01 rows=2 width=32) (actual time=0.016..0.017 rows=1 loops=1)
        ->  Index Only Scan using users_0_pkey on users_0 users_1  (cost=0.28..7.50 rows=1 width=32) (actual time=0.014..0.015 rows=1 loops=1)
              Index Cond: ((user_id = (unnest(channels.user_ids))) AND (channel_id = '45205876-7270-4e06-ab8d-b5f669298422'::uuid))
              Heap Fetches: 1
        ->  Index Only Scan using users_1_pkey on users_1 users_2  (cost=0.28..7.50 rows=1 width=32) (never executed)
              Index Cond: ((user_id = (unnest(channels.user_ids))) AND (channel_id = '45205876-7270-4e06-ab8d-b5f669298422'::uuid))
              Heap Fetches: 0
Planning Time: 0.470 ms
Execution Time: 0.087 ms
The "(never executed)" marker is due to execution-time pruning.