Pattern-matching for JSON values: slow EXISTS subquery on materialized view
Running a local docker instance of Postgres 12.5 (4MB work_mem).
I'm implementing this pattern to search arbitrary fields in JSON. The goal is to quickly search the JSON column profile:
CREATE TABLE end_user (
id varchar NOT NULL,
environment_id varchar NOT NULL,
profile jsonb NOT NULL DEFAULT '{}'::jsonb,
CONSTRAINT end_user_pkey PRIMARY KEY (environment_id, id)
);
CREATE INDEX end_user_environment_id_idx ON private.end_user USING btree (environment_id);
CREATE INDEX end_user_id_idx ON private.end_user USING btree (id);
CREATE INDEX end_user_profile_idx ON private.end_user USING gin (profile);
CREATE MATERIALIZED VIEW user_profiles AS
SELECT u.environment_id, u.id, j.key, j.value
FROM end_user u, jsonb_each_text(u.profile) j(key, value);
CREATE UNIQUE INDEX on user_profiles (environment_id, id, key);
CREATE INDEX user_profile_trgm_idx ON user_profiles using gin (value gin_trgm_ops);
I have a query that is indexed correctly, so it executes in a few milliseconds over a million rows. ✅
select * from user_profiles
where value ilike '%auckland%' and key = 'timezone' and environment_id = 'test';
Execution time: 42 ms
Bitmap Heap Scan on user_profiles (cost=28935.65..62591.44 rows=9659 width=65)
Recheck Cond: ((value ~~* '%auckland%'::text) AND (key = 'timezone'::text))
Filter: ((environment_id)::text = 'test'::text)
-> BitmapAnd (cost=28935.65..28935.65 rows=9659 width=0)
-> Bitmap Index Scan on user_profile_trgm_idx (cost=0.00..2923.95 rows=320526 width=0)
Index Cond: (value ~~* '%auckland%'::text)
-> Bitmap Index Scan on user_profiles_key_idx (cost=0.00..26006.62 rows=994408 width=0)
Index Cond: (key = 'timezone'::text)
However, if I use it inside an exists query to build up conditions like this:
select * from end_user u
where
environment_id = 'test'
and exists (
select 1 from user_profiles p
where
value ilike '%auckland%'
and key = 'timezone'
and p.id = u.id
and environment_id = 'test'
)
it executes slowly.
Execution time: 17.44 s
Nested Loop (cost=62616.01..124606.45 rows=9658 width=1459) (actual time=19206.818..28444.491 rows=332572 loops=1)
Buffers: shared hit=952734 read=624101
-> HashAggregate (cost=62615.59..62707.52 rows=9193 width=15) (actual time=19205.238..19292.998 rows=332572 loops=1)
Group Key: (p.id)::text
Buffers: shared hit=373 read=246174
-> Bitmap Heap Scan on user_profiles p (cost=28935.65..62591.44 rows=9659 width=15) (actual time=278.211..18942.629 rows=332572 loops=1)
Recheck Cond: ((value ~~* '%auckland%'::text) AND (key = 'timezone'::text))
Rows Removed by Index Recheck: 17781109
Filter: ((environment_id)::text = 'test'::text)
Heap Blocks: exact=43928 lossy=197955
Buffers: shared hit=373 read=246174
-> BitmapAnd (cost=28935.65..28935.65 rows=9659 width=0) (actual time=272.626..272.629 rows=0 loops=1)
Buffers: shared hit=373 read=4291
-> Bitmap Index Scan on user_profile_trgm_idx (cost=0.00..2923.95 rows=320526 width=0) (actual time=177.577..177.577 rows=332572 loops=1)
Index Cond: (value ~~* '%auckland%'::text)
Buffers: shared hit=373 read=455
-> Bitmap Index Scan on user_profiles_key_idx (cost=0.00..26006.62 rows=994408 width=0) (actual time=92.586..92.589 rows=1000000 loops=1)
Index Cond: (key = 'timezone'::text)
Buffers: shared read=3836
-> Index Scan using end_user_id_idx on end_user u (cost=0.42..6.79 rows=1 width=1459) (actual time=0.027..0.027 rows=1 loops=332572)
Index Cond: ((id)::text = (p.id)::text)
Filter: ((environment_id)::text = 'test'::text)
Buffers: shared hit=952361 read=377927
Planning Time: 19.002 ms
Execution Time: 28497.570 ms
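This is a shame, because exists would be very convenient if it were fast: I could add more conditions dynamically in my application code, with each extra condition expressed as an extra exists clause.
As an aside, a lateral join does speed things up, but I don't understand why there is such a big difference: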
select * from end_user u,
lateral (
select id from user_profiles p
where
value ilike '%auckland%'
and key = 'timezone'
and environment_id = u.environment_id
and p.id = u.id
) ss
where u.environment_id = 'test';
Execution time: 304 ms
Gather (cost=29936.07..91577.38 rows=9658 width=1474) (actual time=1100.824..15430.620 rows=332572 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=1140551 read=436286
-> Nested Loop (cost=28936.07..89611.58 rows=4024 width=1474) (actual time=602.490..14805.285 rows=110857 loops=3)
Buffers: shared hit=1140551 read=436286
-> Parallel Bitmap Heap Scan on user_profiles p (cost=28935.65..62492.84 rows=4025 width=22) (actual time=602.078..12247.891 rows=110857 loops=3)
Recheck Cond: ((value ~~* '%auckland%'::text) AND (key = 'timezone'::text))
Rows Removed by Index Recheck: 5927036
Filter: ((environment_id)::text = 'test'::text)
Heap Blocks: exact=14659 lossy=65588
Buffers: shared hit=373 read=246174
-> BitmapAnd (cost=28935.65..28935.65 rows=9659 width=0) (actual time=1087.258..1087.259 rows=0 loops=1)
Buffers: shared hit=373 read=4291
-> Bitmap Index Scan on user_profile_trgm_idx (cost=0.00..2923.95 rows=320526 width=0) (actual time=853.075..853.076 rows=332572 loops=1)
Index Cond: (value ~~* '%auckland%'::text)
Buffers: shared hit=373 read=455
-> Bitmap Index Scan on user_profiles_key_idx (cost=0.00..26006.62 rows=994408 width=0) (actual time=231.295..231.295 rows=1000000 loops=1)
Index Cond: (key = 'timezone'::text)
Buffers: shared read=3836
-> Index Scan using end_user_id_idx on end_user u (cost=0.42..6.74 rows=1 width=1459) (actual time=0.022..0.022 rows=1 loops=332572)
Index Cond: ((id)::text = (p.id)::text)
Filter: ((environment_id)::text = 'test'::text)
Buffers: shared hit=1140178 read=190112
Planning Time: 16.877 ms
Execution Time: 15461.571 ms
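I'd love to know why the exists subquery is so slow, and what other options I could look at here.
Distinct counts, as requested by Erwin. Note that this is dummy data for load testing, but it is fairly close to production ratios: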
select count(distinct environment_id) => 4
, count(distinct key) => 33
, count(distinct value) => 15M
from private.user_profiles;
Update after increasing work_mem to 16MB, as Erwin suggested:
ALTER SYSTEM SET work_mem to '16MB';
SELECT pg_reload_conf();
The exists query now executes in 500 ms, which looks much better. The plan now looks like this:
Gather (cost=3926.79..400754.43 rows=9658 width=1459) (actual time=312.213..9396.610 rows=332572 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=1141083 read=431918
-> Nested Loop (cost=2926.79..398788.63 rows=4024 width=1459) (actual time=155.271..8987.721 rows=110857 loops=3)
Buffers: shared hit=1141083 read=431918
-> Parallel Bitmap Heap Scan on user_profiles p (cost=2926.36..371669.88 rows=4025 width=15) (actual time=150.989..2962.870 rows=110857 loops=3)
Recheck Cond: (value ~~* '%auckland%'::text)
Filter: (((environment_id)::text = 'test'::text) AND (key = 'timezone'::text))
Heap Blocks: exact=82556
Buffers: shared hit=981 read=241730
-> Bitmap Index Scan on user_profile_trgm_idx (cost=0.00..2923.95 rows=320526 width=0) (actual time=243.604..243.605 rows=332572 loops=1)
Index Cond: (value ~~* '%auckland%'::text)
Buffers: shared hit=828
-> Index Scan using end_user_id_idx on end_user u (cost=0.42..6.74 rows=1 width=1459) (actual time=0.054..0.054 rows=1 loops=332572)
Index Cond: ((id)::text = (p.id)::text)
Filter: ((environment_id)::text = 'test'::text)
Buffers: shared hit=1140102 read=190188
Planning Time: 9.932 ms
Execution Time: 9427.067 ms
Server configuration
The first problem becomes apparent in this line of the EXPLAIN output:
Heap Blocks: exact=14659 lossy=65588
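lossy means you don't have enough work_mem: the bitmap did not fit in memory, so it fell back to tracking whole pages instead of individual rows, and every row on those pages had to be rechecked (hence the millions of "Rows Removed by Index Recheck"). Your setting is obviously too low. (The default of 4 MB is too low for a database with tables of millions of rows.)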
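To check whether more work_mem removes the lossy heap blocks without touching the server-wide configuration, the setting can be raised for a single session first. A minimal sketch, assuming the end_user and user_profiles objects from the question; the 64MB value is only an illustrative guess, not a recommendation:

```sql
-- Raise work_mem for the current session only (64MB is an assumed test value).
SET work_mem = '64MB';

-- Re-run the slow query and inspect the "Heap Blocks" line of the plan:
-- with enough memory it should list only exact=... and no lossy=... blocks.
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM   end_user u
WHERE  u.environment_id = 'test'
AND    EXISTS (
   SELECT 1
   FROM   user_profiles p
   WHERE  p.value ILIKE '%auckland%'
   AND    p.key = 'timezone'
   AND    p.id = u.id
   AND    p.environment_id = 'test'
   );

-- Return to the configured default for this session.
RESET work_mem;
```

If the lossy count disappears at the test value, the same setting can then be made permanent with ALTER SYSTEM or per role or database, depending on how widely it should apply.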
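Chances are you need to do more work on the server configuration. In general, you seem to be limited on RAM. I see high "read" counts, which indicates a cold cache and/or insufficient or misconfigured cache memory.
This Postgres Wiki page can help you get started.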
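As a starting point for that review, the memory settings currently in effect can be read from pg_settings. A small sketch; which parameters matter most depends on the workload, so the list below is only a plausible selection:

```sql
-- Show current values, their units, and where each value was set
-- (default, configuration file, ALTER SYSTEM, session, ...).
SELECT name, setting, unit, source
FROM   pg_settings
WHERE  name IN ('shared_buffers',
                'work_mem',
                'effective_cache_size',
                'maintenance_work_mem');
```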
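SQL/JSON in Postgres 12 or later
The answer of mine you have been working from is outdated: back in July 2015, the current Postgres version was 9.4!
In Postgres 12 (which, as you later added, is what you are running) the whole design can be much simpler, using a regular expression in SQL/JSON. The manual:
SQL/JSON path expressions allow matching text to a regular expression with the like_regex filter.
And there is index support. Scrap the materialized view. All we need is your original table and an index like: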
CREATE INDEX end_user_path_ops_idx ON end_user USING GIN (profile jsonb_path_ops);
This query is equivalent to your original and can use the index:
SELECT *
FROM end_user u
WHERE environment_id = 'test'
AND profile @? '$.timezone ? (@ like_regex "auck" flag "i")';
db<>fiddle here
One downside: the SQL/JSON path language takes some getting used to.
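A few standalone expressions may help with getting used to the syntax. This is only a sketch: the JSON literals and the extra key country are made up for illustration and are not part of the schema in the question.

```sql
-- Case-insensitive regex match on a single key; returns true.
SELECT '{"timezone": "Pacific/Auckland"}'::jsonb
       @? '$.timezone ? (@ like_regex "auck" flag "i")';

-- Two conditions on the same document, combined with && in one filter;
-- returns true. "country" is a hypothetical key.
SELECT '{"timezone": "Pacific/Auckland", "country": "NZ"}'::jsonb
       @? '$ ? (@.timezone like_regex "auck" flag "i" && @.country == "NZ")';

-- Conditions can also be appended one at a time as separate predicates.
SELECT *
FROM   end_user
WHERE  environment_id = 'test'
AND    profile @? '$.timezone ? (@ like_regex "auck" flag "i")'
AND    profile @? '$.country ? (@ == "NZ")';
```

The last form plays the role of the extra exists clauses from the question: application code can add one predicate per dynamically chosen condition.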