Which PostgreSQL index is most efficient for a text column with similarity-based queries?
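I want to create an index on a text column for the following use case. We have a table, segments, with a column content of type text. We run similarity-based queries on it using pg_trgm. This is used in a translation editor to find similar strings.
Here are the table details: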
CREATE TABLE public.segments
(
    id integer NOT NULL DEFAULT nextval('segments_id_seq'::regclass),
    language_id integer NOT NULL,
    content text NOT NULL,
    created_at timestamp without time zone NOT NULL,
    updated_at timestamp without time zone NOT NULL,
    CONSTRAINT segments_pkey PRIMARY KEY (id),
    CONSTRAINT segments_language_id_fkey FOREIGN KEY (language_id)
        REFERENCES public.languages (id) MATCH SIMPLE
        ON UPDATE NO ACTION ON DELETE CASCADE,
    CONSTRAINT segments_content_language_id_key UNIQUE (content, language_id)
);
Here is the query (Ruby + Hanami):
def find_by_segment_match(source_text_for_lookup, source_lang, sim_score)
  aggregate(:translation_records)
    .where(language_id: source_lang)
    .where { similarity(:content, source_text_for_lookup) > sim_score/100.00 }
    .select_append { float::similarity(:content, source_text_for_lookup).as(:similarity) }
    .order { similarity(:content, source_text_for_lookup).desc }
end
---EDIT---
These are the generated queries:
SELECT "id", "language_id", "content", "created_at", "updated_at", SIMILARITY("content", 'This will not work.') AS "similarity" FROM "segments" WHERE (("language_id" = 2) AND (similarity("content", 'This will not work.') > 0.45)) ORDER BY SIMILARITY("content", 'This will not work.') DESC
SELECT "translation_records"."id", "translation_records"."source_segment_id", "translation_records"."target_segment_id", "translation_records"."domain_id",
"translation_records"."style_id",
"translation_records"."created_by", "translation_records"."updated_by", "translation_records"."project_name", "translation_records"."created_at", "translation_records"."updated_at", "translation_records"."language_combination", "translation_records"."uid",
"translation_records"."import_comment" FROM "translation_records" INNER JOIN "segments" ON ("segments"."id" = "translation_records"."source_segment_id") WHERE ("translation_records"."source_segment_id" IN (27548)) ORDER BY "translation_records"."id"
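---END EDIT---
---EDIT 1---
What about reindexing? Initially we will import about 2 million legacy records. When and how often should we rebuild the index?
---END EDIT 1---
Would something like CREATE INDEX ON segment USING gist (content) be OK? I can't really work out which of the available index types best suits our use case.
Best, seba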
CREATE INDEX segment_language_id_idx ON segments USING btree (language_id);
CREATE INDEX segment_content_gin ON segments USING gin (content gin_trgm_ops);
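The second query you show does not seem to be relevant to this question.
Your first query cannot use a trigram index, because the query has to be written in operator form, not function form, for the index to be usable.
In operator form, it would look like this: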
SELECT "id", "language_id", "content", "created_at", "updated_at", SIMILARITY("content", 'This will not work.') AS "similarity"
FROM segments
WHERE language_id = 2 AND content % 'This will not work.'
ORDER BY content <-> 'This will not work.';
For % to be equivalent to similarity("content", 'This will not work.') > 0.45, you first need to run set pg_trgm.similarity_threshold TO 0.45;
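As a rough sketch of how the threshold and the operator-form query might be combined, the setting can be scoped to a single transaction so it does not leak into other queries on the same connection. This assumes PostgreSQL 9.6 or later, where pg_trgm.similarity_threshold is available as a configuration parameter; the transaction wrapping is my own suggestion, not something stated in the question.

```sql
BEGIN;

-- SET LOCAL only lasts until the end of the current transaction.
SET LOCAL pg_trgm.similarity_threshold = 0.45;

SELECT id, language_id, content,
       similarity(content, 'This will not work.') AS similarity
FROM segments
WHERE language_id = 2
  AND content % 'This will not work.'          -- index-capable similarity test
ORDER BY content <-> 'This will not work.';    -- distance = 1 - similarity

COMMIT;
```

Ordering by the distance ascending gives the same row order as ordering by similarity descending.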
How you would get Ruby/Hanami to generate this form, I don't know.
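Either a gin_trgm_ops index or a gist_trgm_ops index can support the % operator. The <-> operator can only be supported by gist_trgm_ops. But it is hard to predict how efficient that support will be. If your "content" column is long, or the text being compared against is long, it is unlikely to be very efficient, especially in the case of GiST.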
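Since the efficiency is hard to predict, one way to find out is to build the index and look at the actual plan on your own data. The following is a minimal sketch; the index name segments_content_trgm_gist is hypothetical.

```sql
-- pg_trgm provides the % and <-> operators and the trigram operator classes.
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- GiST trigram index: can serve both the % filter and the <-> ordering.
CREATE INDEX segments_content_trgm_gist
    ON segments USING gist (content gist_trgm_ops);

-- Check whether the planner uses the index and how long the query takes.
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, content
FROM segments
WHERE language_id = 2
  AND content % 'This will not work.'
ORDER BY content <-> 'This will not work.';
```

If the plan still shows a sequential scan, or the index scan is slow, comparing against the gin_trgm_ops variant on the same data would be the next thing to try.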
Ideally you would partition the table by language_id. If not, then it might help to build a multicolumn index over both columns.
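A multicolumn index of that kind might look like the sketch below. It assumes the btree_gist extension can be installed, since GiST has no built-in operator class for integer columns; the index name is hypothetical.

```sql
-- btree_gist supplies GiST operator classes for plain scalar types such as integer.
CREATE EXTENSION IF NOT EXISTS btree_gist;
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- One index covering the equality filter on language_id and the trigram search on content.
CREATE INDEX segments_lang_content_gist
    ON segments USING gist (language_id, content gist_trgm_ops);
```

Whether this beats separate indexes depends on how the rows are distributed across languages, so it is worth verifying with EXPLAIN ANALYZE.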