Quick Search with autocorrect (GIN INDEX and PG_TRGM extension)
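I'm testing a simple search mechanism to handle SMALL typos/misspellings, similar to an autocorrect mechanism.
I've been struggling with this, so I'm creating a function (pl/pgsql) to handle it. I'm running on SUPABASE.IO, PostgreSQL 13.3 (similar to RDS).
I would like to:
- LIMIT the returned results to only the highly similar email addresses, say similarity > 0.7;
- use an INDEX, because the actual email list will reach tens of millions of rows, so the query must return in under a second.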
DROP TABLE IF EXISTS email;
CREATE TABLE email (
email_address TEXT NOT NULL UNIQUE,
person_id UUID NOT NULL,
CONSTRAINT email_pk PRIMARY KEY (email_address)
);
DROP INDEX IF EXISTS email_address_trigram_idx;
CREATE INDEX email_address_trigram_idx ON email USING gin(email_address gin_trgm_ops);
INSERT INTO email(email_address, person_id) VALUES
('test100@gmail.com', uuid_generate_v4())
, ('100test@gmail.com', uuid_generate_v4())
, ('testoo1000@gmail.com', uuid_generate_v4())
, ('test1001@gmail.com', uuid_generate_v4())
, ('test100@gmial.com', uuid_generate_v4())
, ('test200@gmail.com', uuid_generate_v4())
, ('200test@gmail.com', uuid_generate_v4())
, ('testoo2000@gmail.com', uuid_generate_v4())
, ('test2002@gmail.com', uuid_generate_v4())
, ('test200@gmial.com', uuid_generate_v4())
, ('test300@gmail.com', uuid_generate_v4())
, ('300test@gmail.com', uuid_generate_v4())
, ('testoo3000@gmail.com', uuid_generate_v4())
, ('test3003@gmail.com', uuid_generate_v4())
, ('test300@gmial.com', uuid_generate_v4())
, ('test400@gmail.com', uuid_generate_v4())
, ('400test@gmail.com', uuid_generate_v4())
, ('testoo4000@gmail.com', uuid_generate_v4())
, ('test4004@gmail.com', uuid_generate_v4())
, ('test400@gmial.com', uuid_generate_v4())
, ('tset100@gmail.com', uuid_generate_v4())
, ('100tset@gmail.com', uuid_generate_v4())
, ('tsetoo1000@gmail.com', uuid_generate_v4())
, ('tset1001@gmail.com', uuid_generate_v4())
, ('tset100@gmial.com', uuid_generate_v4())
, ('tset200@gmail.com', uuid_generate_v4())
, ('200tset@gmail.com', uuid_generate_v4())
, ('tsetoo2000@gmail.com', uuid_generate_v4())
, ('tset2002@gmail.com', uuid_generate_v4())
, ('tset200@gmial.com', uuid_generate_v4())
, ('tset300@gmail.com', uuid_generate_v4())
, ('300tset@gmail.com', uuid_generate_v4())
, ('tsetoo3000@gmail.com', uuid_generate_v4())
, ('tset3003@gmail.com', uuid_generate_v4())
, ('tset300@gmial.com', uuid_generate_v4())
, ('tset400@gmail.com', uuid_generate_v4())
, ('400tset@gmail.com', uuid_generate_v4())
, ('tsetoo4000@gmail.com', uuid_generate_v4())
, ('tset4004@gmail.com', uuid_generate_v4())
, ('tset400@gmial.com', uuid_generate_v4())
, ('different_email@yahoo.com', uuid_generate_v4());
SET pg_trgm.similarity_threshold = 0.8; -- This doesn't seem to affect my queries
SELECT *, similarity('tesd100@gmail.com', email_address)
FROM email
WHERE email_address % 'tesd100@gmail.com';
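I want a way to search quickly while still tolerating a few small typos in the search term.
First off, your table definition creates two unique indexes on (email_address). Don't. Drop the UNIQUE constraint and keep the PK: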
CREATE TABLE email (
email_address text PRIMARY KEY
, person_id uuid NOT NULL -- bigint?
);
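To see which indexes a given table definition actually produced, you can inspect the pg_indexes system view. This is only a quick check, assuming the table is named email and lives in the public schema:

```sql
-- List every index that currently exists on the table "email".
-- Assumption: the table is in the "public" schema; adjust schemaname otherwise.
SELECT indexname, indexdef
FROM   pg_indexes
WHERE  schemaname = 'public'
AND    tablename  = 'email'
ORDER  BY indexname;
```

Any unique index on (email_address) beyond the one backing the primary key is redundant.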
(Also not sure why person_id needs to be uuid. There are hardly enough people in the world to justify anything bigger than bigint.)
Next, since you want to ...
LIMIT the returned results to only the highly similar email addresses,
I suggest a nearest-neighbor search. Create a GiST index for this purpose instead of GIN:
CREATE INDEX email_address_trigram_gist_idx ON email USING gist (email_address gist_trgm_ops);
And use a query like this:
SELECT *, similarity('tesd100@gmail.com', email_address)
FROM email
WHERE email_address % 'tesd100@gmail.com'
ORDER BY email_address <-> 'tesd100@gmail.com' -- note the use of the operator <->
LIMIT 10;
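To check whether the query above really uses the GiST index, you can look at the plan. A minimal sketch follows; note that with only a few dozen test rows the planner will probably prefer a sequential scan, so discouraging sequential scans for the current session (a testing aid only, not something for production) may be needed to see the index path:

```sql
-- Testing aid only: discourage sequential scans in this session
-- so the planner shows the index path even on a tiny table.
SET enable_seqscan = off;

EXPLAIN (ANALYZE, BUFFERS)
SELECT *, similarity('tesd100@gmail.com', email_address)
FROM   email
WHERE  email_address % 'tesd100@gmail.com'
ORDER  BY email_address <-> 'tesd100@gmail.com'
LIMIT  10;

-- Restore the default planner behavior afterwards.
RESET enable_seqscan;
```

On a table with millions of rows the extra setting should not be necessary; look for an index scan on email_address_trigram_gist_idx in the output.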
Quoting the manual:
This can be implemented quite efficiently by GiST indexes, but not by
GIN indexes. It will usually beat the first formulation when only a
small number of the closest matches is wanted.
With a small LIMIT, there is probably no need to set pg_trgm.similarity_threshold very high, since this query gives you the best matches first.
Related:
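If you still want a hard cutoff such as the 0.7 from the question, one option is to set the threshold for a single transaction only, so other sessions and later queries keep the default (0.3). A sketch under that assumption:

```sql
BEGIN;

-- SET LOCAL lasts only until the end of this transaction.
SET LOCAL pg_trgm.similarity_threshold = 0.7;

-- Confirm the value the % operator will use.
SHOW pg_trgm.similarity_threshold;

SELECT *, similarity('tesd100@gmail.com', email_address)
FROM   email
WHERE  email_address % 'tesd100@gmail.com'      -- filtered by the threshold above
ORDER  BY email_address <-> 'tesd100@gmail.com' -- closest matches first
LIMIT  10;

COMMIT;
```

The % operator reads the threshold when the query runs, so the setting has to be in effect in the same session that executes the query.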
- Search in 300 million addresses with pg_trgm
- Finding similar strings with PostgreSQL quickly