Postgres full text search and spelling mistakes (aka fuzzy full text search)
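I have a scenario where I have informal communications data that I need to be able to search. So I want full text search, but I also want to make allowance for spelling mistakes. The question is: how do I take spelling mistakes into account so that I can do fuzzy full text search?
This is discussed very briefly in the article Postgres Full Text Search is Good Enough, in the part that covers misspellings.
So I have built a table of "documents", created indexes, etc.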
CREATE TABLE data (
id int GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
text TEXT NOT NULL);
I can create an additional column of type tsvector and index it accordingly...
alter table data
add column search_index tsvector
generated always as (to_tsvector('english', coalesce(text, '')))
STORED;
create index search_index_idx on data using gin (search_index);
For example, I have some text where the data says "baloon", but someone may search for "balloon", so I insert two rows (one deliberately misspelled)...
insert into data (text) values ('baloon');
insert into data (text) values ('balloon');
select * from data;
id | text | search_index
----+---------+--------------
1 | baloon | 'baloon':1
2 | balloon | 'balloon':1
...and perform a full text search on the data...
select * from data where search_index @@ plainto_tsquery('balloon');
id | text | search_index
----+---------+--------------
2 | balloon | 'balloon':1
(1 row)
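But I don't get a result for the misspelled version "baloon"... so, following the suggestion in the linked article, I have built a lookup table of all the words in my lexicon as follows...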
"you may obtain good results by appending the similar lexeme to your tsquery"
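-- gin_trgm_ops and similarity() come from the pg_trgm extension
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE TABLE data_words AS SELECT word FROM ts_stat('SELECT to_tsvector(''simple'', text) FROM data');
CREATE INDEX data_words_idx ON data_words USING GIN (word gin_trgm_ops);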
...and I can search for similar words that may have been misspelled
select word, similarity(word, 'balloon') as similarity from data_words where similarity(word, 'balloon') > 0.4 order by similarity(word, 'balloon');
word | similarity
---------+------------
baloon | 0.6666667
balloon | 1
...but how do I actually include the misspelled words in my query?
Is this what the article above means?
select plainto_tsquery('balloon' || ' ' || (select string_agg(word, ' ') from data_words where similarity(word, 'balloon') > 0.4));
plainto_tsquery
----------------------------------
'balloon' & 'baloon' & 'balloon'
(1 row)
...plugging this into an actual search, I don't get any rows!
select * from data where text @@ plainto_tsquery('balloon' || ' ' || (select string_agg(word, ' ') from data_words where similarity(word, 'balloon') > 0.4));
select * from data where search_index @@ phraseto_tsquery('baloon balloon'); -- no rows returned
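I'm not sure where I'm going wrong. Can anyone shed some light? I feel like I'm very close to getting this going...?
plainto_tsquery joins its terms with & (AND), as the output above shows, so a row would have to contain every spelling at once. Join the similar words with | (OR) and use to_tsquery instead:
SELECT to_tsquery('balloon | ' ||
string_agg(word, ' | ')
)
FROM data_words
WHERE similarity(word, 'balloon') > 0.4;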
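To run the actual search, the OR-ed tsquery can be used as a scalar subquery on the right-hand side of @@. What follows is only a sketch, assuming the data and data_words tables from the question, the pg_trgm extension, and the same 'english' configuration that search_index was built with. The coalesce guard is my own addition, so that the query falls back to the plain search term when no similar word is found, and it assumes the words in data_words contain no characters that are special in tsquery syntax.

```sql
-- Fuzzy full text search: match rows containing the search term
-- or any word from data_words that is trigram-similar to it.
SELECT d.id, d.text
FROM data AS d
WHERE d.search_index @@ (
    SELECT to_tsquery(
               'english',  -- same configuration as the search_index column
               -- string_agg() is NULL when nothing is similar enough;
               -- coalesce() then leaves just the original term.
               'balloon' || coalesce(' | ' || string_agg(word, ' | '), '')
           )
    FROM data_words
    WHERE similarity(word, 'balloon') > 0.4
);
```

With the two sample rows this should return both "baloon" and "balloon". Note that a condition of the form similarity(word, ...) > 0.4 is not something the trigram index on data_words can serve; if that lookup becomes slow, the pg_trgm operator word % 'balloon' (with pg_trgm.similarity_threshold set as needed) is the index-supported form.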