Postgres pg_trgm - why ordering by similarity is very slow

I have a table "Users" with a "displayName" (text) column that has a pg_trgm GIN index.

CREATE INDEX "Users-displayName-pg-trgm-index"
  ON "Users"
  USING gin
  ("displayName" COLLATE pg_catalog."default" gin_trgm_ops);

Here is my query:

SELECT "User"."id"
    ,"User"."displayName"
    ,"User"."firstName"
    ,"User"."lastName"
    ,"User"."email"
    ,"User"."password"
    ,"User"."isVerified"
    ,"User"."isBlocked"
    ,"User"."verificationToken"
    ,"User"."birthDate"
    ,"User"."gender"
    ,"User"."isPrivate"
    ,"User"."role"
    ,"User"."coverImageUrl"
    ,"User"."profileImageUrl"
    ,"User"."facebookId"
    ,"User"."deviceType"
    ,"User"."deviceToken"
    ,"User"."coins"
    ,"User"."LocaleId"
    ,"User"."createdAt"
    ,"User"."updatedAt"
FROM "Users" AS "User"
WHERE (similarity("User"."displayName", 'John') > 0.2)
ORDER BY similarity("User"."displayName", 'John')
    ,"User"."id" ASC LIMIT 25;

The query above takes ~200ms to return results. When I remove

ORDER BY similarity("User"."displayName", 'John')

and sort only by id, the query takes about 30ms.

I am querying a table with 50k users.

Here is the explain analyze: http://explain.depesz.com/s/lXC

For some reason I don't see any index usage (the GIN pg_trgm index on displayName).


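It seems that when I replace the line

WHERE (similarity("User"."displayName", 'John') > 0.2)

with

WHERE ("User"."displayName" % 'John')

the query is super fast. Can anyone tell me why? I thought the % operator just checks whether similarity(...) is greater than the threshold... so what is the difference?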

PostgreSQL does not use indexes for function calls; it only uses them for operators.

The query that orders by similarity() calls that function for every row and then sorts the rows.

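The query that uses % uses the index and runs the similarity function only on the rows that match (there is no index-only scan for functions).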
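One way to see this difference is to compare the plans of the two predicates with EXPLAIN. The following is only a sketch: the table and column names are taken from the question, pg_trgm is assumed to be installed, and the threshold is set to 0.2 so that % roughly matches the question's filter. The actual plans depend on the PostgreSQL version and on table statistics, so no output is shown.

```sql
-- % compares similarity against the session threshold (0.3 by default).
SELECT show_limit();
SELECT set_limit(0.2);

-- Function-call form: the planner does not match this against the
-- trigram index, so a sequential scan is the typical result.
EXPLAIN ANALYZE
SELECT "id", "displayName"
FROM "Users"
WHERE similarity("displayName", 'John') > 0.2;

-- Operator form: % is part of gin_trgm_ops, so the GIN index can be used.
EXPLAIN ANALYZE
SELECT "id", "displayName"
FROM "Users"
WHERE "displayName" % 'John';
```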

If you want the rows with similarity greater than 0.2, ordered by least similarity first (as in the question), you should use the distance operator <->.

Like this:

WHERE "User"."displayName" <-> 'John' < 0.8
ORDER BY "User"."displayName" <-> 'John' DESC

The distance is 1 - similarity, hence the 0.8.
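The relationship between the two can be checked directly. This is a small sketch with an arbitrary example string ('Jon'), assuming the pg_trgm extension is installed:

```sql
-- <-> is one minus similarity(), so "total" should come out as
-- approximately 1 (both values are single-precision floats).
SELECT similarity('John', 'Jon')                      AS sim,
       'John' <-> 'Jon'                               AS dist,
       similarity('John', 'Jon') + ('John' <-> 'Jon') AS total;
```

This is why a filter of similarity > 0.2 corresponds to distance < 0.8.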

I cannot add a comment to the answer above. I think that for the following query:

WHERE "User"."displayName" <-> 'John' < 0.8
ORDER BY "User"."displayName" <-> 'John' DESC

the index will not be used either.

You can use the following query instead:

SELECT set_limit(0.2);
...
WHERE "User"."displayName" % 'John'
ORDER BY "User"."displayName" <-> 'John' DESC
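On newer PostgreSQL versions (9.6 and later, as far as I know) the threshold can also be set through the pg_trgm.similarity_threshold setting, and the pg_trgm documentation marks set_limit() as deprecated in favour of it. A sketch of the same query in that form, assuming such a version and with the column list shortened for brevity:

```sql
-- Session-level threshold used by the % operator.
SET pg_trgm.similarity_threshold = 0.2;

SELECT "User"."id", "User"."displayName"
FROM "Users" AS "User"
WHERE "User"."displayName" % 'John'              -- operator form, can use the trigram index
ORDER BY "User"."displayName" <-> 'John' DESC,   -- least similar first, as in the question
         "User"."id" ASC
LIMIT 25;
```

With this shape the index narrows down the candidate rows, and only those rows are sorted.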

In my experience, a GiST index works better/faster for ordering by similarity.

In this example, I have a customer table with about 500k rows.

select *,similarity(coalesce(details::text,'') || coalesce(name,''),'9') 
  from customer 
  order by (coalesce(details::text,'') || coalesce(name,'')) <-> '9' 
  asc limit 50;

Without any index, the query takes about 8.5 seconds. Query plan:

                              QUERY PLAN                                          
-----------------------------------------------------------------------------------
 Limit  (cost=47687.03..47687.16 rows=50 width=1144)
   ->  Sort  (cost=47687.03..49184.52 rows=598995 width=1144)
         Sort Key: (((COALESCE((details)::text, ''::text) ||
                     (COALESCE(name, ''::character varying))::text) <-> '9'::text))
         ->  Seq Scan on customer  (cost=0.00..27788.85 rows=598995 width=1144)
(4 rows)

After adding a GIN index:

CREATE INDEX ON customer USING gin ((coalesce(details::text,'') || coalesce(name,'')) gin_trgm_ops);

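nothing changes. The query plan still looks the same and the query still takes about 8.5 seconds to complete. No index is used for the ordering.

After creating a GiST index: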

CREATE INDEX ON customer USING gist ((coalesce(details::text,'') || coalesce(name,'')) gist_trgm_ops);

the query takes about 240 ms, and the query plan shows that the index is being used:

                     QUERY PLAN                         
--------------------------------------------------------------------------
 Limit  (cost=0.42..10.19 rows=50 width=1144)
   ->  Index Scan using customer_expr_idx1 on customer  (cost=0.42..117106.73 rows=598995 width=1144)
     Order By: ((COALESCE((details)::text, ''::text) || 
                (COALESCE(name, ''::character varying))::text) <-> '9'::text)
(3 rows) 

Out of curiosity, the returned rows look like this:

   id   |           name           |        details         | similarity 
--------+--------------------------+------------------------+------------
     25 | Generic Company (9) Inc. |                        |  0.0909091
    125 | Generic Company (9) Inc. |                        |  0.0909091
 268649 | 9bg1ubTCYo7mMcDaHmCC     | { "fatty": "McDaddy" } |  0.0294118
 470217 | 9hSXtDmW9cXvKk4Q6McD     | { "fatty": "McDaddy" } |  0.0285714
 180775 | 9pRPi1w9nqV9999g2ceo     | { "fatty": "McDaddy" } |  0.0285714
 162931 | 9qMyYbWNJLZdv7uYYbOl     | { "fatty": "McDaddy" } |  0.0285714
 176961 | 9ow1NcTjAmCDyRsapDl4     | { "fatty": "McDaddy" } |  0.0285714
   ... etc ...
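Applied to the table from the question, the same idea might look like the sketch below. The index name is hypothetical, and the query orders by ascending distance (most similar first), because an index scan on <-> can only return rows in nearest-first order; the question's least-similar-first ordering would still need a sort step.

```sql
-- Hypothetical index name; gist_trgm_ops supports both % and <->.
CREATE INDEX "Users-displayName-gist-trgm-index"
  ON "Users"
  USING gist ("displayName" gist_trgm_ops);

SELECT set_limit(0.2);

-- Check whether the plan shows an Index Scan with an "Order By" line.
EXPLAIN ANALYZE
SELECT "id", "displayName"
FROM "Users"
WHERE "displayName" % 'John'
ORDER BY "displayName" <-> 'John'
LIMIT 25;
```

The id tiebreaker from the question is left out here, since an extra sort key may prevent the planner from using the index ordering directly.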