How is PostgreSQL citext stored in a b-tree index? Lower case or as it is?

Question

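I use citext for all text columns in PostgreSQL, and I would like to understand how citext performs.

I ran a simple benchmark with a WHERE clause on a text column that has a b-tree index, but I could not see any difference in query cost.

For example: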

Select * From table_text where a = '1';

Select * From table_citext where a= '1';

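These queries have the same query cost.

As I understand it, citext stores the string as it is, without converting it to lower case. So when a value is used in a WHERE clause, the lower function would have to be applied in every comparison at every b-tree node that is visited (I used a b-tree index).

If that were true, it should cause a performance problem, but in practice it does not.

How does PostgreSQL achieve this?
How does PostgreSQL store citext column values in a b-tree index?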

Answer 1

citext is stored as entered, without conversion to lower case. The same holds for its storage as a b-tree index key.
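
A minimal sketch of this behavior, assuming the citext extension can be installed in the current database: the value comes back exactly as entered, while equality between citext values ignores case.

```sql
-- Assumes sufficient privileges to install the extension.
CREATE EXTENSION IF NOT EXISTS citext;

-- The value keeps its original case: this returns MaMa.
SELECT 'MaMa'::citext AS stored_value;

-- Comparison between citext values ignores case: this returns true.
SELECT 'MaMa'::citext = 'mama'::citext AS equal_as_citext;

-- Cast to text, the same value compares case-sensitively: this returns false.
SELECT 'MaMa'::citext::text = 'mama'::text AS equal_as_text;
```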

The magic happens in the comparison function for citext:

/*
 * citextcmp()
 * Internal comparison function for citext strings.
 * Returns int32 negative, zero, or positive.
 */
static int32
citextcmp(text *left, text *right, Oid collid)
{
    char       *lcstr,
               *rcstr;
    int32       result;

    /*
     * We must do our str_tolower calls with DEFAULT_COLLATION_OID, not the
     * input collation as you might expect.  This is so that the behavior of
     * citext's equality and hashing functions is not collation-dependent.  We
     * should change this once the core infrastructure is able to cope with
     * collation-dependent equality and hashing functions.
     */

    lcstr = str_tolower(VARDATA_ANY(left), VARSIZE_ANY_EXHDR(left), DEFAULT_COLLATION_OID);
    rcstr = str_tolower(VARDATA_ANY(right), VARSIZE_ANY_EXHDR(right), DEFAULT_COLLATION_OID);

    result = varstr_cmp(lcstr, strlen(lcstr),
                        rcstr, strlen(rcstr),
                        collid);

    pfree(lcstr);
    pfree(rcstr);

    return result;
}

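So yes, this should incur some overhead. How expensive it is also depends on the default collation of the database.

I will demonstrate this with a query that does not use an index. I am using a German collation: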

SHOW lc_collate;
 lc_collate 
------------
 de_DE.utf8
(1 row)
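
On releases where `SHOW lc_collate` is not available (to my knowledge the parameter was removed in PostgreSQL 16), the same setting can be read from the catalog. This is a sketch that assumes a libc-based default collation; with an ICU-based default the relevant locale is kept in a different column.

```sql
-- datcollate holds the LC_COLLATE setting of each database.
SELECT datname, datcollate
FROM pg_database
WHERE datname = current_database();
```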

First with text:

CREATE TABLE large_text(t text NOT NULL);

INSERT INTO large_text
   SELECT i||'text'
   FROM generate_series(1, 1000000) AS i;

VACUUM (FREEZE, ANALYZE) large_text;

\timing on

SELECT * FROM large_text WHERE t = TEXT 'mama';
 t 
---
(0 rows)

Time: 79.862 ms

Now the same experiment with citext:

CREATE TABLE large_citext(t citext NOT NULL);

INSERT INTO large_citext
   SELECT i||'text'
   FROM generate_series(1, 1000000) AS i;

VACUUM (FREEZE, ANALYZE) large_citext;

\timing on

SELECT * FROM large_citext WHERE t = CITEXT 'mama';
 t 
---
(0 rows)

Time: 567.739 ms

So citext is about seven times slower.
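
To confirm that both timings come from a sequential scan, and to see the server-side execution time without client overhead, the same statements can be run under EXPLAIN. This is only a sketch; plans and timings depend on the system, so no output is shown.

```sql
-- Expect a sequential scan on each table as long as no index exists.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM large_text WHERE t = TEXT 'mama';

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM large_citext WHERE t = CITEXT 'mama';
```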

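But don't forget that each of these experiments performed a sequential scan with one million comparisons.
If you use an index, the difference will hardly be noticeable: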

CREATE INDEX ON large_text (t);

Time: 5443.993 ms (00:05.444)

SELECT * FROM large_text WHERE t = TEXT 'mama';
 t 
---
(0 rows)

Time: 1.867 ms


CREATE INDEX ON large_citext (t);

Time: 28009.904 ms (00:28.010)

SELECT * FROM large_citext WHERE t = CITEXT 'mama';
 t 
---
(0 rows)

Time: 1.988 ms

You can see that CREATE INDEX takes much longer on the citext column (it has to perform a lot of comparisons), but the queries take about the same time.

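The reason is that an index scan needs only very few comparisons: for each of the 2-3 index blocks you visit, you perform a binary search, and in the case of a bitmap index scan you may have to recheck the table rows that were found.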
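
The number of index blocks visited per lookup can be estimated from the height of the b-tree. This is a sketch, assuming the pageinspect extension is available, the session has superuser rights, and the index received the default name large_citext_t_idx.

```sql
CREATE EXTENSION IF NOT EXISTS pageinspect;

-- "level" is the level of the root page; leaf pages are level 0,
-- so a descent from root to leaf reads roughly level + 1 index pages.
SELECT level FROM bt_metap('large_citext_t_idx');

-- Shows which scan type the planner actually chooses for the lookup.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM large_citext WHERE t = CITEXT 'mama';
```

Each page visited is searched with a binary search, so the lookup costs a few dozen comparisons at most, compared with one million for the sequential scan.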
