PostgreSQL 的两个索引哪个更有效?
Which of the two PostgreSQL indexes is more efficient?
我有以下 PostgreSQL 模式:
CREATE TABLE User (
ID INTEGER PRIMARY KEY
);
CREATE TABLE BOX (
ID INTEGER PRIMARY KEY
);
CREATE SEQUENCE seq_item;
CREATE TABLE Item (
ID INTEGER PRIMARY KEY DEFAULT nextval('seq_item'),
SENDER INTEGER REFERENCES User(id),
RECEIVER INTEGER REFERENCES User(id),
INFO TEXT,
BOX_ID INTEGER REFERENCES Box(id) NOT NULL,
ARRIVAL TIMESTAMP
);
它的主要用例是一个典型的producer/consumer场景。不同的用户可以在数据库中为特定用户在特定的盒子中插入一个项目,每个用户都可以在一个地址为 her/him 的盒子中检索最上面的(这意味着最旧的)项目。它或多或少地模仿了数据库级别队列的功能。
更准确地说,最常见的操作如下:
INSERT INTO ITEM(SENDER, RECEIVER, INFO, BOX_ID, ARRIVAL)
VALUES (nsid, nrid, ncontent, nqid, ntime);
并根据 RECEIVER+SENDER
或 RECEIVER+BOX_ID
:
的组合检索命令
SELECT * INTO it FROM Item i WHERE (i.RECEIVER=? OR i.RECEIVER is NULL) AND
(i.BOX_ID=?) ORDER BY ARRIVAL LIMIT 1;
DELETE FROM Item i WHERE i.id=it.id;
和
SELECT * INTO it FROM Item i WHERE (i.RECEIVER=? OR i.RECEIVER is NULL) AND
(i.SENDER=?) ORDER BY ARRIVAL LIMIT 1;
DELETE FROM Item i WHERE i.id=it.id;
最后两个片段打包在一个存储过程中。
我考虑过使用两个不同的索引。
1. CREATE INDEX ind ON item(arrival);
。上述 SELECT
的 EXPLAIN
计划如下:
Limit (cost=0.29..2.07 rows=1 width=35)
-> Index Scan using ind on item i (cost=0.29..3010.81 rows=1693 width=35)
Filter: (((receiver = 2) OR (receiver IS NULL)) AND (sender = 2))
据我了解,这种方法的优点是我避免了对数据进行排序。然而,据我了解,我仍然需要扫描整个 table,但访问将是随机的,这会减慢执行速度。我不确定的是,由于LIMIT 1
,一旦找到匹配项,执行是否会立即停止,或者它将始终扫描整个table。
2. CREATE INDEX ind ON item(receiver, sender);
EXPLAIN
:
Limit (cost=512.23..512.23 rows=1 width=35)
-> Sort (cost=512.23..516.46 rows=1693 width=35)
Sort Key: arrival
-> Bitmap Heap Scan on message m (cost=42.37..503.76 rows=1693 width=35)
Recheck Cond: (((receiver = 2) AND (sender = 2)) OR ((receiver IS NULL) AND (sender = 2)))
-> BitmapOr (cost=42.37..42.37 rows=1693 width=0)
-> Bitmap Index Scan on ind (cost=0.00..37.22 rows=1693 width=0)
Index Cond: ((receiver = 2) AND (sender = 2))
-> Bitmap Index Scan on ind (cost=0.00..4.30 rows=1 width=0)
Index Cond: ((receiver IS NULL) AND (sender = 2))
在这种情况下,我可以高效地找到 receiver
和 sender
的匹配项,但之后我需要对结果进行排序,这可能会很慢。
那么这两个选项中哪个更好,为什么?第一个的估计成本要低得多,但第二个似乎要多 "deterministic".
对于此查询:
SELECT * INTO it
FROM Item i
WHERE (i.RECEIVER = ? OR i.RECEIVER is NULL) AND
(i.SENDER = ?)
ORDER BY ARRIVAL
LIMIT 1;
最好的索引可能是 item(sender, arrival, receiver)
,按此顺序。这将按发件人过滤,然后使用索引进行排序,然后再次按接收者过滤。
最快的方法可能是:
select *
from ((select i.*
from item i
where receiver = ? and sender = ?
order by arrival
limit 1
) union all
(select i.*
from item i
where receiver is null and sender = ?
order by arrival
limit 1
)
) i
order by arrival
limit 1;
此版本的最佳索引是 item(sender, receiver, arrival)
。它将使用索引在每个子查询中获取(最多)一行。最后的排序(两行)可以忽略不计。
当然,同样的推理也适用于其他查询。
我有以下 PostgreSQL 模式:
CREATE TABLE User (
ID INTEGER PRIMARY KEY
);
CREATE TABLE BOX (
ID INTEGER PRIMARY KEY
);
CREATE SEQUENCE seq_item;
CREATE TABLE Item (
ID INTEGER PRIMARY KEY DEFAULT nextval('seq_item'),
SENDER INTEGER REFERENCES User(id),
RECEIVER INTEGER REFERENCES User(id),
INFO TEXT,
BOX_ID INTEGER REFERENCES Box(id) NOT NULL,
ARRIVAL TIMESTAMP
);
它的主要用例是一个典型的producer/consumer场景。不同的用户可以在数据库中为特定用户在特定的盒子中插入一个项目,每个用户都可以在一个地址为 her/him 的盒子中检索最上面的(这意味着最旧的)项目。它或多或少地模仿了数据库级别队列的功能。
更准确地说,最常见的操作如下:
INSERT INTO ITEM(SENDER, RECEIVER, INFO, BOX_ID, ARRIVAL)
VALUES (nsid, nrid, ncontent, nqid, ntime);
并根据 RECEIVER+SENDER
或 RECEIVER+BOX_ID
:
SELECT * INTO it FROM Item i WHERE (i.RECEIVER=? OR i.RECEIVER is NULL) AND
(i.BOX_ID=?) ORDER BY ARRIVAL LIMIT 1;
DELETE FROM Item i WHERE i.id=it.id;
和
SELECT * INTO it FROM Item i WHERE (i.RECEIVER=? OR i.RECEIVER is NULL) AND
(i.SENDER=?) ORDER BY ARRIVAL LIMIT 1;
DELETE FROM Item i WHERE i.id=it.id;
最后两个片段打包在一个存储过程中。
我考虑过使用两个不同的索引。
1. CREATE INDEX ind ON item(arrival);
。上述 SELECT
的 EXPLAIN
计划如下:
Limit (cost=0.29..2.07 rows=1 width=35)
-> Index Scan using ind on item i (cost=0.29..3010.81 rows=1693 width=35)
Filter: (((receiver = 2) OR (receiver IS NULL)) AND (sender = 2))
据我了解,这种方法的优点是我避免了对数据进行排序。然而,据我了解,我仍然需要扫描整个 table,但访问将是随机的,这会减慢执行速度。我不确定的是,由于LIMIT 1
,一旦找到匹配项,执行是否会立即停止,或者它将始终扫描整个table。
2. CREATE INDEX ind ON item(receiver, sender);
EXPLAIN
:
Limit (cost=512.23..512.23 rows=1 width=35)
-> Sort (cost=512.23..516.46 rows=1693 width=35)
Sort Key: arrival
-> Bitmap Heap Scan on message m (cost=42.37..503.76 rows=1693 width=35)
Recheck Cond: (((receiver = 2) AND (sender = 2)) OR ((receiver IS NULL) AND (sender = 2)))
-> BitmapOr (cost=42.37..42.37 rows=1693 width=0)
-> Bitmap Index Scan on ind (cost=0.00..37.22 rows=1693 width=0)
Index Cond: ((receiver = 2) AND (sender = 2))
-> Bitmap Index Scan on ind (cost=0.00..4.30 rows=1 width=0)
Index Cond: ((receiver IS NULL) AND (sender = 2))
在这种情况下,我可以高效地找到 receiver
和 sender
的匹配项,但之后我需要对结果进行排序,这可能会很慢。
那么这两个选项中哪个更好,为什么?第一个的估计成本要低得多,但第二个似乎要多 "deterministic".
对于此查询:
SELECT * INTO it
FROM Item i
WHERE (i.RECEIVER = ? OR i.RECEIVER is NULL) AND
(i.SENDER = ?)
ORDER BY ARRIVAL
LIMIT 1;
最好的索引可能是 item(sender, arrival, receiver)
,按此顺序。这将按发件人过滤,然后使用索引进行排序,然后再次按接收者过滤。
最快的方法可能是:
select *
from ((select i.*
from item i
where receiver = ? and sender = ?
order by arrival
limit 1
) union all
(select i.*
from item i
where receiver is null and sender = ?
order by arrival
limit 1
)
) i
order by arrival
limit 1;
此版本的最佳索引是 item(sender, receiver, arrival)
。它将使用索引在每个子查询中获取(最多)一行。最后的排序(两行)可以忽略不计。
当然,同样的推理也适用于其他查询。