Speeding up a GROUP BY operation on a large table
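I have two large tables, tokens (hundreds of thousands of entries) and buy_orders (millions of entries), that I need to join and group efficiently.
As shown below, tokens are uniquely identified by a contract address (a 20-byte hex string) and an id (a 256-bit integer):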
CREATE TABLE tokens (
  contract TEXT NOT NULL,
  token_id NUMERIC(78, 0) NOT NULL,
  top_bid NUMERIC(78, 0),
  PRIMARY KEY (contract, token_id)
);
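As a quick sanity check on the column type: the largest unsigned 256-bit value, 2^256 - 1, has exactly 78 decimal digits, which is why NUMERIC(78, 0) is wide enough. A minimal sketch, not part of the original schema (the column alias is only illustrative):

```sql
-- 2^256 - 1, the largest unsigned 256-bit integer (78 decimal digits).
SELECT 115792089237316195423570985008687907853269984665640564039457584007913129639935::numeric(78, 0)
       AS max_token_id;

-- A 79-digit value no longer fits and raises "numeric field overflow":
-- SELECT 1157920892373161954235709850086879078532699846656405640394575840079131296399350::numeric(78, 0);
```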
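Users can post bids on various tokens. A bid has a validity period (expressed as a time range) and a price (a 256-bit integer). A bid can only be one of the following two types:
- Type 1: a single contract and a range of token_ids (e.g. contract + start_token_id + end_token_id)
- Type 2: multiple contracts and multiple token_ids (e.g. [(contract1 + token_id1), (contract2 + token_id2), ...])
Below is the table that holds the bids. It is highly denormalized in order to accommodate the two possible bid types.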
CREATE TABLE buy_orders (
  id INT NOT NULL PRIMARY KEY,
  contract TEXT,
  start_token_id NUMERIC(78, 0),
  end_token_id NUMERIC(78, 0),
  token_list_id INT REFERENCES token_lists(id),
  price NUMERIC(78, 0) NOT NULL,
  valid_between TSTZRANGE NOT NULL,
  cancelled BOOLEAN NOT NULL,
  filled BOOLEAN NOT NULL
);
CREATE INDEX ON buy_orders (contract, start_token_id, end_token_id DESC);
CREATE INDEX ON buy_orders (token_list_id);
CREATE INDEX ON buy_orders (price);
CREATE INDEX ON buy_orders (cancelled, filled);
CREATE INDEX ON buy_orders USING gist (valid_between);
Here are the corresponding tables that hold the tokens belonging to each list:
CREATE TABLE token_lists (
  id INT PRIMARY KEY
);
CREATE TABLE token_lists_tokens (
  token_list_id INT NOT NULL REFERENCES token_lists(id),
  contract TEXT NOT NULL,
  token_id NUMERIC(78, 0) NOT NULL,
  FOREIGN KEY (contract, token_id) REFERENCES tokens(contract, token_id)
);
CREATE INDEX ON token_lists_tokens (contract, token_id);
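Since one buy_orders row has to represent either bid type, the "exactly one of the two shapes" rule can be written down as a constraint. This is only a sketch under an assumption the schema above does not state: that a type 1 bid leaves token_list_id NULL and a type 2 bid leaves contract, start_token_id and end_token_id NULL. The constraint name is hypothetical.

```sql
-- Assumption: unused columns are NULL for the other bid type.
ALTER TABLE buy_orders
  ADD CONSTRAINT buy_orders_exactly_one_type CHECK (
    -- Type 1: single contract plus an inclusive token_id range.
    (token_list_id IS NULL
     AND contract IS NOT NULL
     AND start_token_id IS NOT NULL
     AND end_token_id IS NOT NULL
     AND start_token_id <= end_token_id)
    OR
    -- Type 2: an explicit token list and no range columns.
    (token_list_id IS NOT NULL
     AND contract IS NULL
     AND start_token_id IS NULL
     AND end_token_id IS NULL)
  );
```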
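As you can see, the tokens table keeps track of the top bid so that token data can be retrieved as efficiently as possible (we'll have a paginated API for retrieving all tokens of an address, including their current top bid). As new bids come in, get cancelled/filled, or expire, I need an efficient way to update the top bid of the tokens each bid targets. This is not a problem for type 2 bids, since in most cases they reference an insignificant number of tokens. It is a problem for type 1 bids, though, because in that case I might need to efficiently recompute the top bid for hundreds of thousands of tokens (e.g. a type 1 bid could have a range of [1, 100000]). This is the query I'm using right now (I limited the results, otherwise it takes forever):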
SELECT t.contract, t.token_id, max(b.price) FROM tokens t
JOIN buy_orders b ON t.contract = b.contract AND b.start_token_id <= t.token_id AND t.token_id <= b.end_token_id
WHERE t.contract = 'foo' AND NOT b.cancelled AND NOT b.filled AND b.valid_between @> now()
GROUP BY t.contract, t.token_id
LIMIT 1000
Here is its execution plan:
Limit  (cost=5016.77..506906.79 rows=1000 width=81) (actual time=378.231..19260.361 rows=1000 loops=1)
  ->  GroupAggregate  (cost=5016.77..37281894.72 rows=74273 width=81) (actual time=123.729..19005.567 rows=1000 loops=1)
        Group Key: t.contract, t.token_id
        ->  Nested Loop  (cost=5016.77..35589267.24 rows=225584633 width=54) (actual time=83.885..18953.853 rows=412253 loops=1)
              Join Filter: ((b.start_token_id <= t.token_id) AND (t.token_id <= b.end_token_id))
              Rows Removed by Join Filter: 140977658
              ->  Index Only Scan using tokens_pk on tokens t  (cost=0.55..8186.80 rows=99100 width=49) (actual time=0.030..5.394 rows=11450 loops=1)
                    Index Cond: (contract = 'foo'::text)
                    Heap Fetches: 0
              ->  Materialize  (cost=5016.21..51551.91 rows=20487 width=60) (actual time=0.001..0.432 rows=12348 loops=11450)
                    ->  Bitmap Heap Scan on buy_orders b  (cost=5016.21..51449.47 rows=20487 width=60) (actual time=15.245..116.099 rows=12349 loops=1)
                          Recheck Cond: (contract = 'foo'::text)
                          Filter: ((NOT cancelled) AND (NOT filled) AND (valid_between @> now()))
                          Rows Removed by Filter: 87771
                          Heap Blocks: exact=33525
                          ->  Bitmap Index Scan on buy_orders_contract_start_token_id_end_token_id_index  (cost=0.00..5011.09 rows=108072 width=0) (actual time=10.835..10.835 rows=100120 loops=1)
                                Index Cond: (contract = 'foo'::text)
Planning Time: 0.816 ms
JIT:
  Functions: 15
  Options: Inlining true, Optimization true, Expressions true, Deforming true
  Timing: Generation 3.922 ms, Inlining 106.877 ms, Optimization 99.947 ms, Emission 47.445 ms, Total 258.190 ms
Execution Time: 19264.851 ms
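What I'm looking for is a way to make this particular query more efficient (if possible), or other suggestions for achieving the same result.
I'm using Postgres 13.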
A partial multicolumn index might help, for example:
CREATE INDEX ON buy_orders (contract, valid_between) -- Multiple fields
INCLUDE (price) -- non-key column for index only scan
WHERE -- represents partial index
NOT cancelled AND
NOT filled;
This should let the index scan on buy_orders discard more rows up front, so that you don't end up with
Rows Removed by Join Filter: 140977658
which is what makes your query expensive.
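One caveat worth checking: a B-tree index on a range column only supports the ordering comparisons (<, =, > and so on), not the containment operator @>, so the valid_between @> now() condition may still end up as a filter rather than an index condition. A hedged alternative is a GiST index over both columns. The sketch below assumes the btree_gist extension can be installed (it provides GiST operator classes for scalar types such as text); the index name is hypothetical, and comparing EXPLAIN (ANALYZE, BUFFERS) output before and after is the only reliable way to tell whether it helps on your data.

```sql
-- Assumption: btree_gist is available on this server.
CREATE EXTENSION IF NOT EXISTS btree_gist;

-- Hypothetical index name. GiST can serve both contract = ... and valid_between @> ...
CREATE INDEX buy_orders_open_contract_validity_gist
    ON buy_orders USING gist (contract, valid_between)
    INCLUDE (price)                      -- non-key column, as in the B-tree variant
    WHERE NOT cancelled AND NOT filled;  -- partial index: open bids only

-- Check whether the planner uses the index for the buy_orders side of the join.
EXPLAIN (ANALYZE, BUFFERS)
SELECT max(b.price)
FROM buy_orders b
WHERE b.contract = 'foo'
  AND NOT b.cancelled
  AND NOT b.filled
  AND b.valid_between @> now();
```

Neither index changes the range comparison on token_id in the join itself, so it is worth re-running the full query afterwards to see how much of the Rows Removed by Join Filter count actually goes away.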