Best way to perform select with where condition on joined tables in PostgreSQL
I'm currently building search for my website, and I'm struggling to get reasonable performance out of a PostgreSQL query that looks simple on the surface.
Say I have two tables:
orders (id, order_name, buyer, status) with 10M rows
products_purchased (id, order_id, product_reference) with 20M rows
Now let's say I want to find the first 20 orders that contain a specific product.
SELECT
orders.id
FROM
orders
INNER JOIN products_purchased ON products_purchased.order_id = orders.id
WHERE products_purchased.product_reference = 7
ORDER BY orders.id ASC
LIMIT 20
This query takes anywhere from about 5 to 120 seconds.
Yet I have all the appropriate indexes in place.
The ORDER BY clause seems to be the cause.
The problem gets worse as I add more products. For example, say I add a new product_reference after a year. If I then search for its first 20 orders, the query can take even longer, because the whole table has to be scanned to find the first 20 matching orders.
What is the best practice for performing this kind of search on a large dataset?
Thanks a lot for your help!
--- Additional data ---
I have indexes where needed (see the sketch after this list):
orders.id
products_purchased.order_id
products_purchased.product_reference
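Index definitions along these lines would match the list above; the products_purchased_order_id_idx name appears in the plan below, while the other name and options are illustrative only:
-- orders.id is covered by the primary key (orders_pkey in the plan below)
create index products_purchased_order_id_idx
    on products_purchased using btree (order_id);
-- the name below is illustrative; the actual index name is not shown in the question
create index products_purchased_product_reference_idx
    on products_purchased using btree (product_reference);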
The actual database sizes are:
orders: 16M
products_purchased: 20M
Selecting all orders with product_reference = 2000 takes 120 seconds, even though the products_purchased table only contains 46,000 rows with product_reference = 2000.
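A JSON plan of the kind shown below is typically captured with EXPLAIN; the exact invocation I used was along these lines:
EXPLAIN (ANALYZE, FORMAT JSON)
SELECT orders.id
FROM orders
INNER JOIN products_purchased ON products_purchased.order_id = orders.id
WHERE products_purchased.product_reference = 2000
ORDER BY orders.id ASC
LIMIT 20;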
The execution plan is as follows:
[
{
"Plan": {
"Node Type": "Limit",
"Parallel Aware": false,
"Startup Cost": 0.87,
"Total Cost": 9846.45,
"Plan Rows": 20,
"Plan Width": 4,
"Actual Startup Time": 59750.428,
"Actual Total Time": 77196.124,
"Actual Rows": 20,
"Actual Loops": 1,
"Plans": [
{
"Node Type": "Nested Loop",
"Parent Relationship": "Outer",
"Parallel Aware": false,
"Join Type": "Inner",
"Startup Cost": 0.87,
"Total Cost": 18802091.94,
"Plan Rows": 38194,
"Plan Width": 4,
"Actual Startup Time": 59750.426,
"Actual Total Time": 77196.101,
"Actual Rows": 20,
"Actual Loops": 1,
"Inner Unique": true,
"Plans": [
{
"Node Type": "Index Scan",
"Parent Relationship": "Outer",
"Parallel Aware": false,
"Scan Direction": "Forward",
"Index Name": "products_purchased_order_id_idx",
"Relation Name": "products_purchased",
"Alias": "products",
"Startup Cost": 0.44,
"Total Cost": 18746328.16,
"Plan Rows": 38194,
"Plan Width": 4,
"Actual Startup Time": 59746.776,
"Actual Total Time": 77171.904,
"Actual Rows": 20,
"Actual Loops": 1,
"Filter": "(product_reference = 2000)",
"Rows Removed by Filter": 514614
},
{
"Node Type": "Index Only Scan",
"Parent Relationship": "Inner",
"Parallel Aware": false,
"Scan Direction": "Forward",
"Index Name": "orders_pkey",
"Relation Name": "orders",
"Alias": "orders",
"Startup Cost": 0.43,
"Total Cost": 1.46,
"Plan Rows": 1,
"Plan Width": 4,
"Actual Startup Time": 1.197,
"Actual Total Time": 1.197,
"Actual Rows": 1,
"Actual Loops": 20,
"Index Cond": "(id = products_purchased.order_id)",
"Rows Removed by Index Recheck": 0,
"Heap Fetches": 10
}
]
}
]
},
"Planning Time": 7.893,
"Triggers": [
],
"Execution Time": 77196.878
}
]
--- Answer ---
I tried to reproduce your problem, but I could not get the high times you report. I suspect my data distribution doesn't match yours, but the difference is still far too large.
My setup:
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
create table orders
(
id serial not null,
order_name text not null,
constraint orders_pkey primary key (id)
);
create table products_purchased
(
id serial not null,
order_id integer not null,
product_reference integer not null,
constraint products_purchased_fkey_order foreign key (order_id) references orders (id)
);
alter sequence orders_id_seq cache 100000;
alter sequence products_purchased_id_seq cache 100000;
insert into orders(order_name)
select uuid_generate_v4()
from generate_series(1, 16000000);
insert into products_purchased(order_id, product_reference)
select random() * 15999999 + 1, random() * 10000 + 1
from generate_series(1, 20000000);
alter sequence orders_id_seq cache 1;
alter sequence products_purchased_id_seq cache 1;
create index products_purchased_order_id on products_purchased using btree (order_id);
create index products_purchased_product_ref on products_purchased using btree (product_reference);
vacuum analyse;
Also, given that you want the first N orders containing a specific product, you need to select DISTINCT order ids, otherwise you may get duplicate orders.
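A minimal illustration of the duplicate issue, assuming an order can list the same product_reference more than once (order id 42 and the repeated rows are hypothetical):
-- hypothetical data: give order 42 the same product reference twice
INSERT INTO products_purchased(order_id, product_reference)
VALUES (42, 2000), (42, 2000);
-- the plain join now returns id 42 twice, while SELECT DISTINCT returns it once
SELECT o.id
FROM orders o
INNER JOIN products_purchased p ON p.order_id = o.id
WHERE p.product_reference = 2000 AND o.id = 42;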
Baseline (your query, with the DISTINCT added):
SELECT DISTINCT o.id
FROM orders o
INNER JOIN products_purchased p ON p.order_id = o.id
WHERE p.product_reference = 2000
ORDER BY o.id ASC
LIMIT 20;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=1000.90..9677.24 rows=20 width=4) (actual time=184.548..424.877 rows=20 loops=1)
-> Unique (cost=1000.90..866032.55 rows=1994 width=4) (actual time=184.547..424.868 rows=20 loops=1)
-> Nested Loop (cost=1000.90..866027.56 rows=1994 width=4) (actual time=184.546..424.837 rows=20 loops=1)
-> Gather Merge (cost=1000.46..857325.28 rows=1994 width=4) (actual time=184.491..463.432 rows=20 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Index Scan using products_purchased_order_id on products_purchased p (cost=0.44..856095.10 rows=831 width=4) (actual time=70.818..334.005 rows=8 loops=3)
Filter: (product_reference = 2000)
Rows Removed by Filter: 65639
-> Index Only Scan using orders_pkey on orders o (cost=0.43..4.36 rows=1 width=4) (actual time=0.018..0.018 rows=1 loops=20)
Index Cond: (id = p.order_id)
Heap Fetches: 0
Planning Time: 0.408 ms
Execution Time: 463.962 ms
(14 rows)
Assuming you cannot modify the indexes, you can use a semi-join instead, which is at least 10x faster on my database:
SELECT o.id
FROM orders o
WHERE exists(SELECT 1 FROM products_purchased p WHERE p.product_reference = 2000 AND p.order_id = o.id)
ORDER BY o.id ASC
LIMIT 20;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=8288.94..11689.62 rows=20 width=4) (actual time=23.580..39.330 rows=20 loops=1)
-> Gather Merge (cost=8288.94..347337.07 rows=1994 width=4) (actual time=23.579..43.328 rows=20 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Merge Semi Join (cost=7288.91..346106.89 rows=831 width=4) (actual time=13.660..28.522 rows=8 loops=3)
Merge Cond: (o.id = p.order_id)
-> Parallel Index Only Scan using orders_pkey on orders o (cost=0.43..322155.39 rows=6666680 width=4) (actual time=0.071..10.471 rows=52366 loops=3)
Heap Fetches: 0
-> Sort (cost=7276.79..7281.77 rows=1994 width=4) (actual time=12.096..12.103 rows=23 loops=3)
Sort Key: p.order_id
Sort Method: quicksort Memory: 194kB
Worker 0: Sort Method: quicksort Memory: 194kB
Worker 1: Sort Method: quicksort Memory: 194kB
-> Bitmap Heap Scan on products_purchased p (cost=40.02..7167.50 rows=1994 width=4) (actual time=1.618..10.885 rows=2074 loops=3)
Recheck Cond: (product_reference = 2000)
Heap Blocks: exact=2053
-> Bitmap Index Scan on products_purchased_product_ref (cost=0.00..39.52 rows=1994 width=0) (actual time=1.100..1.100 rows=2074 loops=3)
Index Cond: (product_reference = 2000)
Planning Time: 0.759 ms
Execution Time: 43.426 ms
(20 rows)
Time: 44.853 ms
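For completeness, the semi-join can also be written with IN; PostgreSQL normally plans this form as a semi-join as well, although I did not benchmark this variant here:
SELECT o.id
FROM orders o
WHERE o.id IN (SELECT p.order_id
               FROM products_purchased p
               WHERE p.product_reference = 2000)
ORDER BY o.id ASC
LIMIT 20;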
But if you can modify the indexes, you can create a better-suited composite index:
create index products_purchased_idx on products_purchased using btree(product_reference, order_id);
Then the query above runs even faster than the semi-join, since within a given product_reference the composite index yields the order_id values already sorted and the query can be answered with index-only scans:
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.87..88.79 rows=20 width=4) (actual time=0.109..0.291 rows=20 loops=1)
-> Unique (cost=0.87..8766.60 rows=1994 width=4) (actual time=0.107..0.284 rows=20 loops=1)
-> Nested Loop (cost=0.87..8761.62 rows=1994 width=4) (actual time=0.105..0.266 rows=20 loops=1)
-> Index Only Scan using products_purchased_idx on products_purchased p (cost=0.44..59.33 rows=1994 width=4) (actual time=0.089..0.106 rows=20 loops=1)
Index Cond: (product_reference = 2000)
Heap Fetches: 0
-> Index Only Scan using orders_pkey on orders o (cost=0.43..4.36 rows=1 width=4) (actual time=0.007..0.007 rows=1 loops=20)
Index Cond: (id = p.order_id)
Heap Fetches: 0
Planning Time: 0.536 ms
Execution Time: 0.365 ms
(11 rows)
Time: 1.368 ms
An execution time of ~0.4 ms versus ~464 ms with the old indexes is roughly a 1100x speedup :)
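A possible follow-up (an assumption about your workload, not something measured above): with the composite (product_reference, order_id) index in place, the single-column index on product_reference becomes largely redundant, so it could be dropped to save write overhead and disk space, provided no other query depends on it:
-- only if nothing else relies on the single-column index
drop index if exists products_purchased_product_ref;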