Best way to perform select with where condition on joined tables in PostgreSQL

I am currently building search for my website, and I am struggling to get reasonable performance out of a PostgreSQL query that looks simple on the surface.

Let's say I have two tables:

orders (id, order_name, buyer, status) with 10M rows

products_purchased (id, order_id, product_reference) with 20M rows

Now let's say I want to find the first 20 orders containing a specific product.

SELECT
  orders.id
FROM
  orders
INNER JOIN products_purchased ON products_purchased.order_id = orders.id
WHERE products_purchased.product_reference = 7
ORDER BY orders.id ASC
LIMIT 20

This query takes anywhere from about 5 to 120 seconds, even though I have all the appropriate indexes.

It seems to be caused by the ORDER BY clause: to return the lowest 20 order ids, PostgreSQL walks the order_id index of products_purchased in id order and filters every row on product_reference, so it may discard an enormous number of rows before it has found 20 matches (the plan below shows more than 500,000 rows removed by the filter).

The problem gets worse as I add more products. For example, say I add a new product_reference after one year of data; if I then search for the first 20 orders containing it, the query can take much longer, because almost the whole table has to be scanned before the first 20 matching orders are found.

What is the best practice for performing this kind of search on a large dataset?

Thanks a lot for your help!

--- Additional data ---

I have indexes everywhere they are needed:

The actual database size is:

Selecting orders with product_reference = 2000 takes 120 seconds, even though there are only 46,000 rows with product_reference = 2000 in the products_purchased table.
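
For reference, the plan below is in JSON format; a command along these lines produces it (the exact EXPLAIN options used here are an assumption):

-- Roughly how the JSON plan below can be captured (exact options are an assumption):
EXPLAIN (ANALYZE, FORMAT JSON)
SELECT orders.id
FROM orders
INNER JOIN products_purchased ON products_purchased.order_id = orders.id
WHERE products_purchased.product_reference = 2000
ORDER BY orders.id ASC
LIMIT 20;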

The execution plan is as follows:

[
  {
    "Plan": {
      "Node Type": "Limit",
      "Parallel Aware": false,
      "Startup Cost": 0.87,
      "Total Cost": 9846.45,
      "Plan Rows": 20,
      "Plan Width": 4,
      "Actual Startup Time": 59750.428,
      "Actual Total Time": 77196.124,
      "Actual Rows": 20,
      "Actual Loops": 1,
      "Plans": [
        {
          "Node Type": "Nested Loop",
          "Parent Relationship": "Outer",
          "Parallel Aware": false,
          "Join Type": "Inner",
          "Startup Cost": 0.87,
          "Total Cost": 18802091.94,
          "Plan Rows": 38194,
          "Plan Width": 4,
          "Actual Startup Time": 59750.426,
          "Actual Total Time": 77196.101,
          "Actual Rows": 20,
          "Actual Loops": 1,
          "Inner Unique": true,
          "Plans": [
            {
              "Node Type": "Index Scan",
              "Parent Relationship": "Outer",
              "Parallel Aware": false,
              "Scan Direction": "Forward",
              "Index Name": "products_purchased_order_id_idx",
              "Relation Name": "products_purchased",
              "Alias": "products",
              "Startup Cost": 0.44,
              "Total Cost": 18746328.16,
              "Plan Rows": 38194,
              "Plan Width": 4,
              "Actual Startup Time": 59746.776,
              "Actual Total Time": 77171.904,
              "Actual Rows": 20,
              "Actual Loops": 1,
              "Filter": "(product_reference = 2000)",
              "Rows Removed by Filter": 514614
            },
            {
              "Node Type": "Index Only Scan",
              "Parent Relationship": "Inner",
              "Parallel Aware": false,
              "Scan Direction": "Forward",
              "Index Name": "orders_pkey",
              "Relation Name": "orders",
              "Alias": "orders",
              "Startup Cost": 0.43,
              "Total Cost": 1.46,
              "Plan Rows": 1,
              "Plan Width": 4,
              "Actual Startup Time": 1.197,
              "Actual Total Time": 1.197,
              "Actual Rows": 1,
              "Actual Loops": 20,
              "Index Cond": "(id = products_purchased.order_id)",
              "Rows Removed by Index Recheck": 0,
              "Heap Fetches": 10
            }
          ]
        }
      ]
    },
    "Planning Time": 7.893,
    "Triggers": [
    ],
    "Execution Time": 77196.878
  }
]

I have tried to reproduce your problem, but I cannot get timings as high as the ones you report. I suppose my data distribution does not match yours, but the difference is still too large.

My setup:

CREATE EXTENSION IF NOT EXISTS "uuid-ossp";


create table orders
(
    id         serial not null,
    order_name text   not null,

    constraint orders_pkey primary key (id)
);

create table products_purchased
(
    id                serial  not null,
    order_id          integer not null,
    product_reference integer not null,


    constraint products_purchased_fkey_order foreign key (order_id) references orders (id)
);

alter sequence orders_id_seq cache 100000;
alter sequence products_purchased_id_seq cache 100000;

insert into orders(order_name)
select uuid_generate_v4()
from generate_series(1, 16000000);

insert into products_purchased(order_id, product_reference)
select random() * 15999999 + 1, random() * 10000 + 1
from generate_series(1, 20000000);

alter sequence orders_id_seq cache 1;
alter sequence products_purchased_id_seq cache 1;

create index products_purchased_order_id on products_purchased using btree (order_id);
create index products_purchased_product_ref on products_purchased using btree (product_reference);

vacuum analyse;
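
With 20 million purchases spread over roughly 10,000 product references, each reference ends up with about 2,000 rows, which matches the row counts you will see in the plans below; a quick sanity check:

-- Each product_reference should have roughly 20,000,000 / 10,000 = 2,000 rows;
-- the plans below estimate ~1,994 and find ~2,074 for reference 2000.
select count(*)
from products_purchased
where product_reference = 2000;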

Also, given that you want the first N orders containing a specific product, you need to SELECT DISTINCT order ids, otherwise you may get duplicate orders.
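
For example (hypothetical rows, not part of the generated data): if the same product appears twice on one order, the plain join returns that order id twice, while the DISTINCT version returns it once.

-- Hypothetical rows: order 42 now contains product 2000 twice,
-- so the plain INNER JOIN would emit order id 42 two times.
insert into products_purchased(order_id, product_reference)
values (42, 2000),
       (42, 2000);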

Baseline (your query):

SELECT DISTINCT o.id
FROM orders o
         INNER JOIN products_purchased p ON p.order_id = o.id
WHERE p.product_reference = 2000
ORDER BY o.id ASC
LIMIT 20;
                                                                                   QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1000.90..9677.24 rows=20 width=4) (actual time=184.548..424.877 rows=20 loops=1)
   ->  Unique  (cost=1000.90..866032.55 rows=1994 width=4) (actual time=184.547..424.868 rows=20 loops=1)
         ->  Nested Loop  (cost=1000.90..866027.56 rows=1994 width=4) (actual time=184.546..424.837 rows=20 loops=1)
               ->  Gather Merge  (cost=1000.46..857325.28 rows=1994 width=4) (actual time=184.491..463.432 rows=20 loops=1)
                     Workers Planned: 2
                     Workers Launched: 2
                     ->  Parallel Index Scan using products_purchased_order_id on products_purchased p  (cost=0.44..856095.10 rows=831 width=4) (actual time=70.818..334.005 rows=8 loops=3)
                           Filter: (product_reference = 2000)
                           Rows Removed by Filter: 65639
               ->  Index Only Scan using orders_pkey on orders o  (cost=0.43..4.36 rows=1 width=4) (actual time=0.018..0.018 rows=1 loops=20)
                     Index Cond: (id = p.order_id)
                     Heap Fetches: 0
 Planning Time: 0.408 ms
 Execution Time: 463.962 ms
(14 rows)

Let's say you cannot modify the indexes.

Then you can use a semi-join, which is at least 10 times faster, at least on my database:

SELECT o.id
FROM orders o
WHERE exists(SELECT 1 FROM products_purchased p WHERE p.product_reference = 2000 AND p.order_id = o.id)
ORDER BY o.id ASC
LIMIT 20;
                                                                              QUERY PLAN                                                                              
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=8288.94..11689.62 rows=20 width=4) (actual time=23.580..39.330 rows=20 loops=1)
   ->  Gather Merge  (cost=8288.94..347337.07 rows=1994 width=4) (actual time=23.579..43.328 rows=20 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Merge Semi Join  (cost=7288.91..346106.89 rows=831 width=4) (actual time=13.660..28.522 rows=8 loops=3)
               Merge Cond: (o.id = p.order_id)
               ->  Parallel Index Only Scan using orders_pkey on orders o  (cost=0.43..322155.39 rows=6666680 width=4) (actual time=0.071..10.471 rows=52366 loops=3)
                     Heap Fetches: 0
               ->  Sort  (cost=7276.79..7281.77 rows=1994 width=4) (actual time=12.096..12.103 rows=23 loops=3)
                     Sort Key: p.order_id
                     Sort Method: quicksort  Memory: 194kB
                     Worker 0:  Sort Method: quicksort  Memory: 194kB
                     Worker 1:  Sort Method: quicksort  Memory: 194kB
                     ->  Bitmap Heap Scan on products_purchased p  (cost=40.02..7167.50 rows=1994 width=4) (actual time=1.618..10.885 rows=2074 loops=3)
                           Recheck Cond: (product_reference = 2000)
                           Heap Blocks: exact=2053
                           ->  Bitmap Index Scan on products_purchased_product_ref  (cost=0.00..39.52 rows=1994 width=0) (actual time=1.100..1.100 rows=2074 loops=3)
                                 Index Cond: (product_reference = 2000)
 Planning Time: 0.759 ms
 Execution Time: 43.426 ms
(20 rows)

Time: 44.853 ms

But if you can modify the indexes, you can create a better-suited one:

create index products_purchased_idx on products_purchased using btree(product_reference, order_id);

Then your original query (with DISTINCT) runs even faster than the semi-join:

                                                                               QUERY PLAN                                                                                
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.87..88.79 rows=20 width=4) (actual time=0.109..0.291 rows=20 loops=1)
   ->  Unique  (cost=0.87..8766.60 rows=1994 width=4) (actual time=0.107..0.284 rows=20 loops=1)
         ->  Nested Loop  (cost=0.87..8761.62 rows=1994 width=4) (actual time=0.105..0.266 rows=20 loops=1)
               ->  Index Only Scan using products_purchased_idx on products_purchased p  (cost=0.44..59.33 rows=1994 width=4) (actual time=0.089..0.106 rows=20 loops=1)
                     Index Cond: (product_reference = 2000)
                     Heap Fetches: 0
               ->  Index Only Scan using orders_pkey on orders o  (cost=0.43..4.36 rows=1 width=4) (actual time=0.007..0.007 rows=1 loops=20)
                     Index Cond: (id = p.order_id)
                     Heap Fetches: 0
 Planning Time: 0.536 ms
 Execution Time: 0.365 ms
(11 rows)

Time: 1.368 ms

An execution time of ~0.4 ms, compared to ~464 ms with the old indexes, is a ~1100x speedup :)
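
One small follow-up on the index design: a btree on (product_reference, order_id) also serves lookups on product_reference alone, so the single-column products_purchased_product_ref index is probably redundant once the composite index exists. Assuming no other queries depend on it, it could be dropped:

-- Optional cleanup, assuming nothing else relies on the single-column index:
drop index if exists products_purchased_product_ref;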