How to optimize a SQL query that combines INNER JOINs, DISTINCT and WHERE?
SELECT DISTINCT options.id, options.foo_option_id, options.description
FROM vehicles
INNER JOIN vehicle_options ON vehicle_options.vehicle_id = vehicles.id
INNER JOIN options ON options.id = vehicle_options.option_id
INNER JOIN discounted_vehicles ON vehicles.id = discounted_vehicles.vehicle_id
WHERE discounted_vehicles.discount_id = 4;
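The query above returns 2067 rows and runs locally in 1.7 seconds.
I'd like to know whether it is as fast as it can be, or whether I can tune it further, because this dataset will grow quickly over time.
Things I have tried, with no change in speed:
1 - Changing the join order, joining from the smallest to the biggest table.
2 - Adding an index on discounted_vehicles.discount_id.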
1 - Change the join order, joining from the smallest to the biggest table.
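Behind the scenes, PostgreSQL rearranges the tables according to the execution plan chosen by its query optimizer. The order in which you write the joins generally does not matter.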
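One way to see the join order the planner actually picked is to look at the plan itself. A minimal sketch, using the query from the question as-is; note that ANALYZE really executes the statement, so the timings are measured rather than estimated:

```sql
-- Show the plan PostgreSQL chose, with actual row counts, timings and buffer usage.
-- The join order in the output may differ from the order written below.
EXPLAIN (ANALYZE, BUFFERS)
SELECT DISTINCT options.id, options.foo_option_id, options.description
FROM vehicles
INNER JOIN vehicle_options ON vehicle_options.vehicle_id = vehicles.id
INNER JOIN options ON options.id = vehicle_options.option_id
INNER JOIN discounted_vehicles ON vehicles.id = discounted_vehicles.vehicle_id
WHERE discounted_vehicles.discount_id = 4;
```

Running this for both join orders and comparing the two plans should confirm whether they are identical.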
2 - Adding an index to discounted_vehicles.discount_id.
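That depends on the selectivity of the discount_id column. Do you expect the filter to discard 95% of the rows and keep only 5%? As a rule of thumb, if 5% or fewer of the rows remain, the index will help. Otherwise a full table scan is likely to be faster.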
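To put a number on that selectivity, a query along these lines could be used. It is only a sketch: the column aliases are made up for illustration, and the FILTER clause assumes PostgreSQL 9.4 or later.

```sql
-- What share of discounted_vehicles survives the discount_id = 4 filter?
SELECT
    count(*) FILTER (WHERE discount_id = 4) AS matching_rows,  -- rows kept by the filter
    count(*)                                AS total_rows,     -- all rows in the table
    round(100.0 * count(*) FILTER (WHERE discount_id = 4)
          / NULLIF(count(*), 0), 2)         AS pct_kept        -- NULL if the table is empty
FROM discounted_vehicles;
```

A small pct_kept suggests the index on discount_id is worth having; a large one suggests the planner will prefer a sequential scan anyway.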
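In addition, if it doesn't exist yet, I would add an index on:
vehicle_options (vehicle_id)
It may already exist, but note that PostgreSQL does not automatically create an index on the referencing column of a foreign key, so it is worth checking.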
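To check which indexes vehicle_options already has, and to add the missing one, something like the following could work. The index name is a hypothetical choice, and IF NOT EXISTS on CREATE INDEX assumes PostgreSQL 9.5 or later.

```sql
-- List the indexes that currently exist on vehicle_options.
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'vehicle_options';

-- Create the index only if no index with this (hypothetical) name exists yet.
CREATE INDEX IF NOT EXISTS vehicle_options_vehicle_id_idx
    ON vehicle_options (vehicle_id);
```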
Try using GROUP BY instead of DISTINCT:
SELECT
"options"."id",
"options"."foo_option_id",
"options"."description"
FROM
"vehicles"
INNER JOIN "vehicle_options" ON "vehicle_options"."vehicle_id" = "vehicles"."id"
INNER JOIN "options" ON "options"."id" = "vehicle_options"."option_id"
INNER JOIN "discounted_vehicles" ON "vehicles"."id" = "discounted_vehicles"."vehicle_id"
WHERE
"discounted_vehicles"."discount_id" = 4
-- Grouping by "id" alone is accepted only if it is the primary key of "options".
GROUP BY
"options"."id";
However, you need to create the necessary indexes first, and then try running the query below:
SELECT "options"."id", "options"."foo_option_id",
"options"."description"
FROM "vehicles"
INNER JOIN "vehicle_options"
ON "vehicle_options"."vehicle_id" = "vehicles"."id"
INNER JOIN "options"
ON "options"."id" = "vehicle_options"."option_id"
INNER JOIN "discounted_vehicles"
ON "vehicles"."id" = "discounted_vehicles"."vehicle_id"
WHERE "discounted_vehicles"."discount_id" = 4
GROUP BY "options"."id", "options"."foo_option_id",
"options"."description";
The best query depends on missing information.
This should be much faster in a typical setup:
SELECT id, foo_option_id, description
FROM options o
WHERE EXISTS (
SELECT
FROM discounted_vehicles d
JOIN vehicle_options vo USING (vehicle_id)
WHERE d.discount_id = 4
AND vo.option_id = o.id
);
Assuming referential integrity, enforced by FK constraints, we can omit the table vehicles from the query and join directly from discounted_vehicles to vehicle_options.
Also, EXISTS is typically faster if there are many qualifying rows per distinct option.
Ideally, you would have multicolumn indexes on:
discounted_vehicles(discount_id, vehicle_id)
vehicle_options(vehicle_id, option_id)
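With the index columns in this order. You probably have a PK constraint on vehicle_options that provides the second index, but the column order has to match. Related: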
- PostgreSQL composite primary key
- Is a composite index also good for queries on the first field?
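The two multicolumn indexes suggested above could be created roughly like this. The index names are hypothetical, and the second statement is unnecessary if vehicle_options already has a primary key on (vehicle_id, option_id) in exactly that column order.

```sql
-- Supports the filter on discount_id and yields vehicle_id from the index itself.
CREATE INDEX discounted_vehicles_discount_vehicle_idx
    ON discounted_vehicles (discount_id, vehicle_id);

-- Supports the join on vehicle_id and the lookup of option_id.
-- Skip this if a PK on (vehicle_id, option_id) already exists.
CREATE INDEX vehicle_options_vehicle_option_idx
    ON vehicle_options (vehicle_id, option_id);
```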
Depending on the actual data distribution, there may be faster query styles. Related:
- Optimize GROUP BY query to retrieve latest record per user
- Select first row in each GROUP BY group?
Changing the join order is typically futile. Postgres reorders joins in whatever way it expects to be fastest. (Exceptions apply.) Related:
- Sample Query to show Cardinality estimation error in PostgreSQL
- SQL INNER JOIN over multiple tables equal to WHERE syntax
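One of the exceptions involves the join_collapse_limit setting: when it is set to 1, the planner keeps explicit JOINs in the order they were written. The following is only a sketch for experimenting with that, not a recommendation for production use; SET LOCAL confines the change to the enclosing transaction.

```sql
BEGIN;

-- Prevent the planner from reordering explicit JOINs for this transaction only.
SET LOCAL join_collapse_limit = 1;

-- With the limit at 1, the plan should follow the written join order.
EXPLAIN
SELECT DISTINCT options.id, options.foo_option_id, options.description
FROM vehicles
INNER JOIN vehicle_options ON vehicle_options.vehicle_id = vehicles.id
INNER JOIN options ON options.id = vehicle_options.option_id
INNER JOIN discounted_vehicles ON vehicles.id = discounted_vehicles.vehicle_id
WHERE discounted_vehicles.discount_id = 4;

ROLLBACK;
```

Comparing this plan with the one produced under the default setting shows how much reordering the planner normally does for this query.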