Comparing Query Performance: Join vs. Select Distinct From Table

Question

I have two tables, person and city. The person table is linked to the city table through the city_id column in person. The person table has about a million rows and the city table about 10,000 rows.

indexes on person: index1: id, index2: city_id
indexes on city:   index1: id

I need to select all the cities that have no person rows associated with them. The city and person tables look like this (demo data):

CITY                PERSON

id  city            id  name   city_id
-------------       ------------------
1    city-1         1   name-1   1
2    city-2         2   name-2   2
3    city-3         3   name-3   2
4    city-4         4   name-4   3
5    city-5         5   name-5   1
6    city-6         6   name-6   3
7    city-7         7   name-7   4
8    city-8         8   name-8   8

I wrote two queries to get the result:

Query 1:

     select c.id, c.city 
     from city c 
     left join person p on c.id = p.city_id  
     where p.id is null

Query 2:

     select * 
     from city 
     where id not in ( select distinct city_id from person)

The execution plans for the two queries look similar:

For query 1: [execution plan image not available]
For query 2: [execution plan image not available]

Then I used profiling and ran both queries several times to see how long they take:

query1: 0.000729 0.000737 0.000763
query2: 0.000857 0.000840 0.000852

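From the data above, query1 appears to outperform query2.

I am confused, because my understanding was that query2 should outperform query1. The nested query in query2 uses the indexed city_id column, so MySQL can use the city_id index to fetch all the ids, whereas query1 uses a join, which I assumed would take the Cartesian product of the two tables. Is it because I used too little data, i.e. person (1000) and city (200) records?

What am I missing, given that query1 performs better than query2?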

Edit

From the MySQL documentation:

covering index: An index that includes all the columns retrieved by a query. Instead of using 
the index values as pointers to find the full table rows, the query returns values 
from the index structure, saving disk I/O

That was my assumption when I came up with query2.

Answer 1

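The difference in performance is very small. You would really need to run the queries many times to see whether the difference is meaningful. The number of rows is also small. In all likelihood, all the data sits on one or two data pages. So you cannot generalize from your example (even if the results are correct).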
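To illustrate what "run the queries many times" can look like, here is a minimal sketch that times each form repeatedly and compares medians instead of single runs. It is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module with an in-memory SQLite database as a stand-in for MySQL, the index name person_city_id and the helper median_seconds are hypothetical, and the data is synthetic (200 cities and 1000 people, the sizes mentioned in the question). The absolute numbers will not carry over to MySQL.

```python
import sqlite3
import statistics
import time

# Assumption: in-memory SQLite stands in for MySQL; timings are not comparable.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table city (id integer primary key, city text);
    create table person (id integer primary key, name text, city_id integer);
    create index person_city_id on person(city_id);  -- hypothetical index name
""")
# 200 cities and 1000 people, the sizes mentioned in the question.
conn.executemany("insert into city values (?, ?)",
                 [(i, f"city-{i}") for i in range(1, 201)])
# People only live in cities 1..150, so cities 151..200 have nobody.
conn.executemany("insert into person values (?, ?, ?)",
                 [(i, f"name-{i}", (i % 150) + 1) for i in range(1, 1001)])

QUERIES = {
    "left join": """select c.id from city c
                    left join person p on c.id = p.city_id
                    where p.id is null""",
    "not in": """select id from city
                 where id not in (select city_id from person)""",
    "not exists": """select c.id from city c
                     where not exists
                       (select 1 from person p where p.city_id = c.id)""",
}

def median_seconds(sql, runs=200):
    """Run the query `runs` times and return the median wall-clock time."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        conn.execute(sql).fetchall()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

timings = {name: median_seconds(sql) for name, sql in QUERIES.items()}
for name, seconds in timings.items():
    print(f"{name:10s} median {seconds * 1e6:8.1f} us")
```

Comparing medians over a few hundred runs is less sensitive to one-off noise than comparing three individual measurements.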

I would recommend writing this as:

select c.* 
from city c
where not exists (select 1 from person p where p.city_id = c.id);

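For performance, you want an index on person(city_id).

This probably has the same execution plan as the left join. I just find it a clearer statement of intent, and it generally performs well on any database.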
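As a quick check that the three formulations agree on the demo data from the question (which contains no NULL city_id), the sketch below loads that data and runs each query. It assumes an in-memory SQLite database via Python's sqlite3 module rather than MySQL, and the index name person_city_id is made up for the example.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for MySQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table city (id integer primary key, city text);
    create table person (id integer primary key, name text, city_id integer);
    create index person_city_id on person(city_id);  -- hypothetical index name
""")
# Demo data from the question: cities 1..8, eight people.
conn.executemany("insert into city values (?, ?)",
                 [(i, f"city-{i}") for i in range(1, 9)])
person_city_ids = [1, 2, 2, 3, 1, 3, 4, 8]
conn.executemany(
    "insert into person values (?, ?, ?)",
    [(i, f"name-{i}", cid) for i, cid in enumerate(person_city_ids, start=1)])

left_join = conn.execute("""
    select c.id, c.city
    from city c
    left join person p on c.id = p.city_id
    where p.id is null
    order by c.id
""").fetchall()

not_in = conn.execute("""
    select id, city
    from city
    where id not in (select distinct city_id from person)
    order by id
""").fetchall()

not_exists = conn.execute("""
    select c.id, c.city
    from city c
    where not exists (select 1 from person p where p.city_id = c.id)
    order by c.id
""").fetchall()

print(not_exists)  # [(5, 'city-5'), (6, 'city-6'), (7, 'city-7')]
```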

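not in is not exactly equivalent. Here are some reasons:

The select distinct might throw off the optimizer. It is not needed, but some databases might actually execute the distinct.
NULLs are handled differently. If any row in the subquery returns a NULL value, then no rows at all will be returned from the outer query.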
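The NULL behaviour can be reproduced with a tiny example. This is a sketch, assuming an in-memory SQLite database through Python's sqlite3 module instead of MySQL; the three-city, two-person data set is invented for the illustration.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for MySQL; the data is made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table city (id integer primary key, city text);
    create table person (id integer primary key, name text, city_id integer);
""")
conn.executemany("insert into city values (?, ?)",
                 [(1, "city-1"), (2, "city-2"), (3, "city-3")])
# One person lives in city 1, the other has no city (NULL city_id).
conn.executemany("insert into person values (?, ?, ?)",
                 [(1, "name-1", 1), (2, "name-2", None)])

# NOT IN: the NULL in the subquery makes the predicate NULL for every
# city that is not matched, so no row passes the filter.
not_in = conn.execute("""
    select id from city
    where id not in (select city_id from person)
    order by id
""").fetchall()

# NOT EXISTS: the NULL row simply never matches, so cities 2 and 3 come back.
not_exists = conn.execute("""
    select c.id from city c
    where not exists (select 1 from person p where p.city_id = c.id)
    order by c.id
""").fetchall()

# Filtering NULLs out of the subquery makes NOT IN behave like NOT EXISTS.
not_in_filtered = conn.execute("""
    select id from city
    where id not in (select city_id from person where city_id is not null)
    order by id
""").fetchall()

print(not_in)           # []
print(not_exists)       # [(2,), (3,)]
print(not_in_filtered)  # [(2,), (3,)]
```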

Answer 2

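You can remove the distinct inside the NOT IN, because IN() itself takes distinct records into account. In your queries above, the join is somewhat optimized because there is no additional select to retrieve the data for the join. But it still depends.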
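A small sketch of the first point: on the question's demo data, the NOT IN query returns the same rows with or without distinct in the subquery. This only checks the result set, not the execution plan, and it assumes an in-memory SQLite database via Python's sqlite3 module rather than MySQL.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for MySQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table city (id integer primary key, city text);
    create table person (id integer primary key, name text, city_id integer);
""")
# Demo data from the question; city_id values 1, 2 and 3 each appear twice.
conn.executemany("insert into city values (?, ?)",
                 [(i, f"city-{i}") for i in range(1, 9)])
conn.executemany(
    "insert into person values (?, ?, ?)",
    [(i, f"name-{i}", cid)
     for i, cid in enumerate([1, 2, 2, 3, 1, 3, 4, 8], start=1)])

with_distinct = conn.execute("""
    select id from city
    where id not in (select distinct city_id from person)
    order by id
""").fetchall()

without_distinct = conn.execute("""
    select id from city
    where id not in (select city_id from person)
    order by id
""").fetchall()

print(with_distinct == without_distinct)  # True
```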

I would say joins are generally expensive.

