Which query should be used? Deducing from MySQL Explain

Question

The "Explaining MySQL EXPLAIN" chapter of the O'Reilly book Optimizing SQL Statements ends with this question.

The following is an example of a business need that retrieves orphaned parent records in a parent/child relationship. This SQL query can be written in three different ways. While the output produces the same results, the QEP shows three different paths.

mysql> EXPLAIN SELECT p.*
    -> FROM parent p
    -> WHERE p.id NOT IN (SELECT c.parent_id FROM child c)\G
*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: p
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 160
        Extra: Using where
*************************** 2. row ***************************
           id: 2
  select_type: DEPENDENT SUBQUERY
        table: c
         type: index_subquery
possible_keys: parent_id
          key: parent_id
      key_len: 4
          ref: func
         rows: 1
        Extra: Using index
2 rows in set (0.00 sec)



mysql> EXPLAIN SELECT p.*
    -> FROM parent p
    -> LEFT JOIN child c ON p.id = c.parent_id
    -> WHERE c.child_id IS NULL\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: p
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 160
        Extra:
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: c
         type: ref
possible_keys: parent_id
          key: parent_id
      key_len: 4
          ref: test.p.id
         rows: 1
        Extra: Using where; Using index; Not exists
2 rows in set (0.00 sec)



mysql> EXPLAIN SELECT p.*
    -> FROM parent p
    -> WHERE NOT EXISTS
    -> (SELECT parent_id FROM child c WHERE c.parent_id = p.id)\G
*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: p
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 160
        Extra: Using where
*************************** 2. row ***************************
           id: 2
  select_type: DEPENDENT SUBQUERY
        table: c
         type: ref
possible_keys: parent_id
          key: parent_id
      key_len: 4
          ref: test.p.id
         rows: 1
        Extra: Using index
2 rows in set (0.00 sec)

Which is best? Will data growth over time cause a different QEP to perform better?

As far as I can tell, there is no answer to this in the book or online.

Answer 1

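There is an old article from 2009 that I have seen linked many times on Stack Overflow. The test there shows that the NOT EXISTS query is 27% (actually 26%) slower than the other two queries (LEFT JOIN and NOT IN).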

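However, the optimizer has been improved from version to version. A perfect optimizer would create the same execution plan for all three queries. But as long as the optimizer is not perfect, the answer to "Which query is faster?" may depend on the actual setup (version, settings, and data).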

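I have run similar tests in the past, and all I remember is that LEFT JOIN was never significantly slower than any other method. But out of curiosity, I just created a new test on MariaDB 10.3.13 (portable Windows version) with default settings.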

Dummy data:

set @parents = 1000;

drop table if exists parent;
create table parent(
    parent_id mediumint unsigned primary key
);
insert into parent(parent_id)
    select seq
    from seq_1_to_1000000
    where seq <= @parents
;

drop table if exists child;
create table child(
    child_id mediumint unsigned primary key,
    parent_id mediumint unsigned not null,
    index (parent_id)
);
insert into child(child_id, parent_id)
    select seq as child_id
    , floor(rand(1)*@parents)+1 as parent_id
    from seq_1_to_1000000
;

NOT IN:

set @start = TIME(SYSDATE(6));

select count(*) into @cnt
from parent p
where p.parent_id not in (select parent_id from child c);

select @cnt, TIMEDIFF(TIME(SYSDATE(6)), @start);

LEFT JOIN:

set @start = TIME(SYSDATE(6));

select count(*) into @cnt
from parent p
left join child c on c.parent_id = p.parent_id
where c.parent_id is null;

select @cnt, TIMEDIFF(TIME(SYSDATE(6)), @start);

NOT EXISTS:

set @start = TIME(SYSDATE(6));

select count(*) into @cnt
from parent p
where not exists (
    select *
    from child c
    where c.parent_id = p.parent_id
);

select @cnt, TIMEDIFF(TIME(SYSDATE(6)), @start);

Execution time in milliseconds:

@parents   | 1000 | 10000 | 100000 | 1000000
-----------|------|-------|--------|--------
NOT IN     |   21 |    38 |    175 |    4459
LEFT JOIN  |   24 |    40 |    183 |    1508
NOT EXISTS |   26 |    44 |    180 |    4463

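I executed each query multiple times and took the lowest time. SYSDATE is probably not the best way to measure execution time, so don't treat these numbers as exact. Still, we can see that up to 100K parent rows there is not much difference, with the NOT IN method being slightly faster. But with 1M parent rows, LEFT JOIN is about three times as fast.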

Conclusion

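So what is the answer? I could just say: "LEFT JOIN" wins. But the truth is that this test proves nothing. The answer is (as so often): "It depends". When performance matters, the best you can do is run your own tests with real queries against real data. If you don't have real data (yet), you should create dummy data with the volume and distribution you expect to have in the future.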
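As a rough illustration of the "run your own tests" advice, here is a minimal timing harness in Python. It is only a sketch under loud assumptions: it uses an in-memory SQLite database as a stand-in (not MySQL or MariaDB, so its timings say nothing about those servers), the helper names `build` and `benchmark` are hypothetical, and the table sizes are arbitrary. It follows the same approach as the test above: generate dummy data, run each variant several times, and keep the lowest time.

```python
import random
import sqlite3
import time

# The three query variants from the question, adapted to count orphaned parents.
QUERIES = {
    "NOT IN": """
        SELECT COUNT(*) FROM parent p
        WHERE p.parent_id NOT IN (SELECT c.parent_id FROM child c)""",
    "LEFT JOIN": """
        SELECT COUNT(*) FROM parent p
        LEFT JOIN child c ON c.parent_id = p.parent_id
        WHERE c.parent_id IS NULL""",
    "NOT EXISTS": """
        SELECT COUNT(*) FROM parent p
        WHERE NOT EXISTS (SELECT 1 FROM child c WHERE c.parent_id = p.parent_id)""",
}


def build(conn, parents, children, seed=1):
    """Create dummy tables; each child points at a random existing parent id."""
    conn.executescript("""
        DROP TABLE IF EXISTS child;
        DROP TABLE IF EXISTS parent;
        CREATE TABLE parent (parent_id INTEGER PRIMARY KEY);
        CREATE TABLE child (
            child_id  INTEGER PRIMARY KEY,
            parent_id INTEGER NOT NULL
        );
        CREATE INDEX child_parent_id ON child (parent_id);
    """)
    rng = random.Random(seed)
    conn.executemany("INSERT INTO parent VALUES (?)",
                     ((i,) for i in range(1, parents + 1)))
    conn.executemany("INSERT INTO child VALUES (?, ?)",
                     ((i, rng.randint(1, parents)) for i in range(1, children + 1)))
    conn.commit()


def benchmark(conn, repeats=5):
    """Run every variant `repeats` times; keep the row count and the best time in ms."""
    results = {}
    for name, sql in QUERIES.items():
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            (count,) = conn.execute(sql).fetchone()
            best = min(best, time.perf_counter() - start)
        results[name] = (count, best * 1000.0)
    return results


# Assumption: SQLite in memory, small arbitrary sizes. Swap in your own server,
# driver, and realistic data volumes before drawing any conclusions.
conn = sqlite3.connect(":memory:")
build(conn, parents=2000, children=2000)
results = benchmark(conn)
for name, (count, ms) in results.items():
    print(f"{name:<10} orphans={count:>5}  best={ms:8.3f} ms")
```

To use this against a real MySQL or MariaDB server, the connection and the data-loading step would need to be replaced with that server's driver and your actual (or realistically sized) tables; the measuring loop itself can stay the same.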

Answer 2

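It depends on which version of MySQL you are using. In older versions, IN ( SELECT ... ) performed terribly. In the latest versions, it is often as good as the other variants. Also, MariaDB has some optimization differences, probably in this area.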

EXISTS( SELECT 1 ... ) is perhaps the clearest in stating the intent. And it has probably always been fast (ever since it came into existence).

NOT IN and NOT EXISTS are different animals.
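Beyond performance, NOT IN and NOT EXISTS are also not semantically interchangeable when the subquery column can contain NULL. The sketch below demonstrates this with an in-memory SQLite database, used here only as a convenient stand-in; the behaviour comes from standard SQL three-valued logic, which MySQL and MariaDB also follow. The tiny tables and the nullable `child.parent_id` column are assumptions made for the illustration (in Answer 1's schema the column is NOT NULL, so the three queries agree there).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE parent (parent_id INTEGER PRIMARY KEY);
    -- parent_id is deliberately nullable in this illustration
    CREATE TABLE child (child_id INTEGER PRIMARY KEY, parent_id INTEGER);
    INSERT INTO parent VALUES (1), (2), (3);
    INSERT INTO child VALUES (10, 1), (11, NULL);
""")

# NOT IN: "2 NOT IN (1, NULL)" evaluates to NULL rather than TRUE,
# so a single NULL in the subquery filters out every parent row.
not_in = conn.execute("""
    SELECT p.parent_id FROM parent p
    WHERE p.parent_id NOT IN (SELECT c.parent_id FROM child c)
    ORDER BY p.parent_id
""").fetchall()

# NOT EXISTS: the NULL child row never satisfies c.parent_id = p.parent_id,
# so it is simply ignored and the orphaned parents are returned.
not_exists = conn.execute("""
    SELECT p.parent_id FROM parent p
    WHERE NOT EXISTS (SELECT 1 FROM child c WHERE c.parent_id = p.parent_id)
    ORDER BY p.parent_id
""").fetchall()

print("NOT IN:    ", not_in)
print("NOT EXISTS:", not_exists)
```

So before comparing the variants for speed, it is worth checking whether the column is declared NOT NULL; if it is not, the NOT IN form may need an explicit `WHERE c.parent_id IS NOT NULL` in the subquery to return the orphaned parents.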

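A couple of things in your question may have an impact: func and index_subquery. In similar queries you may not see these, and that difference could lead to a difference in performance.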

Or, to repeat myself:

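"There have been a lot of improvements to the Optimizer since 2009.

"To the Author (Quassnoi): Please rerun your tests, and specify which version they are run against. Note also that MySQL and MariaDB may yield different results.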

"To the Reader: Test the variants yourself, do not blindly trust the conclusions in this blog."
