MySQL slow duration or fetch time depending on "distinct" command

Question

Three indexes are defined:

UNIQUE unique_ID (refDate, instrument)

refDate (refDate)

instrument (instrument)

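The table currently has about 10 million rows, but for each refDate there are only about 5,000 distinct instruments at the moment.

I have a query that self-joins the table to produce output like this: refDate | rate for instrument X | rate for instrument Y | rate for instrument Z | ...

It basically returns time series data that I can then run my own analysis on.

Here is the problem. My original query looks like this: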

Select distinct AUDSpot1yFq.refDate,AUDSpot1yFq.rate as 'AUDSpot1yFq',
AUD1y1yFq.rate as AUD1y1yFq
from audratedb AUDSpot1yFq inner join audratedb AUD1y1yFq on
AUDSpot1yFq.refDate=AUD1y1yFq.refDate 
where AUDSpot1yFq.instrument = 'AUDSpot1yFq' and 
AUD1y1yFq.instrument = 'AUD1y1yFq' 
order by AUDSpot1yFq.refDate

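Note that in the query I actually timed, I am fetching 10 different instruments, so the query is much longer, but it follows the same naming pattern, inner joins, and where clauses.

This is slow. In Workbench I timed it at 7-8 seconds of duration (but close to 0 fetch time, since Workbench runs on the same machine as the server). When I strip the distinct, the duration drops to 0.25-0.5 seconds (much more manageable), and when I also strip the "order by" it gets faster still (<0.1 seconds, at which point I no longer care). But then my fetch time jumps to 7 seconds. So overall I gain nothing; the cost has simply moved into the fetch time. When I run the query from the Python script that will do the heavy lifting, I get roughly the same timing whether or not distinct is included.

When I run EXPLAIN on my cut-down query (the one with the terrible fetch time), I get: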

id  select_type  table        type  possible_keys                 key         key_len  ref                                          rows  filtered  Extra
1   SIMPLE       AUDSpot1yFq  ref   unique_ID,refDate,instrument  instrument  39       const                                        1432  100.00    Using where
1   SIMPLE       AUD1y1yFq    ref   unique_ID,refDate,instrument  unique_ID   42       historicalratesdb.AUDSpot1yFq.refDate,const  1     100.00    Using where
1   SIMPLE       AUD2y1yFq    ref   unique_ID,refDate,instrument  unique_ID   42       historicalratesdb.AUDSpot1yFq.refDate,const  1     100.00    Using where
1   SIMPLE       AUD3y1yFq    ref   unique_ID,refDate,instrument  unique_ID   42       historicalratesdb.AUDSpot1yFq.refDate,const  1     100.00    Using where
1   SIMPLE       AUD4y1yFq    ref   unique_ID,refDate,instrument  unique_ID   42       historicalratesdb.AUDSpot1yFq.refDate,const  1     100.00    Using where
1   SIMPLE       AUD5y1yFq    ref   unique_ID,refDate,instrument  unique_ID   42       historicalratesdb.AUDSpot1yFq.refDate,const  1     100.00    Using where
1   SIMPLE       AUD6y1yFq    ref   unique_ID,refDate,instrument  unique_ID   42       historicalratesdb.AUDSpot1yFq.refDate,const  1     100.00    Using where
1   SIMPLE       AUD7y1yFq    ref   unique_ID,refDate,instrument  unique_ID   42       historicalratesdb.AUDSpot1yFq.refDate,const  1     100.00    Using where
1   SIMPLE       AUD8y1yFq    ref   unique_ID,refDate,instrument  unique_ID   42       historicalratesdb.AUDSpot1yFq.refDate,const  1     100.00    Using where
1   SIMPLE       AUD9y1yFq    ref   unique_ID,refDate,instrument  unique_ID   42       historicalratesdb.AUDSpot1yFq.refDate,const  1     100.00    Using where

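I now realize that distinct isn't necessary, and since I load the output into a dataframe, I can drop the order by and sort in pandas. Great. But I don't know how to reduce the fetch time. I won't win any competency contests on this site, but I have searched as best I can and cannot find a solution to this problem. Any help is greatly appreciated.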

~cocoa

Answer 1

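The question doesn't mention the existing indexes, nor does it show the EXPLAIN output for any of the queries.

The quick answer for improving performance is to add an index: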

   ... ON audratedb (instrument,refdate,rate)

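To answer why we would add that index, we need to understand how MySQL processes SQL statements: which operations are possible and which are required. To see how MySQL actually processes your statement, use EXPLAIN to look at the query plan.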
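
As a rough sketch of what adding that index and re-checking the plan could look like: the table and column names come from the question, while the index name idx_instrument_refdate_rate is hypothetical and the query is the two-instrument version rather than the full ten-instrument one.

```sql
-- Hypothetical index name; the column order is the one suggested above.
ALTER TABLE audratedb
    ADD INDEX idx_instrument_refdate_rate (instrument, refDate, rate);

-- Re-run EXPLAIN on the (shortened) query to see whether the new index is chosen.
EXPLAIN
SELECT s.refDate,
       s.rate AS AUDSpot1yFq,
       y.rate AS AUD1y1yFq
FROM audratedb AS s
JOIN audratedb AS y ON s.refDate = y.refDate
WHERE s.instrument = 'AUDSpot1yFq'
  AND y.instrument = 'AUD1y1yFq';
```

If the optimizer picks the new index and it covers the query, the Extra column would typically show "Using index" for those rows; whether it does depends on the actual schema and data.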

Answer 2

(I had to simplify the table aliases in order to read it:)

Select  distinct
           s.refDate,
           s.rate as AUDSpot1yFq,
           y.rate as AUD1y1yFq
    from  audratedb AS s
    join  audratedb AS y  on s.refDate = y.refDate
    where  s.instrument = 'AUDSpot1yFq'
      and  y.instrument = 'AUD1y1yFq'
    order by  s.refDate

Needs an index:

INDEX(instrument, refDate)  -- To filter and sort, or
INDEX(instrument, refDate, rate)  -- to also "cover" the query.

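That assumes the query is no more complex than what you showed. I see that the EXPLAIN already involves more tables. Please provide SHOW CREATE TABLE audratedb and the entire SELECT.

Back to your questions...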

DISTINCT is done in one of two ways: (1) sort the rows, then dedup, or (2) dedup in an in-memory hash. Keep in mind that you are dedupping on all 3 columns (refDate, s.rate, y.rate).

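ORDER BY is a sort performed after all the data has been gathered. However, with the suggested index (not the indexes you currently have), no sort is needed, because the index delivers the rows in the desired order.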

But... having both DISTINCT and ORDER BY may confuse the Optimizer to the point where it does something 'dumb'.

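You say that (refDate, instrument) is UNIQUE, but you don't mention a PRIMARY KEY, nor which Engine you are using. If you are using InnoDB, then PRIMARY KEY(instrument, refDate), in that order, would speed things up further and avoid the need for any new index.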

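Furthermore, it is redundant to have both (a,b) and (a). That is, your current schema does not need INDEX(refDate); but after changing the PK, it is INDEX(instrument) that you would not need instead.

Bottom line: have only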

PRIMARY KEY(instrument, refDate),
INDEX(refDate)

and no other indexes (unless you can show some query that needs one).
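
A possible way to get to that layout is sketched below. It assumes InnoDB, that the table has no existing PRIMARY KEY, that both columns are NOT NULL, and that the index names match those shown in the EXPLAIN output (unique_ID, refDate, instrument); none of this is confirmed by the question, so check SHOW CREATE TABLE audratedb first.

```sql
-- Assumptions: InnoDB, no existing PRIMARY KEY, instrument and refDate are NOT NULL,
-- and the indexes are named unique_ID and instrument as in the EXPLAIN output.
-- On roughly 10 million rows this rebuilds the table, so expect it to take a while.
ALTER TABLE audratedb
    DROP INDEX unique_ID,      -- replaced by the new primary key
    DROP INDEX instrument,     -- redundant: it is the leading column of the new PK
    ADD PRIMARY KEY (instrument, refDate);

-- The existing single-column index on refDate is left in place.
```

Doing all three changes in one ALTER TABLE avoids rebuilding the table more than once.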

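More on the EXPLAIN. Notice how the rows column says 1432, 1, 1, ... That means it scanned an estimated 1432 rows of the first table. That is probably far more than necessary, because a proper index is missing. After that, it only needed to look at 1 row in each of the other tables. (It can't get better than that.)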

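How many rows does the SELECT return without the DISTINCT or ORDER BY? That tells you how much work remains after the fetching and JOINing are done. I suspect it is only a few. A "few" rows are really cheap for DISTINCT and ORDER BY; hence I think you are barking up the wrong tree. Even 1432 rows would be very fast to process.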
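
One way to get that number is to count the joined rows directly. This sketch uses the two-instrument version of the query; the remaining joins of the full query would need to be added in the same pattern.

```sql
-- Rows produced by the join before any DISTINCT or ORDER BY is applied.
SELECT COUNT(*) AS joined_rows
FROM audratedb AS s
JOIN audratedb AS y ON s.refDate = y.refDate
WHERE s.instrument = 'AUDSpot1yFq'
  AND y.instrument = 'AUD1y1yFq';
```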

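As for the buffer_pool... How big is the table? Do SHOW TABLE STATUS. I suspect the table is more than 1GB and therefore does not fit in the buffer_pool. Raising that cache size would let the query run in RAM instead of hitting the disk (at least once the data is cached). Keep in mind that running a query against a cold cache involves a lot of I/O. As the cache warms up, queries run faster. But if the cache is too small, you will keep needing I/O, and I/O is the slowest part of the processing.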

I hope you have at least 6GB of RAM; otherwise, 2G could be too large. Swapping is really bad for performance.
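
The table size and the current cache size can be compared with something like the following. It assumes InnoDB and that the schema is named historicalratesdb, as it appears in the ref column of the EXPLAIN output.

```sql
-- Data_length and Index_length are reported in bytes.
SHOW TABLE STATUS LIKE 'audratedb';

-- The same figures summed up in megabytes (schema name taken from the EXPLAIN output).
SELECT ROUND((data_length + index_length) / 1024 / 1024) AS size_mb
FROM information_schema.TABLES
WHERE table_schema = 'historicalratesdb'
  AND table_name = 'audratedb';

-- Current InnoDB buffer pool size, in bytes.
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
```

If size_mb is well above the buffer pool size, repeated runs of the query are likely to keep hitting the disk.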

mysql

optimization

duration

distinct

fetch