MySQL GROUP BY is slower when using index

I'm running on an AWS m4.large (2 vCPUs, 8 GB RAM), and I'm seeing some surprising behavior with MySQL and GROUP BY. I have this test database:

CREATE TABLE demo (
  time INT,
  word VARCHAR(30),
  count INT
);
CREATE INDEX timeword_idx ON demo(time, word);

I inserted 4,000,000 records with (uniformly) random words "t%s" % random.randint(0, 30000) and times random.randint(0, 86400).
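For anyone wanting to reproduce the data set, here is a minimal sketch of a generator built from the two expressions above. The function name generate_rows and the seed parameter are hypothetical, and the range used for count is purely an assumption, since the question does not say how count was filled.

```python
import random

def generate_rows(n, seed=None):
    """Yield n (time, word, count) tuples with uniformly random time and word."""
    rng = random.Random(seed)
    for _ in range(n):
        time = rng.randint(0, 86400)            # same range as in the question
        word = "t%s" % rng.randint(0, 30000)    # same expression as in the question
        count = rng.randint(1, 100)             # ASSUMPTION: the question gives no range for count
        yield (time, word, count)

# Small sample; the question used n = 4,000,000.
sample = list(generate_rows(1000, seed=42))
```

The tuples can then be passed to whatever bulk-insert mechanism your MySQL client library offers.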

SELECT word, time, sum(count) FROM demo GROUP BY time, word;
3996922 rows in set (1 min 28.29 sec)

EXPLAIN SELECT word, time, sum(count) FROM demo GROUP BY time, word;
+----+-------------+-------+-------+---------------+--------------+---------+------+---------+-------+
| id | select_type | table | type  | possible_keys | key          | key_len | ref  | rows    | Extra |
+----+-------------+-------+-------+---------------+--------------+---------+------+---------+-------+
|  1 | SIMPLE      | demo  | index | NULL          | timeword_idx | 38      | NULL | 4002267 |       |
+----+-------------+-------+-------+---------------+--------------+---------+------+---------+-------+

Then I run it without the index:

SELECT word, time, sum(count) FROM demo IGNORE INDEX (timeword_idx) GROUP BY time, word;
3996922 rows in set (34.75 sec)

EXPLAIN SELECT word, time, sum(count) FROM demo IGNORE INDEX (timeword_idx) GROUP BY time, word;
+----+-------------+-------+------+---------------+------+---------+------+---------+---------------------------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows    | Extra                           |
+----+-------------+-------+------+---------------+------+---------+------+---------+---------------------------------+
|  1 | SIMPLE      | demo  | ALL  | NULL          | NULL | NULL    | NULL | 4002267 | Using temporary; Using filesort |
+----+-------------+-------+------+---------------+------+---------+------+---------+---------------------------------+

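As you can see, using the index makes the query take about 2.5 times as long (88 s versus 35 s). I'm not that surprised: by using the index, the query can avoid reading the time and word columns from the table, but the index is so sparse (almost every row is its own group) that this gains little. Instead, it turns the direct scan into a random access pattern when retrieving count.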

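I just want to confirm that this is the reason, and I'd like to know whether there is a "compact rule" for when an index ends up giving worse performance when used for GROUP BY.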

Edit:

I followed Gordon Linoff's answer and used:

CREATE INDEX timeword_idx ON demo(time, word, count);

With this "covering index", the result is computed about 10 times faster than with the full scan:

SELECT word, time, sum(count) FROM demo GROUP BY time, word;
3996922 rows in set (3.36 sec)

EXPLAIN SELECT word, time, sum(count) FROM demo GROUP BY time, word;
+----+-------------+-------+-------+---------------+--------------+---------+------+---------+-------------+
| id | select_type | table | type  | possible_keys | key          | key_len | ref  | rows    | Extra       |
+----+-------------+-------+-------+---------------+--------------+---------+------+---------+-------------+
|  1 | SIMPLE      | demo  | index | NULL          | timeword_idx | 43      | NULL | 4002267 | Using index |
+----+-------------+-------+-------+---------------+--------------+---------+------+---------+-------------+

Impressive!

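Your table is reasonably sized, so the issue could be sequential access of the data, or thrashing. Using the index requires going through the index and then looking up each row in the data pages to get count.

This can actually be worse than just reading the pages and doing the sort, because the pages are not read in order. Sequential reads are much better optimized than random reads. In the worst case, the page cache is full and random reads force pages to be flushed. If that happens, a single page may have to be read multiple times. With only 4 million relatively small rows, thrashing is unlikely unless memory is severely constrained.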
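To make the "single page read multiple times" effect concrete, here is a toy model in Python. It is only a sketch under stated assumptions: rows sit in pages in insertion order, the index visits them in an effectively random order (because the keys were random), and the cache is a plain LRU. None of this is how MySQL's storage engines are actually implemented, and the names and sizes (count_page_reads, rows_per_page, cache_pages) are made up for illustration.

```python
import random
from collections import OrderedDict

def count_page_reads(access_order, rows_per_page, cache_pages):
    """Count page fetches for a sequence of row ids, using a simple LRU page cache."""
    cache = OrderedDict()
    reads = 0
    for row_id in access_order:
        page = row_id // rows_per_page
        if page in cache:
            cache.move_to_end(page)        # cache hit: mark page as recently used
        else:
            reads += 1                     # cache miss: page must be fetched
            cache[page] = True
            if len(cache) > cache_pages:
                cache.popitem(last=False)  # evict the least recently used page
    return reads

random.seed(0)
n_rows = 200_000        # hypothetical, scaled down from 4,000,000
rows_per_page = 100     # hypothetical page capacity
cache_pages = 200       # hypothetical cache: 10% of the 2,000 data pages

table_order = list(range(n_rows))   # full scan: rows visited in storage order
index_order = table_order[:]        # index scan: rows visited in key order,
random.shuffle(index_order)         # which is random relative to storage order

sequential_reads = count_page_reads(table_order, rows_per_page, cache_pages)
index_reads = count_page_reads(index_order, rows_per_page, cache_pages)
print(sequential_reads, index_reads)
```

In this model the full scan fetches every page exactly once, while the index-ordered walk misses the cache on most rows and so fetches the same pages again and again. With a cache large enough to hold all pages, the gap in fetch counts disappears and only the sequential-versus-random cost difference remains.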

If this explanation is correct, then including count in the index should speed up the query:

CREATE INDEX timeword_idx ON demo(time, word, count);
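One way to see what "covering" means without a MySQL server at hand is the sketch below. It swaps in SQLite (through Python's standard sqlite3 module) as a stand-in engine, so the plan text and the optimizer are SQLite's, not MySQL's; only the schema and the query are taken from the question. The point is just that once count is part of the index, the planner can answer the query from the index alone.

```python
import sqlite3

# In-memory SQLite database used purely as a stand-in for MySQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE demo (time INT, word VARCHAR(30), count INT);
    CREATE INDEX timeword_idx ON demo(time, word, count);
""")

# Ask SQLite how it would execute the GROUP BY query from the question.
plan_rows = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT word, time, sum(count) FROM demo GROUP BY time, word"
).fetchall()

# The human-readable plan description is the fourth column of each row.
plan_text = " | ".join(row[3] for row in plan_rows)
print(plan_text)
conn.close()
```

In SQLite's output the thing to look for is the phrase COVERING INDEX; in MySQL the equivalent hint is "Using index" in the Extra column of EXPLAIN.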

From the manual page How MySQL Uses Indexes:

Indexes are less important for queries on small tables, or big tables where report queries process most or all of the rows. When a query needs to access most of the rows, reading sequentially is faster than working through an index. Sequential reads minimize disk seeks, even if not all the rows are needed for the query.

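As for adding covering indexes over more columns (indexes in which all the needed data is available, so the data pages are never accessed), be careful. They come at a cost. In your case the index was going to be wide anyway, but it is always a balance that needs care.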

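As spencer alluded to, cardinality always plays into range. For cardinality information, use the show index from tblName command. It is not the driving issue for your query, but it is useful in other settings. Let me put that differently: the cardinality of your table is very high, so for that query the index was a hindrance rather than a help.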
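As a rough check on how high the cardinality is, the expected number of distinct (time, word) pairs can be worked out from the data generation described in the question. This is a back-of-the-envelope sketch that assumes the two values are independent and uniform, as the question states; the variable names are only illustrative.

```python
import math

n_rows = 4_000_000
n_times = 86_401    # random.randint(0, 86400) is inclusive at both ends
n_words = 30_001    # random.randint(0, 30000) is inclusive at both ends
n_pairs = n_times * n_words   # number of possible (time, word) combinations

# Expected number of distinct pairs after n_rows uniform draws:
#   n_pairs * (1 - (1 - 1/n_pairs) ** n_rows)
# written with log1p/expm1 to avoid floating-point cancellation.
expected_groups = n_pairs * -math.expm1(n_rows * math.log1p(-1.0 / n_pairs))

print(round(expected_groups), round(expected_groups / n_rows, 4))
```

The estimate comes out just under 4,000,000, in line with the 3996922 rows the queries returned: nearly every row is its own group, so grouping collapses almost nothing and the query has to touch essentially every row either way.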