MySQL GROUP BY 使查询速度减慢 x1000 倍

Question

我正在努力为使用 MySQL 数据库的 Django 应用程序设置适当、有效的索引。问题出在文章 table 上，它目前有超过 100 万行，查询速度没有我们想要的那么快。

文章 table 结构大致如下所示：

Field   Type
id  int
date_published  datetime(6) 
date_retrieved  datetime(6) 
title   varchar(500)    
author  varchar(200)    
content longtext    
source_id   int
online  tinyint(1)
main_article_of_duplicate_group tinyint(1)

经过多次尝试，我得出以下索引给出了最佳性能：

CREATE INDEX search_index ON newsarticle(date_published DESC, main_article_of_duplicate_group, source_id, online);

有问题的查询是：

SELECT 
    `newsarticle`.`id`,
    `newsarticle`.`url`,
    `newsarticle`.`date_published`,
    `newsarticle`.`date_retrieved`,
    `newsarticle`.`title`,
    `newsarticle`.`summary_provided`,
    `newsarticle`.`summary_generated`,
    `newsarticle`.`source_id`,
    COUNT(CASE WHEN `newsarticlefeedback`.`is_relevant` THEN `newsarticlefeedback`.`id` ELSE NULL END) AS `count_relevent`,
    COUNT(`newsarticlefeedback`.`id`) AS `count_nonrelevent`,
    (
      SELECT U0.`is_relevant`
      FROM `newsarticlefeedback` U0
      WHERE (U0.`news_id_id` = `newsarticle`.`id` AND U0.`user_id_id` = 27)
      ORDER BY U0.`created_date` DESC
      LIMIT 1
    ) AS `is_relevant`,
    CASE
        WHEN `newsarticle`.`content` = '' THEN 0
        ELSE 1
    END AS `is_content`,
    `newsproviders_newsprovider`.`id`,
    `newsproviders_newsprovider`.`name_long`
FROM
    `newsarticle` USE INDEX (SEARCH_INDEX)
        INNER JOIN
    `newsarticle_topics` ON (`newsarticle`.`id` = `newsarticle_topics`.`newsarticle_id`)
        LEFT OUTER JOIN
    `newsarticlefeedback` ON (`newsarticle`.`id` = `newsarticlefeedback`.`news_id_id`)
        LEFT OUTER JOIN
    `newsproviders_newsprovider` ON (`newsarticle`.`source_id` = `newsproviders_newsprovider`.`id`)
WHERE
    ((1)
        AND `newsarticle`.`main_article_of_duplicate_group`
        AND `newsarticle`.`online`
        AND `newsarticle_topics`.`newstopic_id` = 42
        AND `newsarticle`.`date_published` >= '2020-08-08 08:39:03.199488')
GROUP BY `newsarticle`.`id`
ORDER BY `newsarticle`.`date_published` DESC
LIMIT 30

注意：我必须显式使用索引，否则查询会慢得多。本次查询耗时约1.4s。

但是当我只删除 GROUP BY 语句时，查询需要 acceptable 1-10 毫秒。我试图将新闻文章 ID 添加到不同位置的索引，但没有成功。

这是 EXPLAIN 的输出（来自 Django）：

ID  SELECT_TYPE TABLE   PARTITIONS  TYPE    POSSIBLE_KEYS   KEY KEY_LEN REF ROWS    FILTERED    EXTRA
1   PRIMARY newsarticle_topics  None    ref newsarticle_t_newsarticle_id_newstopic_6b1123b3_uniq,newsartic_newstopic_id_ddd996b6_fk_summarize   newsartic_newstopic_id_ddd996b6_fk_summarize    4   const   312628  100.0   Using temporary; Using filesort
1   PRIMARY newsarticle None    eq_ref  PRIMARY,newsartic_source_id_6ea2b978_fk_summarize,newsartic_topic_id_b67ae2c9_fk_summarize,kek,last_updated,last_update,search_index,fulltext_idx_content   PRIMARY 4   newstech.newsarticle_topics.newsarticle_id  1   22.69   Using where
1   PRIMARY newsarticlefeedback None    ref newsartic_news_id_id_5af7594b_fk_summarize  newsartic_news_id_id_5af7594b_fk_summarize  5   newstech.newsarticle_topics.newsarticle_id  1   100.0   None
1   PRIMARY newsproviders_newsprovider  None    eq_ref  PRIMARY,    PRIMARY 4   newstech.newsarticle.source_id  1   100.0   None
2   DEPENDENT SUBQUERY  U0  None    ref newsartic_news_id_id_5af7594b_fk_summarize,newsartic_user_id_id_fc217cfe_fk_auth_user   newsartic_user_id_id_fc217cfe_fk_auth_user  5   const   1   10.0    Using where; Using filesort

有趣的是，同一个查询在 MySQL Workbench 和 Django 调试工具栏中给出了不同的 EXPLAIN（如果你愿意，我也可以从 workbench 粘贴 EXPLAIN）。但性能或多或少是相同的。您是否知道如何增强索引以便快速搜索？

谢谢

编辑：我在此处粘贴 EXPLAIN from MySQL Workbench，它不同但似乎更真实（不确定为什么 Django 调试工具栏解释不同）

id  select_type table   partitions  type    possible_keys   key key_len ref rows    filtered    Extra
1   PRIMARY newsarticle NULL    range   PRIMARY,newsartic_source_id_6ea2b978_fk_,newsartic_topic_id_b67ae2c9_fk,kek,last_updated,last_update,search_index,fulltext_idx_content  search_index    8   NULL    227426  81.00   Using index condition; Using MRR; Using temporary; Using filesort
1   PRIMARY newsarticle_topics  NULL    eq_ref  newsarticle_t_newsarticle_id_newstopic_6b1123b3_uniq,newsartic_newstopic_id_ddd996b6_fk newsarticle_t_newsarticle_id_newstopic_6b1123b3_uniq    8   newstech.newsarticle.id,const   1   100.00  Using index
1   PRIMARY newsarticlefeedback NULL    ref newsartic_news_id_id_5af7594b_fk    newsartic_news_id_id_5af7594b_fk    5   newstech.newsarticle.id 1   100.00  NULL
1   PRIMARY newsproviders_newsprovider  NULL    eq_ref  PRIMARY PRIMARY 4   newstech.newsarticle.source_id  1   100.00  NULL
2   DEPENDENT SUBQUERY  U0  NULL    ref newsartic_news_id_id_5af7594b_fk,newsartic_user_id_id_fc217cfe_fk_auth_user newsartic_user_id_id_fc217cfe_fk_auth_user  5   const   1   10.00   Using where; Using filesort

编辑2：下面是当我从查询中删除 GROUP BY 时的 EXPLAIN（使用 MySQL Workbench）：

id,select_type,table,partitions,type,possible_keys,key,key_len,ref,rows,filtered,Extra
1,SIMPLE,newsarticle,NULL,range,search_index,search_index,8,NULL,227426,81.00,"Using index condition"
1,SIMPLE,newsarticle_topics,NULL,eq_ref,"newsarticle_t_newsarticle_id_newstopic_6b1123b3_uniq,newsartic_newstopic_id_ddd996b6_fk",newsarticle_t_newsarticle_id_newstopic_6b1123b3_uniq,8,"newstech.newsarticle.id,const",1,100.00,"Using index"
1,SIMPLE,newsarticlefeedback,NULL,ref,newsartic_news_id_id_5af7594b_fk,newsartic_news_id_id_5af7594b_fk,5,newstech.newsarticle.id,1,100.00,"Using index"
1,SIMPLE,newsproviders_newsprovider,NULL,eq_ref,"PRIMARY,",PRIMARY,4,newstech.newsarticle.source_id,1,100.00,NULL

编辑 3：

应用 Rick 建议的更改后（谢谢！）：

newsarticle(id, 在线, main_article_of_duplicate_group, date_published) newsarticle_topics (newstopic_id, newsarticle_id) 和 (newsarticle_id, newstopic_id)

的两个索引

WITH USE_INDEX（耗时 1.2 秒）

解释：

id  select_type table   partitions  type    possible_keys   key key_len ref rows    filtered    Extra
1   PRIMARY newsarticle_topics  NULL    ref newsarticle_t_newsarticle_id_newstopic_6b1123b3_uniq,opposite   opposite    4   const   346286  100.00  Using index; Using temporary; Using filesort
1   PRIMARY newsarticle NULL    ref search_index    search_index    4   newstech.newsarticle_topics.newsarticle_id  1   27.00   Using index condition
1   PRIMARY newsproviders_newsprovider  NULL    eq_ref  PRIMARY,filter_index    PRIMARY 4   newstech.newsarticle.source_id  1   100.00  NULL
4   DEPENDENT SUBQUERY  U0  NULL    ref newsartic_news_id_id_5af7594b_fk,feedback_index feedback_index  5   newstech.newsarticle.id 1   100.00  Using filesort
3   DEPENDENT SUBQUERY  U0  NULL    ref newsartic_news_id_id_5af7594b_fk,feedback_index newsartic_news_id_id_5af7594b_fk    5   newstech.newsarticle.id 1   10.00   Using where
2   DEPENDENT SUBQUERY  U0  NULL    ref newsartic_news_id_id_5af7594b_fk,feedback_index newsartic_news_id_id_5af7594b_fk    5   newstech.newsarticle.id 1   90.00   Using where

WITHOUT USE_INDEX 子句（耗时 2.6 秒）

id  select_type table   partitions  type    possible_keys   key key_len ref rows    filtered    Extra
1   PRIMARY newsarticle_topics  NULL    ref newsarticle_t_newsarticle_id_newstopic_6b1123b3_uniq,opposite   opposite    4   const   346286  100.00  Using index; Using temporary; Using filesort
1   PRIMARY newsarticle NULL    eq_ref  PRIMARY,search_index    PRIMARY 4   newstech.newsarticle_topics.newsarticle_id  1   27.00   Using where
1   PRIMARY newsproviders_newsprovider  NULL    eq_ref  PRIMARY,filter_index    PRIMARY 4   newstech.newsarticle.source_id  1   100.00  NULL
4   DEPENDENT SUBQUERY  U0  NULL    ref newsartic_news_id_id_5af7594b_fk,feedback_index feedback_index  5   newstech.newsarticle.id 1   100.00  Using filesort
3   DEPENDENT SUBQUERY  U0  NULL    ref newsartic_news_id_id_5af7594b_fk,feedback_index newsartic_news_id_id_5af7594b_fk    5   newstech.newsarticle.id 1   10.00   Using where
2   DEPENDENT SUBQUERY  U0  NULL    ref newsartic_news_id_id_5af7594b_fk,feedback_index newsartic_news_id_id_5af7594b_fk    5   newstech.newsarticle.id 1   90.00   Using where

用于比较索引 - newsarticle(date_published DESC, main_article_of_duplicate_group, source_id, online) with USE INDEX（只需要 1-3ms！）

id  select_type table   partitions  type    possible_keys   key key_len ref rows    filtered    Extra
1   PRIMARY newsarticle NULL    range   search_index    search_index    8   NULL    238876  81.00   Using index condition
1   PRIMARY newsproviders_newsprovider  NULL    eq_ref  PRIMARY,filter_index    PRIMARY 4   newstech.newsarticle.source_id  1   100.00  NULL
1   PRIMARY newsarticle_topics  NULL    eq_ref  newsarticle_t_newsarticle_id_newstopic_6b1123b3_uniq,opposite   newsarticle_t_newsarticle_id_newstopic_6b1123b3_uniq    8   newstech.newsarticle.id,const   1   100.00  Using index
4   DEPENDENT SUBQUERY  U0  NULL    ref newsartic_news_id_id_5af7594b_fk,feedback_index feedback_index  5   newstech.newsarticle.id 1   100.00  Using filesort
3   DEPENDENT SUBQUERY  U0  NULL    ref newsartic_news_id_id_5af7594b_fk,feedback_index feedback_index  6   newstech.newsarticle.id,const   1   100.00  Using index
2   DEPENDENT SUBQUERY  U0  NULL    ref newsartic_news_id_id_5af7594b_fk,feedback_index feedback_index  5   newstech.newsarticle.id 1   90.00   Using where; Using index

Answer 1

终于，我弄清楚了这个查询的问题所在。首先，在 Django 中，当在注释中使用 Count 时，会自动添加 GROUP BY 语句。所以最简单的解决方案是通过嵌套注释来避免它。

这在此处的答案中有很好的解释

感谢大家的时间和帮助:)

Answer 2

main_article_of_duplicate_group 是 true/false 标志吗？

如果优化器选择从newsarticle_topics开始：

 newsarticle_topics:  INDEX(newstopic_id, newsarticle_id)
 newsarticle:  INDEX(newsarticle_id, online,
                     main_article_of_duplicate_group, date_published)

如果 newsarticle_topics 是一个 many-to-many 映射 table，去掉 id 并使 PRIMARY KEY 成为那个对，加上一个二级索引相反的方向。更多讨论：http://mysql.rjweb.org/doc.php/index_cookbook_mysql#many_to_many_mapping_table

如果优化器选择从 newsarticle 开始（这似乎更有可能）：

 newsarticle_topics:  INDEX(newsarticle_id, newstopic_id)
 newsarticle:  INDEX(online, main_article_of_duplicate_group, date_published)

与此同时，newsarticlefeedback 需要这个，按照给定的顺序：

INDEX(news_id_id, user_id_id, created_date, isrelevant)

而不是

    COUNT(`newsarticlefeedback`.`id`) AS `count_nonrelevent`,
    LEFT OUTER JOIN  `newsarticlefeedback`
          ON (`newsarticle`.`id` = `newsarticlefeedback`.`news_id_id`)

有

    ( SELECT COUNT(*) FROM newsarticlefeedback
          WHERE `newsarticle`.`id` = `newsarticlefeedback`.`news_id_id`
    ) AS `count_nonrelevent`,

Answer 3

我碰巧有一种技术可以很好地处理按日期分类、过滤和排序的“新闻文章”。它甚至可以处理“禁运”、“过期”、软“删除”等

大目标是执行时只触及 30 行

ORDER BY `newsarticle`.`date_published` DESC
LIMIT 30

但目前 WHERE 子句必须查看两个 table 才能进行过滤。这导致触及 35K 或可能更多的行。

它需要在具有 3 列的一侧构建一个简单的 table：

主题（或其他过滤类别），
日期（仅获取最新的 30 个），
article_id（仅进行 30 次 JOIN 以获得文章信息的其余部分）

Suitable 索引table 使搜索非常高效。

With suitable DELETEs in this table，像online或main_article这样的简单标志可以得到有效处理。不要在这个额外的 table 中包含标志；而是不包括不应显示的任何行。

更多详情：http://mysql.rjweb.org/doc.php/lists

（我看到其他“新闻”网站因为没有使用这种技术而崩溃。）

请注意，30 和 35K 之间的差异约为 1000 倍。

MySQL GROUP BY 使查询速度减慢 x1000 倍

MySQL GROUP BY slows down query x1000 times

mysql

sql

django

indexing

query-performance