Get Non-Aggregate data in Aggregate query

Question

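I'm not a SQL professional, but I believe I've solved my problem, albeit in a fairly inefficient way. I'm hoping someone can point out a better approach than the one I came up with. I'm trying to find duplicate or similar content in the term index created by Relevanssi (a full-text search plugin for WordPress). However, this happens outside the WordPress installation, directly against the database, so WordPress, its APIs, and any other tables normally associated with it are out of scope here.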

The Relevanssi index table looks like this:

CREATE TABLE `wp_relevanssi` (
 `doc` bigint(20) NOT NULL DEFAULT '0',
 `term` varchar(50) NOT NULL DEFAULT '0',
 `content` mediumint(9) NOT NULL DEFAULT '0',
 `title` mediumint(9) NOT NULL DEFAULT '0',
 `comment` mediumint(9) NOT NULL DEFAULT '0',
 `tag` mediumint(9) NOT NULL DEFAULT '0',
 `link` mediumint(9) NOT NULL DEFAULT '0',
 `author` mediumint(9) NOT NULL DEFAULT '0',
 `category` mediumint(9) NOT NULL DEFAULT '0',
 `excerpt` mediumint(9) NOT NULL DEFAULT '0',
 `taxonomy` mediumint(9) NOT NULL DEFAULT '0',
 `customfield` mediumint(9) NOT NULL DEFAULT '0',
 `mysqlcolumn` mediumint(9) NOT NULL DEFAULT '0',
 `taxonomy_detail` longtext NOT NULL,
 `customfield_detail` longtext NOT NULL,
 `mysqlcolumn_detail` longtext NOT NULL,
 `type` varchar(210) NOT NULL DEFAULT 'post',
 `item` bigint(20) NOT NULL DEFAULT '0',
 `term_reverse` varchar(50) NOT NULL DEFAULT '0',
 UNIQUE KEY `doctermitem` (`doc`,`term`,`item`),
 KEY `terms` (`term`(20)),
 KEY `docs` (`doc`),
 KEY `typeitem` (`type`,`item`),
 KEY `relevanssi_term_reverse_idx` (`term_reverse`(10))
) ENGINE=InnoDB DEFAULT CHARSET=utf8

I've managed to get (I think) the information I want with the following query:

SELECT r1.doc, r2.doc, 
    50 * COUNT( r1.term ) * (
        (c1.total + c2.total) / 
        ( c1.total * c2.total ) 
    ) AS ScorePct
FROM  `wp_relevanssi` r1
LEFT JOIN  `wp_relevanssi` r2 
ON r1.term = r2.term
AND r1.doc > r2.doc
AND r1.type = r2.type
AND (r1.content > 0 or r1.title > 0 or r1.taxonomy > 0 or r1.tag > 0)
AND (r2.content > 0 or r2.title > 0 or r2.taxonomy > 0 or r2.tag > 0)
LEFT JOIN (
    SELECT doc, COUNT( term ) AS total
    FROM  `wp_relevanssi` 
    GROUP BY doc
) c1 
ON r1.doc = c1.doc
LEFT JOIN (
    SELECT doc, COUNT( term ) AS total
    FROM  `wp_relevanssi` 
    GROUP BY doc
) c2 
ON r2.doc = c2.doc
GROUP BY r1.doc, r2.doc
HAVING ScorePct >50
ORDER BY ScorePct DESC

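My problem is those big ol' nasty subqueries dropped into the joins. I think I need at least one subquery to do this (essentially, to get the total number of terms for a particular document), because after the first LEFT JOIN the main query only has information about the matching terms; the non-matching ones have been discarded. (Please go ahead and tell me I'm wrong; I'd love to find out that no subquery is needed.)

Beyond that, is there a way to do this with a single subquery, or to otherwise improve the performance of this query? I fully expect it to be a very heavy query, and I have no qualms about that, but I'd like it to run as well as it can.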

Edit: I ended up solving this a different way. By looking at a single document at a time (as that document changes), I can simplify the query to:

SELECT r1.doc, r2.doc, count(*) AS matches
FROM  `wp_relevanssi` r1
INNER JOIN  `wp_relevanssi` r2 
ON r1.term = r2.term
AND r1.doc <> r2.doc
AND r1.type = r2.type
AND (r1.content > 0 or r1.title > 0 or r1.taxonomy > 0 or r1.tag > 0)
AND (r2.content > 0 or r2.title > 0 or r2.taxonomy > 0 or r2.tag > 0)
WHERE r1.doc = %d
GROUP BY r1.doc, r2.doc
ORDER BY matches DESC
LIMIT 0,10

This runs in a reasonable time even with 650,000 rows, and I follow it up with:

SELECT doc, COUNT( term ) AS total
FROM  `wp_relevanssi` 
WHERE doc IN (%d,%d,%d...)
GROUP BY doc

and then do the rest of the score matching outside the DB.

Answer 1

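COUNT(term) implies that you need to test term for being NOT NULL. If you don't, simply say COUNT(*).
Your LEFT JOINs seem to be identical; what gives? See below.
JOIN ( SELECT ... ) is poorly optimized when you have more than one of them.
LEFT implies that the 'table' on the 'right' may be missing rows, and that you want NULLs in that case. Do you need that?
"Prefix" indexes (KEY terms (term(20))) are rarely beneficial and often prevent the index from being used. Remove the (20).
InnoDB tables should have an explicit PRIMARY KEY. The UNIQUE key you have could be turned into it.
This query seems to be O(N*N). That is, it will get quadratically slower as wp_relevanssi grows.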
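
The two index suggestions in the list above could be applied along these lines. This is only a sketch: it assumes nothing else (for example, the Relevanssi plugin itself) depends on the existing index names doctermitem and terms, which is an assumption you should check before touching a live table.

```sql
-- Sketch only: verify that nothing relies on the old index definitions first.
ALTER TABLE wp_relevanssi
    -- Promote the existing UNIQUE key to an explicit PRIMARY KEY.
    -- All three columns are declared NOT NULL, so this is allowed.
    DROP INDEX doctermitem,
    ADD PRIMARY KEY (doc, term, item),
    -- Replace the prefix index term(20) with an index on the full column.
    DROP INDEX terms,
    ADD INDEX terms (term);
```

On a table with several hundred thousand rows this rebuilds the table, so expect it to take a while.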

For the dup subquery, consider the following, then use term_counts in both places.

CREATE TABLE term_counts (
    PRIMARY KEY(doc)
)
    SELECT doc,
           COUNT( term ) AS total
        FROM  `wp_relevanssi` 
        GROUP BY doc;
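
With term_counts in place, the original query could be rewritten roughly as follows. Treat it as a sketch rather than a tested drop-in: the aliases doc1 and doc2 are mine, and it uses plain JOINs and COUNT(*) on the assumption that unmatched rows are not wanted (the original HAVING clause already discards them, because their ScorePct is NULL).

```sql
SELECT r1.doc AS doc1,
       r2.doc AS doc2,
       -- term is NOT NULL, so COUNT(*) counts the same rows as COUNT(r1.term)
       50 * COUNT(*) * (c1.total + c2.total) / (c1.total * c2.total) AS ScorePct
FROM wp_relevanssi r1
JOIN wp_relevanssi r2
  ON  r1.term = r2.term
  AND r1.doc  > r2.doc
  AND r1.type = r2.type
-- the same precomputed table replaces both derived tables
JOIN term_counts c1 ON c1.doc = r1.doc
JOIN term_counts c2 ON c2.doc = r2.doc
WHERE (r1.content > 0 OR r1.title > 0 OR r1.taxonomy > 0 OR r1.tag > 0)
  AND (r2.content > 0 OR r2.title > 0 OR r2.taxonomy > 0 OR r2.tag > 0)
GROUP BY r1.doc, r2.doc, c1.total, c2.total
HAVING ScorePct > 50
ORDER BY ScorePct DESC;
```

Note that term_counts is a snapshot; it has to be rebuilt (or kept up to date) whenever the index changes.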

Because of this

(r1.content > 0 or r1.title > 0 or r1.taxonomy > 0 or r1.tag > 0)

you should consider copying all the rows that satisfy this filter into another table, then using that table instead.
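
One way to sketch this, assuming the rows you want to keep are the ones that satisfy the filter. The table name relevanssi_filtered is hypothetical, and only the columns the self-join actually uses are copied:

```sql
-- relevanssi_filtered is a made-up name for illustration.
CREATE TABLE relevanssi_filtered (
    PRIMARY KEY (doc, term, item)   -- still unique: this is a subset of the original rows
)
    SELECT doc, term, type, item
        FROM wp_relevanssi
        WHERE content > 0 OR title > 0 OR taxonomy > 0 OR tag > 0;
```

The self-join can then read from relevanssi_filtered for both r1 and r2 and drop the two OR conditions entirely. Like any copy, it needs refreshing when the index changes.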

Because of

ON r1.term = r2.term
AND r1.doc > r2.doc
AND r1.type = r2.type

I recommend

INDEX(term, type, doc)

(doc must be last; term and type can be in either order.)
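
As a sketch, the index could be added and checked like this. The index name term_type_doc and the document id 123 are placeholders of mine, not anything from the question:

```sql
-- Composite index for the self-join: equality columns first, doc last.
ALTER TABLE wp_relevanssi
    ADD INDEX term_type_doc (term, type, doc);

-- Check whether the optimizer picks the new index for r2.
EXPLAIN
SELECT r1.doc, r2.doc, COUNT(*) AS matches
FROM wp_relevanssi r1
INNER JOIN wp_relevanssi r2
   ON  r1.term = r2.term
   AND r1.doc <> r2.doc
   AND r1.type = r2.type
WHERE r1.doc = 123          -- placeholder document id
GROUP BY r1.doc, r2.doc
ORDER BY matches DESC
LIMIT 0, 10;
```

If the key column of the EXPLAIN output for r2 still shows the old terms index, the optimizer's choice may depend on the data distribution, so it is worth testing against the real table.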

mysql

indexing

performance

subquery

aggregate-functions