MySQL: 强制查询在 WHERE 子句中使用带有局部变量的索引
MySQL: Forcing query to use indices with local variable in WHERE clause
上下文
我有一个从 table 中选择加权随机条目的应用程序,(权重的)前缀求和是一个关键部分。简化的 table 定义如下所示:
CREATE TABLE entries (
id INT NOT NULL PRIMARY KEY AUTO_INCREMENT,
weight DECIMAL(9, 3),
fenwick DECIMAL(9, 3)
) ENGINE=MEMORY;
其中 `fenwick`
将值存储在 `weights`
的 Fenwick 树表示中。
让每个条目的 "range" 跨越其前缀和及其前缀和 + 其权重。应用程序必须在 0
和 SUM(weight)
之间生成一个随机数 @r
并找到范围包含 @r
的条目,如下所示:
Fenwick 树与 MEMORY
引擎和二进制搜索相结合,应该允许我在 O(lg^2(n))
时间内找到合适的条目,而不是 O(n)
时间天真的查询:
SELECT a.id-1 FROM (SELECT *, (@x:=@x+weight) AS counter FROM entries
CROSS JOIN (SELECT @x:=0) a
HAVING counter>@r LIMIT 1) a;
研究
由于多个查询的开销,我一直在尝试将前缀和操作压缩到一个查询中(而不是在脚本语言中看到的多个数组访问)。在此过程中,我意识到传统的求和方法(涉及按降序键顺序访问元素)只会对第一个元素求和。当变量出现在 WHERE
子句中时,我怀疑 MySQL 线性运行通过 tables。这是查询:
SELECT
SUM(1) INTO @garbage
FROM entries
CROSS JOIN (
SELECT @sum:=0,
@n:=@entryid
) a
WHERE id=@n AND @n>0 AND (@n:=@n-(@n&(-@n))) AND (@sum:=@sum+entries.fenwick);
/*SELECT @sum*/
其中 @entryid
是我们正在计算其前缀和的条目的 ID。我确实创建了一个确实有效的查询(以及 returns 整数最左边的函数 lft
):
SET @n:=lft(@entryid);
SET @sum:=0;
SELECT
SUM(1) INTO @garbage
FROM entries
WHERE id=@n
AND @n<=@entryid
AND (@n:=@n+lft(@entryid^@n))
AND (@sum:=@sum+entries.fenwick);
/*SELECT @sum*/
但这只是证实了我对线性搜索的怀疑。 EXPLAIN
查询也是如此:
+------+-------------+---------+------+---------------+------+---------+------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+---------+------+---------------+------+---------+------+--------+-------------+
| 1 | SIMPLE | entries | ALL | NULL | NULL | NULL | NULL | 752544 | Using where |
+------+-------------+---------+------+---------------+------+---------+------+--------+-------------+
1 row in set (0.00 sec)
索引:
SHOW INDEXES FROM entries;
+---------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+---------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| entries | 0 | PRIMARY | 1 | id | NULL | 752544 | NULL | NULL | | HASH | | |
+---------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
1 row in set (0.00 sec)
现在,我看到很多问题询问如何在 WHERE
子句中 消除 变量,以便优化器可以处理查询。但是,我想不出没有 id=@n
这个查询的方法。我考虑过将要求和的条目的键值放入 table 并使用连接,但我相信我会产生不良影响:过多的 table 或线性无论如何,通过对 @entryid
进行评估进行搜索。
问题
有没有任何方法强制MySQL使用这个查询的索引?如果他们提供此功能,我什至会尝试不同的 DBMS。
(因为有格式化选项,所以使用了答案框)。
正如 Rick 所说,引擎是这种情况下的主要问题。您可以通过在索引创建中使用 "USING BTREE" 来影响创建的索引类型(在这种情况下,BTREE 或 HASH 似乎并不重要:您遍历一个范围:然后 BTREE 是最佳的。无论您如何检索它每个值,然后 HASH 是最优的:您的查询具有两种行为)。
当您切换到 INNODB 时,缓存将使查询可能与内存中的查询一样快 table。然后您就可以享受索引的好处。为了保证 BTREE 索引,我将创建如下模式:
CREATE TABLE `entries` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`weight` decimal(9,3) DEFAULT NULL,
`fenwick` decimal(9,3) NOT NULL DEFAULT '0.000',
PRIMARY KEY (`id`)
) ENGINE=INNODB DEFAULT CHARSET=latin1;
CREATE UNIQUE INDEX idx_nn_1 ON entries (id,fenwick) USING BTREE;
这在主要计算中使用了 idx_nn_1 索引(而且只有索引:整个 table 根本没有被使用,因为所有数据都在索引中)。然而,100 条记录的样本量太小,无法给出有关性能的任何明确答案。与仅使用 table 访问的数据相比,构建索引所花费的时间可能会让您根本没有任何性能提升。所以最终的答案将在你的测试中。
其他数据库引擎(SQL Server、Oracle、Postgres):它们会表现出类似的行为。因此,切换到这些引擎中的任何一个都不会产生巨大的差异,除非通常是为了更好的处理(无法预测这一点)。
SQL 服务器在构建索引时可能会更好(=更快),因为那样只会在 id 上使用唯一索引并包含 fenwick 值,因此不必真正索引该值。
Oracle 确实可以强制索引,但是不建议这样做:在 Oracle 中,假设 table 中的有序数据,读取 table 比读取索引然后 table 进行查找。同样在这种情况下,您可以只添加 id、fenwick 索引而永远不会访问 table。考虑到索引创建时间,Oracle 无论如何都必须读取完整的 table 一次,并且在那个时候(或更少取决于达到退出条件需要多少记录)它也将执行您的计算.
Fenwick 树是否足够静态以预先计算一些东西?如果是这样,我可以给你一个几乎 O(1) 的解决方案:
- 构建 2 列 table (n, cumulative_sum)
- 预填充 table: (1, 0.851), (2, 2.574), (3, 2.916), (4, 3.817), ...
- 在 FLOAT 上创建索引
然后查找:
SELECT n FROM tbl WHERE cumulative_sum > 3.325
ORDER BY cumulative_sum LIMIT 1;
如果@variables 有问题,则让存储过程通过CONCAT
、PREPARE
和EXECUTE
.
构造SQL
附录
鉴于它是周期性的总替换,请在重建 table 时计算累计和。我的 SELECT
只查看一行,所以它是 O(1)(忽略 BTree 查找)。
对于"total replacement",我推荐:
CREATE TABLE new LIKE real;
load the data into `new` -- this is the slowest step
RENAME TABLE real TO old, new TO real; -- atomic and fast, so no "downtime"
DROP TABLE old;
免责声明
Fenwick 树对我来说是新的,我是在找到这个 post 时才发现它们的。
这里给出的结果是基于我的理解和一些研究,
但我绝不是芬威克树专家,我可能漏掉了一些东西。
参考资料
芬威克树工作原理说明
转载自
https://cs.stackexchange.com/a/10541/38148
https://cs.stackexchange.com/a/42816/38148
芬威克树的使用
https://en.wikipedia.org/wiki/Fenwick_tree
https://en.wikipedia.org/wiki/Prefix_sum
问题 1,找到给定条目的权重
鉴于以下 table
CREATE TABLE `entries` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`weight` decimal(9,3) DEFAULT NULL,
`fenwick` decimal(9,3) NOT NULL DEFAULT '0.000',
PRIMARY KEY (`id`)
) ENGINE=INNODB;
并且给定数据已经填充(参见 concat 提供的 http://sqlfiddle.com/#!9/be1f2/1),
如何计算给定条目的重量 @entryid
?
这里要理解的关键概念是,fenwick 指数的结构是基于 math 和 bitwise operations 对 id重视自己。
查询通常应仅使用主键查找 (WHERE ID = value
)。
任何使用排序 (ORDER BY
) 或范围 (WHERE (value1 < ID) AND (ID < value2))
的查询都没有抓住要点,并且没有按预期顺序遍历树。
例如,使用键 60:
SET @entryid := 60;
让我们把值60分解成二进制
mysql> SELECT (@entryid & 0x0080) as b8,
-> (@entryid & 0x0040) as b7,
-> (@entryid & 0x0020) as b6,
-> (@entryid & 0x0010) as b5,
-> (@entryid & 0x0008) as b4,
-> (@entryid & 0x0004) as b3,
-> (@entryid & 0x0002) as b2,
-> (@entryid & 0x0001) as b1;
+------+------+------+------+------+------+------+------+
| b8 | b7 | b6 | b5 | b4 | b3 | b2 | b1 |
+------+------+------+------+------+------+------+------+
| 0 | 0 | 32 | 16 | 8 | 4 | 0 | 0 |
+------+------+------+------+------+------+------+------+
1 row in set (0.00 sec)
换句话说,只保留设置的位,我们有
32 + 16 + 8 + 4 = 60
现在,逐一移除最低位以导航树:
32 + 16 + 8 + 4 = 60
32 + 16 + 8 = 56
32 + 16 = 48
32
这给出了访问元素 60 的路径 (32, 48, 56, 60)。
请注意,将 60
转换为 (32, 48, 56, 60)
只需要对 ID 值本身进行位运算:不需要访问 table 或数据库,并且可以完成此计算在发出查询的客户端中。
则60号元素的芬维克权重为
mysql> select sum(fenwick) from entries where id in (32, 48, 56, 60);
+--------------+
| sum(fenwick) |
+--------------+
| 32.434 |
+--------------+
1 row in set (0.00 sec)
验证
mysql> select sum(weight) from entries where id <= @entryid;
+-------------+
| sum(weight) |
+-------------+
| 32.434 |
+-------------+
1 row in set (0.00 sec)
现在,让我们比较一下这些查询的效率。
mysql> explain select sum(fenwick) from entries where id in (32, 48, 56, 60);
+----+-------------+---------+------------+-------+---------------+---------+---------+------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+---------+------------+-------+---------------+---------+---------+------+------+----------+-------------+
| 1 | SIMPLE | entries | NULL | range | PRIMARY | PRIMARY | 4 | NULL | 4 | 100.00 | Using where |
+----+-------------+---------+------------+-------+---------------+---------+---------+------+------+----------+-------------+
或者,以不同的方式呈现
explain format=json select sum(fenwick) from entries where id in (32, 48, 56, 60);
{
"query_block": {
"select_id": 1,
"cost_info": {
"query_cost": "5.61"
},
"table": {
"table_name": "entries",
"access_type": "range",
"possible_keys": [
"PRIMARY"
],
"key": "PRIMARY",
"used_key_parts": [
"id"
],
"key_length": "4",
"rows_examined_per_scan": 4,
"rows_produced_per_join": 4,
"filtered": "100.00",
"cost_info": {
"read_cost": "4.81",
"eval_cost": "0.80",
"prefix_cost": "5.61",
"data_read_per_join": "64"
},
"used_columns": [
"id",
"fenwick"
],
"attached_condition": "(`test`.`entries`.`id` in (32,48,56,60))"
}
}
因此,优化器通过主键获取了 4 行(IN 子句中有 4 个值)。
不使用芬威克指数时,我们有
mysql> explain select sum(weight) from entries where id <= @entryid;
+----+-------------+---------+------------+-------+---------------+---------+---------+------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+---------+------------+-------+---------------+---------+---------+------+------+----------+-------------+
| 1 | SIMPLE | entries | NULL | range | PRIMARY | PRIMARY | 4 | NULL | 60 | 100.00 | Using where |
+----+-------------+---------+------------+-------+---------------+---------+---------+------+------+----------+-------------+
或者,以不同的方式呈现
explain format=json select sum(weight) from entries where id <= @entryid;
{
"query_block": {
"select_id": 1,
"cost_info": {
"query_cost": "25.07"
},
"table": {
"table_name": "entries",
"access_type": "range",
"possible_keys": [
"PRIMARY"
],
"key": "PRIMARY",
"used_key_parts": [
"id"
],
"key_length": "4",
"rows_examined_per_scan": 60,
"rows_produced_per_join": 60,
"filtered": "100.00",
"cost_info": {
"read_cost": "13.07",
"eval_cost": "12.00",
"prefix_cost": "25.07",
"data_read_per_join": "960"
},
"used_columns": [
"id",
"weight"
],
"attached_condition": "(`test`.`entries`.`id` <= (@`entryid`))"
}
}
此处优化器执行索引扫描,读取 60 行。
在 ID=60 的情况下,fenwick 的好处是 4 次获取与 60 次相比。
现在,考虑一下它是如何缩放的,例如值高达 64K。
使用 fenwick,一个 16 位的值将最多设置 16 位,因此要查找的元素数量最多为 16 个。
如果没有 fenwick,一次扫描最多可以读取 64K 个条目(平均将读取 32K 个)。
问题 2,找到给定权重的条目
OP 问题是找到给定权重的条目。
例如
SET @search_weight := 35.123;
为了说明该算法,post 详细说明了查找是如何完成的(抱歉,如果这太冗长)
SET @found_id := 0;
首先,找出有多少条目。
SET @max_id := (select id from entries order by id desc limit 1);
在测试数据中,max_id为156。
因为128 <= max_id < 256,开始查找的最高位是128。
mysql> set @search_id := @found_id + 128;
mysql> select id, fenwick, @search_weight,
-> if (fenwick <= @search_weight, "keep", "discard") as action
-> from entries where id = @search_id;
+-----+---------+----------------+---------+
| id | fenwick | @search_weight | action |
+-----+---------+----------------+---------+
| 128 | 66.540 | 35.123 | discard |
+-----+---------+----------------+---------+
权重66.540大于我们的搜索,所以128被舍弃,继续下一位。
mysql> set @search_id := @found_id + 64;
mysql> select id, fenwick, @search_weight,
-> if (fenwick <= @search_weight, "keep", "discard") as action
-> from entries where id = @search_id;
+----+---------+----------------+--------+
| id | fenwick | @search_weight | action |
+----+---------+----------------+--------+
| 64 | 33.950 | 35.123 | keep |
+----+---------+----------------+--------+
这里需要保留这一位(64),统计找到的权重:
set @found_id := @search_id, @search_weight := @search_weight - 33.950;
然后继续下一段:
mysql> set @search_id := @found_id + 32;
mysql> select id, fenwick, @search_weight,
-> if (fenwick <= @search_weight, "keep", "discard") as action
-> from entries where id = @search_id;
+----+---------+----------------+---------+
| id | fenwick | @search_weight | action |
+----+---------+----------------+---------+
| 96 | 16.260 | 1.173 | discard |
+----+---------+----------------+---------+
mysql> set @search_id := @found_id + 16;
mysql> select id, fenwick, @search_weight,
-> if (fenwick <= @search_weight, "keep", "discard") as action
-> from entries where id = @search_id;
+----+---------+----------------+---------+
| id | fenwick | @search_weight | action |
+----+---------+----------------+---------+
| 80 | 7.394 | 1.173 | discard |
+----+---------+----------------+---------+
mysql> set @search_id := @found_id + 8;
mysql> select id, fenwick, @search_weight,
-> if (fenwick <= @search_weight, "keep", "discard") as action
-> from entries where id = @search_id;
+----+---------+----------------+---------+
| id | fenwick | @search_weight | action |
+----+---------+----------------+---------+
| 72 | 3.995 | 1.173 | discard |
+----+---------+----------------+---------+
mysql> set @search_id := @found_id + 4;
mysql> select id, fenwick, @search_weight,
-> if (fenwick <= @search_weight, "keep", "discard") as action
-> from entries where id = @search_id;
+----+---------+----------------+---------+
| id | fenwick | @search_weight | action |
+----+---------+----------------+---------+
| 68 | 1.915 | 1.173 | discard |
+----+---------+----------------+---------+
mysql> set @search_id := @found_id + 2;
mysql> select id, fenwick, @search_weight,
-> if (fenwick <= @search_weight, "keep", "discard") as action
-> from entries where id = @search_id;
+----+---------+----------------+--------+
| id | fenwick | @search_weight | action |
+----+---------+----------------+--------+
| 66 | 1.146 | 1.173 | keep |
+----+---------+----------------+--------+
我们在这里又发现了一点
set @found_id := @search_id, @search_weight := @search_weight - 1.146;
mysql> set @search_id := @found_id + 1;
mysql> select id, fenwick, @search_weight,
-> if (fenwick <= @search_weight, "keep", "discard") as action
-> from entries where id = @search_id;
+----+---------+----------------+--------+
| id | fenwick | @search_weight | action |
+----+---------+----------------+--------+
| 67 | 0.010 | 0.027 | keep |
+----+---------+----------------+--------+
还有一个
set @found_id := @search_id, @search_weight := @search_weight - 0.010;
最终搜索结果为:
mysql> select @found_id, @search_weight;
+-----------+----------------+
| @found_id | @search_weight |
+-----------+----------------+
| 67 | 0.017 |
+-----------+----------------+
验证
mysql> select sum(weight) from entries where id <= 67;
+-------------+
| sum(weight) |
+-------------+
| 35.106 |
+-------------+
mysql> select sum(weight) from entries where id <= 68;
+-------------+
| sum(weight) |
+-------------+
| 35.865 |
+-------------+
确实,
35.106 (fenwick[67]) <= 35.123 (search) <= 35.865 (fenwick[68])
查找查找值一次解析1位,每个查找结果决定下一个要查找的ID的值。
这里给出的查询是为了说明。在实际应用程序中,代码应该只是一个包含以下内容的循环:
SELECT fenwick from entries where id = ?;
使用应用程序代码(或存储过程)实现与@found_id、@search_id 和@search_weight.
相关的逻辑
一般评论
- Fenwick 树用于前缀计算。如果要解决的问题首先涉及前缀,那么使用这些树才有意义。维基百科有一些应用指南。不幸的是,OP 没有描述 fenwick 树的用途。
- Fenwick 树是查找复杂性和更新复杂性之间的折衷,因此它们只对非静态数据有意义。对于静态数据,计算一次完整前缀会更高效。
- 执行的测试使用了 INNODB table,其主键索引已排序,因此计算 max_id 是一个简单的 O(1) 操作。如果 max_id 已经在其他地方可用,我认为即使在 ID 上使用带有 HASH 索引的 MEMORY table 也可以,因为只使用主键查找。
P.S.
sqlfiddle 今天宕机了,所以 post 使用原始数据(最初由 concat 提供)以便感兴趣的人可以重新 运行 测试。
INSERT INTO `entries` VALUES (1,0.480,0.480),(2,0.542,1.022),(3,0.269,0.269),(4,0.721,2.012),(5,0.798,0.798),(6,0.825,1.623),(7,0.731,0.731),(8,0.181,4.547),(9,0.711,0.711),(10,0.013,0.724),(11,0.930,0.930),(12,0.613,2.267),(13,0.276,0.276),(14,0.539,0.815),(15,0.867,0.867),(16,0.718,9.214),(17,0.991,0.991),(18,0.801,1.792),(19,0.033,0.033),(20,0.759,2.584),(21,0.698,0.698),(22,0.212,0.910),(23,0.965,0.965),(24,0.189,4.648),(25,0.049,0.049),(26,0.678,0.727),(27,0.245,0.245),(28,0.190,1.162),(29,0.214,0.214),(30,0.502,0.716),(31,0.868,0.868),(32,0.834,17.442),(33,0.566,0.566),(34,0.327,0.893),(35,0.939,0.939),(36,0.713,2.545),(37,0.747,0.747),(38,0.595,1.342),(39,0.733,0.733),(40,0.884,5.504),(41,0.218,0.218),(42,0.437,0.655),(43,0.532,0.532),(44,0.350,1.537),(45,0.154,0.154),(46,0.721,0.875),(47,0.140,0.140),(48,0.538,8.594),(49,0.271,0.271),(50,0.739,1.010),(51,0.884,0.884),(52,0.203,2.097),(53,0.361,0.361),(54,0.197,0.558),(55,0.903,0.903),(56,0.923,4.481),(57,0.906,0.906),(58,0.761,1.667),(59,0.089,0.089),(60,0.161,1.917),(61,0.537,0.537),(62,0.201,0.738),(63,0.397,0.397),(64,0.381,33.950),(65,0.715,0.715),(66,0.431,1.146),(67,0.010,0.010),(68,0.759,1.915),(69,0.763,0.763),(70,0.537,1.300),(71,0.399,0.399),(72,0.381,3.995),(73,0.709,0.709),(74,0.401,1.110),(75,0.880,0.880),(76,0.198,2.188),(77,0.348,0.348),(78,0.148,0.496),(79,0.693,0.693),(80,0.022,7.394),(81,0.031,0.031),(82,0.089,0.120),(83,0.353,0.353),(84,0.498,0.971),(85,0.428,0.428),(86,0.650,1.078),(87,0.963,0.963),(88,0.866,3.878),(89,0.442,0.442),(90,0.610,1.052),(91,0.725,0.725),(92,0.797,2.574),(93,0.808,0.808),(94,0.648,1.456),(95,0.817,0.817),(96,0.141,16.260),(97,0.256,0.256),(98,0.855,1.111),(99,0.508,0.508),(100,0.976,2.595),(101,0.353,0.353),(102,0.840,1.193),(103,0.139,0.139),(104,0.178,4.105),(105,0.469,0.469),(106,0.814,1.283),(107,0.664,0.664),(108,0.876,2.823),(109,0.390,0.390),(110,0.323,0.713),(111,0.442,0.442),(112,0.241,8.324),(113,0.881,0.881),(114,0.681,1.562),(115,0.760,0.760),(116,0.760,3.082),(117,0.518,0.518),(118,0.313,0.831),(119,0.008,0.008),(120,0.103,4.024),(121,0.488,0.488),(122,0.135,0.623),(123,0.207,0.207),(124,0.633,1.463),(125,0.542,0.542),(126,0.812,1.354),(127,0.433,0.433),(128,0.732,66.540),(129,0.358,0.358),(130,0.594,0.952),(131,0.897,0.897),(132,0.701,2.550),(133,0.815,0.815),(134,0.973,1.788),(135,0.419,0.419),(136,0.175,4.932),(137,0.620,0.620),(138,0.573,1.193),(139,0.004,0.004),(140,0.304,1.501),(141,0.508,0.508),(142,0.629,1.137),(143,0.618,0.618),(144,0.206,8.394),(145,0.175,0.175),(146,0.255,0.430),(147,0.750,0.750),(148,0.987,2.167),(149,0.683,0.683),(150,0.453,1.136),(151,0.219,0.219),(152,0.734,4.256),(153,0.016,0.016),(154,0.874,0.891),(155,0.325,0.325),(156,0.002,1.217);
P.S。 2
现在有了完整的 sqlfiddle:
补充一下 Marc 的回答,对于存储过程或函数,要求和的索引列表不能直接通过函数参数传递,我们可以在查询中生成索引,JOIN
求和查询:
SELECT SUM(fenwick) FROM entries
CROSS JOIN (SELECT @n:=60) a
INNER JOIN (
SELECT @n AS id, 0
UNION
SELECT
IF(@n&@bit<>0, @n:=@n-(@n&@bit), NULL) AS id,
(@bit:=@bit<<1)
FROM entries
CROSS JOIN (SELECT @bit:=1) a
LIMIT 32
) dt ON dt.id=entries.id;
我希望性能差不多,不再需要客户端生成索引。
上下文
我有一个从 table 中选择加权随机条目的应用程序,(权重的)前缀求和是一个关键部分。简化的 table 定义如下所示:
CREATE TABLE entries (
id INT NOT NULL PRIMARY KEY AUTO_INCREMENT,
weight DECIMAL(9, 3),
fenwick DECIMAL(9, 3)
) ENGINE=MEMORY;
其中 `fenwick`
将值存储在 `weights`
的 Fenwick 树表示中。
让每个条目的 "range" 跨越其前缀和及其前缀和 + 其权重。应用程序必须在 0
和 SUM(weight)
之间生成一个随机数 @r
并找到范围包含 @r
的条目,如下所示:
Fenwick 树与 MEMORY
引擎和二进制搜索相结合,应该允许我在 O(lg^2(n))
时间内找到合适的条目,而不是 O(n)
时间天真的查询:
SELECT a.id-1 FROM (SELECT *, (@x:=@x+weight) AS counter FROM entries
CROSS JOIN (SELECT @x:=0) a
HAVING counter>@r LIMIT 1) a;
研究
由于多个查询的开销,我一直在尝试将前缀和操作压缩到一个查询中(而不是在脚本语言中看到的多个数组访问)。在此过程中,我意识到传统的求和方法(涉及按降序键顺序访问元素)只会对第一个元素求和。当变量出现在 WHERE
子句中时,我怀疑 MySQL 线性运行通过 tables。这是查询:
SELECT
SUM(1) INTO @garbage
FROM entries
CROSS JOIN (
SELECT @sum:=0,
@n:=@entryid
) a
WHERE id=@n AND @n>0 AND (@n:=@n-(@n&(-@n))) AND (@sum:=@sum+entries.fenwick);
/*SELECT @sum*/
其中 @entryid
是我们正在计算其前缀和的条目的 ID。我确实创建了一个确实有效的查询(以及 returns 整数最左边的函数 lft
):
SET @n:=lft(@entryid);
SET @sum:=0;
SELECT
SUM(1) INTO @garbage
FROM entries
WHERE id=@n
AND @n<=@entryid
AND (@n:=@n+lft(@entryid^@n))
AND (@sum:=@sum+entries.fenwick);
/*SELECT @sum*/
但这只是证实了我对线性搜索的怀疑。 EXPLAIN
查询也是如此:
+------+-------------+---------+------+---------------+------+---------+------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+---------+------+---------------+------+---------+------+--------+-------------+
| 1 | SIMPLE | entries | ALL | NULL | NULL | NULL | NULL | 752544 | Using where |
+------+-------------+---------+------+---------------+------+---------+------+--------+-------------+
1 row in set (0.00 sec)
索引:
SHOW INDEXES FROM entries;
+---------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+---------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| entries | 0 | PRIMARY | 1 | id | NULL | 752544 | NULL | NULL | | HASH | | |
+---------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
1 row in set (0.00 sec)
现在,我看到很多问题询问如何在 WHERE
子句中 消除 变量,以便优化器可以处理查询。但是,我想不出没有 id=@n
这个查询的方法。我考虑过将要求和的条目的键值放入 table 并使用连接,但我相信我会产生不良影响:过多的 table 或线性无论如何,通过对 @entryid
进行评估进行搜索。
问题
有没有任何方法强制MySQL使用这个查询的索引?如果他们提供此功能,我什至会尝试不同的 DBMS。
(因为有格式化选项,所以使用了答案框)。
正如 Rick 所说,引擎是这种情况下的主要问题。您可以通过在索引创建中使用 "USING BTREE" 来影响创建的索引类型(在这种情况下,BTREE 或 HASH 似乎并不重要:您遍历一个范围:然后 BTREE 是最佳的。无论您如何检索它每个值,然后 HASH 是最优的:您的查询具有两种行为)。
当您切换到 INNODB 时,缓存将使查询可能与内存中的查询一样快 table。然后您就可以享受索引的好处。为了保证 BTREE 索引,我将创建如下模式:
CREATE TABLE `entries` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`weight` decimal(9,3) DEFAULT NULL,
`fenwick` decimal(9,3) NOT NULL DEFAULT '0.000',
PRIMARY KEY (`id`)
) ENGINE=INNODB DEFAULT CHARSET=latin1;
CREATE UNIQUE INDEX idx_nn_1 ON entries (id,fenwick) USING BTREE;
这在主要计算中使用了 idx_nn_1 索引(而且只有索引:整个 table 根本没有被使用,因为所有数据都在索引中)。然而,100 条记录的样本量太小,无法给出有关性能的任何明确答案。与仅使用 table 访问的数据相比,构建索引所花费的时间可能会让您根本没有任何性能提升。所以最终的答案将在你的测试中。
其他数据库引擎(SQL Server、Oracle、Postgres):它们会表现出类似的行为。因此,切换到这些引擎中的任何一个都不会产生巨大的差异,除非通常是为了更好的处理(无法预测这一点)。
SQL 服务器在构建索引时可能会更好(=更快),因为那样只会在 id 上使用唯一索引并包含 fenwick 值,因此不必真正索引该值。
Oracle 确实可以强制索引,但是不建议这样做:在 Oracle 中,假设 table 中的有序数据,读取 table 比读取索引然后 table 进行查找。同样在这种情况下,您可以只添加 id、fenwick 索引而永远不会访问 table。考虑到索引创建时间,Oracle 无论如何都必须读取完整的 table 一次,并且在那个时候(或更少取决于达到退出条件需要多少记录)它也将执行您的计算.
Fenwick 树是否足够静态以预先计算一些东西?如果是这样,我可以给你一个几乎 O(1) 的解决方案:
- 构建 2 列 table (n, cumulative_sum)
- 预填充 table: (1, 0.851), (2, 2.574), (3, 2.916), (4, 3.817), ...
- 在 FLOAT 上创建索引
然后查找:
SELECT n FROM tbl WHERE cumulative_sum > 3.325
ORDER BY cumulative_sum LIMIT 1;
如果@variables 有问题,则让存储过程通过CONCAT
、PREPARE
和EXECUTE
.
附录
鉴于它是周期性的总替换,请在重建 table 时计算累计和。我的 SELECT
只查看一行,所以它是 O(1)(忽略 BTree 查找)。
对于"total replacement",我推荐:
CREATE TABLE new LIKE real;
load the data into `new` -- this is the slowest step
RENAME TABLE real TO old, new TO real; -- atomic and fast, so no "downtime"
DROP TABLE old;
免责声明
Fenwick 树对我来说是新的,我是在找到这个 post 时才发现它们的。 这里给出的结果是基于我的理解和一些研究, 但我绝不是芬威克树专家,我可能漏掉了一些东西。
参考资料
芬威克树工作原理说明
转载自 https://cs.stackexchange.com/a/10541/38148
https://cs.stackexchange.com/a/42816/38148
芬威克树的使用
https://en.wikipedia.org/wiki/Fenwick_tree
https://en.wikipedia.org/wiki/Prefix_sum
问题 1,找到给定条目的权重
鉴于以下 table
CREATE TABLE `entries` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`weight` decimal(9,3) DEFAULT NULL,
`fenwick` decimal(9,3) NOT NULL DEFAULT '0.000',
PRIMARY KEY (`id`)
) ENGINE=INNODB;
并且给定数据已经填充(参见 concat 提供的 http://sqlfiddle.com/#!9/be1f2/1),
如何计算给定条目的重量 @entryid
?
这里要理解的关键概念是,fenwick 指数的结构是基于 math 和 bitwise operations 对 id重视自己。
查询通常应仅使用主键查找 (WHERE ID = value
)。
任何使用排序 (ORDER BY
) 或范围 (WHERE (value1 < ID) AND (ID < value2))
的查询都没有抓住要点,并且没有按预期顺序遍历树。
例如,使用键 60:
SET @entryid := 60;
让我们把值60分解成二进制
mysql> SELECT (@entryid & 0x0080) as b8,
-> (@entryid & 0x0040) as b7,
-> (@entryid & 0x0020) as b6,
-> (@entryid & 0x0010) as b5,
-> (@entryid & 0x0008) as b4,
-> (@entryid & 0x0004) as b3,
-> (@entryid & 0x0002) as b2,
-> (@entryid & 0x0001) as b1;
+------+------+------+------+------+------+------+------+
| b8 | b7 | b6 | b5 | b4 | b3 | b2 | b1 |
+------+------+------+------+------+------+------+------+
| 0 | 0 | 32 | 16 | 8 | 4 | 0 | 0 |
+------+------+------+------+------+------+------+------+
1 row in set (0.00 sec)
换句话说,只保留设置的位,我们有
32 + 16 + 8 + 4 = 60
现在,逐一移除最低位以导航树:
32 + 16 + 8 + 4 = 60
32 + 16 + 8 = 56
32 + 16 = 48
32
这给出了访问元素 60 的路径 (32, 48, 56, 60)。
请注意,将 60
转换为 (32, 48, 56, 60)
只需要对 ID 值本身进行位运算:不需要访问 table 或数据库,并且可以完成此计算在发出查询的客户端中。
则60号元素的芬维克权重为
mysql> select sum(fenwick) from entries where id in (32, 48, 56, 60);
+--------------+
| sum(fenwick) |
+--------------+
| 32.434 |
+--------------+
1 row in set (0.00 sec)
验证
mysql> select sum(weight) from entries where id <= @entryid;
+-------------+
| sum(weight) |
+-------------+
| 32.434 |
+-------------+
1 row in set (0.00 sec)
现在,让我们比较一下这些查询的效率。
mysql> explain select sum(fenwick) from entries where id in (32, 48, 56, 60);
+----+-------------+---------+------------+-------+---------------+---------+---------+------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+---------+------------+-------+---------------+---------+---------+------+------+----------+-------------+
| 1 | SIMPLE | entries | NULL | range | PRIMARY | PRIMARY | 4 | NULL | 4 | 100.00 | Using where |
+----+-------------+---------+------------+-------+---------------+---------+---------+------+------+----------+-------------+
或者,以不同的方式呈现
explain format=json select sum(fenwick) from entries where id in (32, 48, 56, 60);
{
"query_block": {
"select_id": 1,
"cost_info": {
"query_cost": "5.61"
},
"table": {
"table_name": "entries",
"access_type": "range",
"possible_keys": [
"PRIMARY"
],
"key": "PRIMARY",
"used_key_parts": [
"id"
],
"key_length": "4",
"rows_examined_per_scan": 4,
"rows_produced_per_join": 4,
"filtered": "100.00",
"cost_info": {
"read_cost": "4.81",
"eval_cost": "0.80",
"prefix_cost": "5.61",
"data_read_per_join": "64"
},
"used_columns": [
"id",
"fenwick"
],
"attached_condition": "(`test`.`entries`.`id` in (32,48,56,60))"
}
}
因此,优化器通过主键获取了 4 行(IN 子句中有 4 个值)。
不使用芬威克指数时,我们有
mysql> explain select sum(weight) from entries where id <= @entryid;
+----+-------------+---------+------------+-------+---------------+---------+---------+------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+---------+------------+-------+---------------+---------+---------+------+------+----------+-------------+
| 1 | SIMPLE | entries | NULL | range | PRIMARY | PRIMARY | 4 | NULL | 60 | 100.00 | Using where |
+----+-------------+---------+------------+-------+---------------+---------+---------+------+------+----------+-------------+
或者,以不同的方式呈现
explain format=json select sum(weight) from entries where id <= @entryid;
{
"query_block": {
"select_id": 1,
"cost_info": {
"query_cost": "25.07"
},
"table": {
"table_name": "entries",
"access_type": "range",
"possible_keys": [
"PRIMARY"
],
"key": "PRIMARY",
"used_key_parts": [
"id"
],
"key_length": "4",
"rows_examined_per_scan": 60,
"rows_produced_per_join": 60,
"filtered": "100.00",
"cost_info": {
"read_cost": "13.07",
"eval_cost": "12.00",
"prefix_cost": "25.07",
"data_read_per_join": "960"
},
"used_columns": [
"id",
"weight"
],
"attached_condition": "(`test`.`entries`.`id` <= (@`entryid`))"
}
}
此处优化器执行索引扫描,读取 60 行。
在 ID=60 的情况下,fenwick 的好处是 4 次获取与 60 次相比。
现在,考虑一下它是如何缩放的,例如值高达 64K。
使用 fenwick,一个 16 位的值将最多设置 16 位,因此要查找的元素数量最多为 16 个。
如果没有 fenwick,一次扫描最多可以读取 64K 个条目(平均将读取 32K 个)。
问题 2,找到给定权重的条目
OP 问题是找到给定权重的条目。
例如
SET @search_weight := 35.123;
为了说明该算法,post 详细说明了查找是如何完成的(抱歉,如果这太冗长)
SET @found_id := 0;
首先,找出有多少条目。
SET @max_id := (select id from entries order by id desc limit 1);
在测试数据中,max_id为156。
因为128 <= max_id < 256,开始查找的最高位是128。
mysql> set @search_id := @found_id + 128;
mysql> select id, fenwick, @search_weight,
-> if (fenwick <= @search_weight, "keep", "discard") as action
-> from entries where id = @search_id;
+-----+---------+----------------+---------+
| id | fenwick | @search_weight | action |
+-----+---------+----------------+---------+
| 128 | 66.540 | 35.123 | discard |
+-----+---------+----------------+---------+
权重66.540大于我们的搜索,所以128被舍弃,继续下一位。
mysql> set @search_id := @found_id + 64;
mysql> select id, fenwick, @search_weight,
-> if (fenwick <= @search_weight, "keep", "discard") as action
-> from entries where id = @search_id;
+----+---------+----------------+--------+
| id | fenwick | @search_weight | action |
+----+---------+----------------+--------+
| 64 | 33.950 | 35.123 | keep |
+----+---------+----------------+--------+
这里需要保留这一位(64),统计找到的权重:
set @found_id := @search_id, @search_weight := @search_weight - 33.950;
然后继续下一段:
mysql> set @search_id := @found_id + 32;
mysql> select id, fenwick, @search_weight,
-> if (fenwick <= @search_weight, "keep", "discard") as action
-> from entries where id = @search_id;
+----+---------+----------------+---------+
| id | fenwick | @search_weight | action |
+----+---------+----------------+---------+
| 96 | 16.260 | 1.173 | discard |
+----+---------+----------------+---------+
mysql> set @search_id := @found_id + 16;
mysql> select id, fenwick, @search_weight,
-> if (fenwick <= @search_weight, "keep", "discard") as action
-> from entries where id = @search_id;
+----+---------+----------------+---------+
| id | fenwick | @search_weight | action |
+----+---------+----------------+---------+
| 80 | 7.394 | 1.173 | discard |
+----+---------+----------------+---------+
mysql> set @search_id := @found_id + 8;
mysql> select id, fenwick, @search_weight,
-> if (fenwick <= @search_weight, "keep", "discard") as action
-> from entries where id = @search_id;
+----+---------+----------------+---------+
| id | fenwick | @search_weight | action |
+----+---------+----------------+---------+
| 72 | 3.995 | 1.173 | discard |
+----+---------+----------------+---------+
mysql> set @search_id := @found_id + 4;
mysql> select id, fenwick, @search_weight,
-> if (fenwick <= @search_weight, "keep", "discard") as action
-> from entries where id = @search_id;
+----+---------+----------------+---------+
| id | fenwick | @search_weight | action |
+----+---------+----------------+---------+
| 68 | 1.915 | 1.173 | discard |
+----+---------+----------------+---------+
mysql> set @search_id := @found_id + 2;
mysql> select id, fenwick, @search_weight,
-> if (fenwick <= @search_weight, "keep", "discard") as action
-> from entries where id = @search_id;
+----+---------+----------------+--------+
| id | fenwick | @search_weight | action |
+----+---------+----------------+--------+
| 66 | 1.146 | 1.173 | keep |
+----+---------+----------------+--------+
我们在这里又发现了一点
set @found_id := @search_id, @search_weight := @search_weight - 1.146;
mysql> set @search_id := @found_id + 1;
mysql> select id, fenwick, @search_weight,
-> if (fenwick <= @search_weight, "keep", "discard") as action
-> from entries where id = @search_id;
+----+---------+----------------+--------+
| id | fenwick | @search_weight | action |
+----+---------+----------------+--------+
| 67 | 0.010 | 0.027 | keep |
+----+---------+----------------+--------+
还有一个
set @found_id := @search_id, @search_weight := @search_weight - 0.010;
最终搜索结果为:
mysql> select @found_id, @search_weight;
+-----------+----------------+
| @found_id | @search_weight |
+-----------+----------------+
| 67 | 0.017 |
+-----------+----------------+
验证
mysql> select sum(weight) from entries where id <= 67;
+-------------+
| sum(weight) |
+-------------+
| 35.106 |
+-------------+
mysql> select sum(weight) from entries where id <= 68;
+-------------+
| sum(weight) |
+-------------+
| 35.865 |
+-------------+
确实,
35.106 (fenwick[67]) <= 35.123 (search) <= 35.865 (fenwick[68])
查找查找值一次解析1位,每个查找结果决定下一个要查找的ID的值。
这里给出的查询是为了说明。在实际应用程序中,代码应该只是一个包含以下内容的循环:
SELECT fenwick from entries where id = ?;
使用应用程序代码(或存储过程)实现与@found_id、@search_id 和@search_weight.
相关的逻辑一般评论
- Fenwick 树用于前缀计算。如果要解决的问题首先涉及前缀,那么使用这些树才有意义。维基百科有一些应用指南。不幸的是,OP 没有描述 fenwick 树的用途。
- Fenwick 树是查找复杂性和更新复杂性之间的折衷,因此它们只对非静态数据有意义。对于静态数据,计算一次完整前缀会更高效。
- 执行的测试使用了 INNODB table,其主键索引已排序,因此计算 max_id 是一个简单的 O(1) 操作。如果 max_id 已经在其他地方可用,我认为即使在 ID 上使用带有 HASH 索引的 MEMORY table 也可以,因为只使用主键查找。
P.S.
sqlfiddle 今天宕机了,所以 post 使用原始数据(最初由 concat 提供)以便感兴趣的人可以重新 运行 测试。
INSERT INTO `entries` VALUES (1,0.480,0.480),(2,0.542,1.022),(3,0.269,0.269),(4,0.721,2.012),(5,0.798,0.798),(6,0.825,1.623),(7,0.731,0.731),(8,0.181,4.547),(9,0.711,0.711),(10,0.013,0.724),(11,0.930,0.930),(12,0.613,2.267),(13,0.276,0.276),(14,0.539,0.815),(15,0.867,0.867),(16,0.718,9.214),(17,0.991,0.991),(18,0.801,1.792),(19,0.033,0.033),(20,0.759,2.584),(21,0.698,0.698),(22,0.212,0.910),(23,0.965,0.965),(24,0.189,4.648),(25,0.049,0.049),(26,0.678,0.727),(27,0.245,0.245),(28,0.190,1.162),(29,0.214,0.214),(30,0.502,0.716),(31,0.868,0.868),(32,0.834,17.442),(33,0.566,0.566),(34,0.327,0.893),(35,0.939,0.939),(36,0.713,2.545),(37,0.747,0.747),(38,0.595,1.342),(39,0.733,0.733),(40,0.884,5.504),(41,0.218,0.218),(42,0.437,0.655),(43,0.532,0.532),(44,0.350,1.537),(45,0.154,0.154),(46,0.721,0.875),(47,0.140,0.140),(48,0.538,8.594),(49,0.271,0.271),(50,0.739,1.010),(51,0.884,0.884),(52,0.203,2.097),(53,0.361,0.361),(54,0.197,0.558),(55,0.903,0.903),(56,0.923,4.481),(57,0.906,0.906),(58,0.761,1.667),(59,0.089,0.089),(60,0.161,1.917),(61,0.537,0.537),(62,0.201,0.738),(63,0.397,0.397),(64,0.381,33.950),(65,0.715,0.715),(66,0.431,1.146),(67,0.010,0.010),(68,0.759,1.915),(69,0.763,0.763),(70,0.537,1.300),(71,0.399,0.399),(72,0.381,3.995),(73,0.709,0.709),(74,0.401,1.110),(75,0.880,0.880),(76,0.198,2.188),(77,0.348,0.348),(78,0.148,0.496),(79,0.693,0.693),(80,0.022,7.394),(81,0.031,0.031),(82,0.089,0.120),(83,0.353,0.353),(84,0.498,0.971),(85,0.428,0.428),(86,0.650,1.078),(87,0.963,0.963),(88,0.866,3.878),(89,0.442,0.442),(90,0.610,1.052),(91,0.725,0.725),(92,0.797,2.574),(93,0.808,0.808),(94,0.648,1.456),(95,0.817,0.817),(96,0.141,16.260),(97,0.256,0.256),(98,0.855,1.111),(99,0.508,0.508),(100,0.976,2.595),(101,0.353,0.353),(102,0.840,1.193),(103,0.139,0.139),(104,0.178,4.105),(105,0.469,0.469),(106,0.814,1.283),(107,0.664,0.664),(108,0.876,2.823),(109,0.390,0.390),(110,0.323,0.713),(111,0.442,0.442),(112,0.241,8.324),(113,0.881,0.881),(114,0.681,1.562),(115,0.760,0.760),(116,0.760,3.082),(117,0.518,0.518),(118,0.313,0.831),(119,0.008,0.008),(120,0.103,4.024),(121,0.488,0.488),(122,0.135,0.623),(123,0.207,0.207),(124,0.633,1.463),(125,0.542,0.542),(126,0.812,1.354),(127,0.433,0.433),(128,0.732,66.540),(129,0.358,0.358),(130,0.594,0.952),(131,0.897,0.897),(132,0.701,2.550),(133,0.815,0.815),(134,0.973,1.788),(135,0.419,0.419),(136,0.175,4.932),(137,0.620,0.620),(138,0.573,1.193),(139,0.004,0.004),(140,0.304,1.501),(141,0.508,0.508),(142,0.629,1.137),(143,0.618,0.618),(144,0.206,8.394),(145,0.175,0.175),(146,0.255,0.430),(147,0.750,0.750),(148,0.987,2.167),(149,0.683,0.683),(150,0.453,1.136),(151,0.219,0.219),(152,0.734,4.256),(153,0.016,0.016),(154,0.874,0.891),(155,0.325,0.325),(156,0.002,1.217);
P.S。 2
现在有了完整的 sqlfiddle:
补充一下 Marc 的回答,对于存储过程或函数,要求和的索引列表不能直接通过函数参数传递,我们可以在查询中生成索引,JOIN
求和查询:
SELECT SUM(fenwick) FROM entries
CROSS JOIN (SELECT @n:=60) a
INNER JOIN (
SELECT @n AS id, 0
UNION
SELECT
IF(@n&@bit<>0, @n:=@n-(@n&@bit), NULL) AS id,
(@bit:=@bit<<1)
FROM entries
CROSS JOIN (SELECT @bit:=1) a
LIMIT 32
) dt ON dt.id=entries.id;
我希望性能差不多,不再需要客户端生成索引。