Oracle 数据库上字符串列的恒定时间索引

Question

我有一个订单 table。 table属于多租户应用，所以同一个table中有多个商户的订单。 table 存储了数亿条记录。这个问题有两个相关的栏目：

MerchantID，一个整数存储商户的唯一ID
TransactionID，一个标识交易的字符串

我想知道是否有一个高效的索引可以做到以下几点：

对每个 商户 ID 的 交易 ID 实施唯一约束。应在 恒定时间 .
执行恒定时间查询，涉及精确匹配两列（例如，SELECT * FROM <table> WHERE TransactionID = 'ff089f89feaac87b98a' AND MerchantID = 24）

更多信息：

我正在使用 Oracle 11g。也许 this Oracle article 与我的问题相关？
我无法更改列的数据类型。
constant time 表示索引在 O(1) 时间复杂度内执行。就像一个哈希图。

Answer 1

哈希簇可以提供 O(1) 的访问时间，但不能提供 O(1) 的约束执行时间。然而，在实践中，哈希簇的恒定访问时间比常规 b 树索引的 O(log N) 访问时间更差。此外，集群更难配置，并且对于某些操作而言不能很好地扩展。

创建哈希集群

drop table orders_cluster;
drop cluster cluster1;

create cluster cluster1
(
    MerchantID number,
    TransactionID varchar2(20)
)
single table hashkeys 10000; --This number is important, choose wisely!

create table orders_cluster
(
    id number,
    MerchantID number,
    TransactionID varchar2(20)
) cluster cluster1(merchantid, transactionid);

--Add 1 million rows.  20 seconds.
begin
    for i in 1 .. 10 loop
        insert into orders_cluster
        select rownum + i * 100000, mod(level, 100)+ i * 100000, level
        from dual connect by level <= 100000;
        commit;
    end loop;
end;
/

create unique index orders_cluster_idx on orders_cluster(merchantid, transactionid);

begin
    dbms_stats.gather_table_stats(user, 'ORDERS_CLUSTER');
end;
/

创建常规Table（用于比较）

drop table orders_table;

create table orders_table
(
    id number,
    MerchantID number,
    TransactionID varchar2(20)
) nologging;

--Add 1 million rows.  2 seconds.
begin
    for i in 1 .. 10 loop
        insert into orders_table
        select rownum + i * 100000, mod(level, 100)+ i * 100000, level
        from dual connect by level <= 100000;
        commit;
    end loop;
end;
/

create unique index orders_table_idx on orders_table(merchantid, transactionid);

begin
    dbms_stats.gather_table_stats(user, 'ORDERS_TABLE');
end;
/

跟踪示例

SQL*Plus Autotrace 是一种快速查找解释计划并跟踪每个语句 I/O activity 的方法。 I/O 请求的数量被标记为 "consistent gets" 并且是衡量已完成工作量的一种不错的方式。此代码演示了如何为其他部分生成数字。查询通常需要运行不止一次来预热。

SQL> set autotrace on;
SQL> select * from orders_cluster where merchantid = 100001 and transactionid = '2';

no rows selected


Execution Plan
----------------------------------------------------------
Plan hash value: 621801084

------------------------------------------------------------------------------------
| Id  | Operation         | Name           | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |                |     1 |    16 |     1   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS HASH| ORDERS_CLUSTER |     1 |    16 |     1   (0)| 00:00:01 |
------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("MERCHANTID"=100001 AND "TRANSACTIONID"='2')


Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
         31  consistent gets
          0  physical reads
          0  redo size
        485  bytes sent via SQL*Net to client
        540  bytes received via SQL*Net from client
          1  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          0  rows processed

SQL>

找到最佳哈希键，权衡取舍

为了获得最佳的读取性能，所有的哈希冲突都应该放在一个块中（所有 Oracle I/O 都是在每个块中完成的，通常是 8K）。获得理想的正确存储是很棘手的，需要知道哈希算法、存储大小（与块大小不同）和哈希键（桶）的数量。 Oracle 有一个默认的算法和大小，因此可以只关注一个属性，即哈希键的数量。

更多的散列键导致更少的冲突。这对 TABLE ACCESS HASH 性能有好处，因为只有一个块要读取。以下是不同哈希键大小的一致获取数。为了进行比较，还包括索引访问。有了足够的哈希键，块的数量就会减少到最佳数量 1。

Method          Consistent Gets (for transactionid = 1, 20, 300, 4000, and 50000)
Index           4,  3,  3,  3,  3
Hashkeys 100    1, 31, 31, 31, 31
Hashkeys 1000   1,  3,  4,  4,  4
Hashkeys 10000  1,  1,  1,  1,  1

更多的散列键也会导致更多的存储桶，更多的浪费 space，以及更慢的 TABLE ACCESS FULL 操作。

Table type      Space in MB
HeapTable       24MB
Hashkeys 100    26MB
hashkeys 1000   30MB
hashkeys 10000  81MB

要重现我的结果，请使用类似 select * from orders_cluster where merchantid = 100001 and transactionid = '1'; 的示例查询并将最后一个值更改为 1、20、300、4000 和 50000。

性能比较

一致的获取是预测table 并且易于衡量，但在一天结束时只有挂钟时间很重要。令人惊讶的是，索引访问多了 4 倍 consistent gets 仍然比最优哈希集群场景更快。

--3.5 seconds for b-tree access.
declare
    v_count number;
begin
    for i in 1 .. 100000 loop
        select count(*)
        into v_count
        from orders_table
        where merchantid = 100000 and transactionid = '1';
    end loop;
end;
/

--3.8 seconds for hash cluster access.
declare
    v_count number;
begin
    for i in 1 .. 100000 loop
        select count(*)
        into v_count
        from orders_cluster
        where merchantid = 100000 and transactionid = '1';
    end loop;
end;
/

我也尝试了使用变量谓词的测试，但结果相似。

是否可以扩展？

不，散列集群无法扩展。尽管 TABLE ACCESS HASH 的时间复杂度为 O(1)，而 INDEX UNIQUE SCAN 的时间复杂度为 O(log n)，但哈希簇似乎永远不会胜过 b 树索引。

我用 1000 万行尝试了上面的示例代码。哈希集群的加载速度非常慢，并且在 SELECT 性能上仍然低于索引。我试图将其扩展到 1 亿行，但插入需要 11 天。

好消息是 b*trees 的扩展性很好。将 1 亿行添加到上面的示例只需要索引中的 3 个级别。我查看了大型数据库环境（数百个数据库和 1 PB 数据）的所有 DBA_INDEXES - 最差的索引只有 7 个级别。那是 VARCHAR2(4000) 列的病态索引。在大多数情况下，无论 table 大小如何，您的 b-tree 索引都将保持浅表。

在这种情况下，O(log n) 优于 O(1)。

但是为什么？

糟糕的散列集群性能可能是 Oracle 试图简化事情并隐藏使散列集群正常工作所需的细节的牺牲品。集群很难正确设置和使用，而且很少能提供显着的好处。在过去的几十年里，甲骨文并没有在他们身上投入太多的精力。

评论者认为简单的 b 树索引是最好的是正确的。但为什么这应该是真的并不明显，考虑数据库中使用的算法是件好事。

Oracle 数据库上字符串列的恒定时间索引

Constant-time index for string column on Oracle database

oracle

database-design

indexing

hashmap

database-performance