Rowstore index on Clustered Columnstore - cardinality estimation mistake?

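This one has me stumped. I have a dimension table with roughly 30 million rows in it. It is a clustered columnstore. In addition, the table has a PRIMARY KEY constraint on its surrogate key, which is of type INT.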

A query to retrieve the MIN() of the surrogate key for a given date range looks like this:

SELECT
    MIN(DIM.OrderId)
FROM
    dbo.Dim_Order AS DIM
WHERE
    DIM.OrderDate >= CAST('2016-06-01' AS DATE)
    AND DIM.OrderDate < CAST('2016-07-01' AS DATE)
OPTION (MAXDOP 1);

Here is the output:

Table 'Dim_Order'. Scan count 2, logical reads 833, physical reads 0, read-ahead reads 0, lob logical reads 1702561, lob physical reads 0, lob read-ahead reads 0.

Table 'Dim_Order'. Segment reads 304001, segment skipped 0.

(1 row affected)

SQL Server Execution Times: CPU time = 2829 ms, elapsed time = 2876 ms.

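The optimizer chooses to use the nonclustered primary key and perform key lookups through a nested loops join, instead of using the columnstore. Worse, it badly underestimates the number of rows returned.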

Curiously, the row estimate appears to be inversely proportional to the size of the date range.

╔════════════╦══════════════════════════╗
║ Date Range ║ Estimated Number of Rows ║
╠════════════╬══════════════════════════╣
║ 1 year     ║ 2.00311                  ║
║ 6 months   ║ 3.41584                  ║
║ 1 month    ║ 24.4459                  ║
║ 2 weeks    ║ 52.093                   ║
║ 1 week     ║ 99.9055                  ║
║ 3 days     ║ 217.632                  ║
║ 1 day      ║ 1088.16                  ║
╚════════════╩══════════════════════════╝

This version, with an INDEX hint, runs almost instantly:

SELECT
    MIN(DIM.OrderId)
FROM
    dbo.Dim_Order AS DIM WITH(INDEX=CCI_Dim_Order)
WHERE
    DIM.OrderDate >= CAST('2016-06-01' AS DATE)
    AND DIM.OrderDate < CAST('2016-07-01' AS DATE)
OPTION (MAXDOP 1);

Table 'Dim_Order'. Scan count 1, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 1004, lob physical reads 0, lob read-ahead reads 0.

Table 'Dim_Order'. Segment reads 2, segment skipped 0.

(1 row affected)

SQL Server Execution Times: CPU time = 0 ms, elapsed time = 1 ms.

I have observed this behavior on the following versions:

Microsoft SQL Server 2016 (RTM) - 13.0.1601.5 (X64)

Microsoft SQL Server 2016 (SP1-CU5) (KB4040714) - 13.0.4451.0 (X64)

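The repro script below creates a sample table and populates it with 2 years of orders, one order per day for each of 2,000 customers. That works out to 1,462,000 sample orders in the table, spanning 24 months, with roughly 60,000 rows per month. The sample queries at the bottom of the script are meant to demonstrate the behavior. As you will see, for some reason the row estimates are very low, and the optimizer refuses to use the clustered columnstore unless hinted.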

Thanks for any comments or suggestions on this. Here is the sample script.

DROP TABLE IF EXISTS dbo.Dim_Order;

CREATE TABLE dbo.Dim_Order
    (
    OrderId INT NOT NULL
    , CustomerId INT NOT NULL
    , OrderDate DATE NOT NULL
    , OrderTotal decimal(5,2) NOT NULL
    );

WITH CTE_DATE AS
(
SELECT CAST('2016-01-01' AS DATE) AS DateValue
UNION ALL
SELECT
       DATEADD(DAY, 1, D.DateValue)
FROM
       CTE_DATE AS D
WHERE
       D.DateValue < CAST('2017-12-31' AS DATE)
),
CTE_CUSTOMER AS
(
SELECT 1 AS CustomerId
UNION ALL
SELECT
       CustomerId + 1
FROM
       CTE_CUSTOMER AS D
WHERE
       D.CustomerId < 2000
)
, CTE_FINAL
AS
(
SELECT
    ROW_NUMBER() OVER (ORDER BY DateValue ASC, CustomerId ASC) AS OrderId
    , CustomerId
    , DateValue AS OrderDate
    , CAST(ROUND(RAND(CHECKSUM(NEWID()))*(100-1)+1, 2) AS DECIMAL(5,2)) AS OrderTotal
FROM
    CTE_DATE
    CROSS JOIN CTE_CUSTOMER
)
INSERT INTO
    dbo.Dim_Order
    (
    OrderId
    , CustomerId
    , OrderDate
    , OrderTotal
    )
SELECT
    ORD.OrderId
    , ORD.CustomerId
    , ORD.OrderDate
    , ORD.OrderTotal
FROM
    CTE_FINAL AS ORD
OPTION (MAXRECURSION 32767);

CREATE CLUSTERED COLUMNSTORE INDEX CCI_Dim_Order ON dbo.Dim_Order;

ALTER INDEX CCI_Dim_Order ON dbo.Dim_Order
    REORGANIZE
    WITH (COMPRESS_ALL_ROW_GROUPS = ON);

ALTER TABLE dbo.Dim_Order
    ADD CONSTRAINT PK_Dim_Order PRIMARY KEY NONCLUSTERED (OrderId ASC);

RETURN;

SET STATISTICS IO ON
SET STATISTICS TIME ON

SELECT
    MIN(DIM.OrderId)
FROM
    dbo.Dim_Order AS DIM
WHERE
    DIM.OrderDate >= CAST('2016-06-01' AS DATE)
    AND DIM.OrderDate < CAST('2016-07-01' AS DATE)
OPTION (MAXDOP 1);

SELECT
    MIN(DIM.OrderId)
FROM
    dbo.Dim_Order AS DIM WITH(INDEX=CCI_Dim_Order)
WHERE
    DIM.OrderDate >= CAST('2016-06-01' AS DATE)
    AND DIM.OrderDate < CAST('2016-07-01' AS DATE)
OPTION (MAXDOP 1);

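This is a classic row goal cardinality estimation issue. You can add USE HINT ('DISABLE_OPTIMIZER_ROWGOAL') to disable row goals, and you should find that the clustered columnstore is now costed lower and gets chosen.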
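
As a sketch of what that looks like against the repro table (the query is the one from the question, only the OPTION clause changes): USE HINT is available from SQL Server 2016 SP1 onward, so on the RTM build the closest equivalent I know of is trace flag 4138 via QUERYTRACEON, which typically requires sysadmin membership. I have not run either variant against the 30 million row table, so treat the outcome as something to verify.

```sql
-- SQL Server 2016 SP1 and later: disable row goals for this query only.
SELECT
    MIN(DIM.OrderId)
FROM
    dbo.Dim_Order AS DIM
WHERE
    DIM.OrderDate >= CAST('2016-06-01' AS DATE)
    AND DIM.OrderDate < CAST('2016-07-01' AS DATE)
OPTION (MAXDOP 1, USE HINT ('DISABLE_OPTIMIZER_ROWGOAL'));

-- SQL Server 2016 RTM (no USE HINT support): trace flag 4138
-- disables row goals in the same way.
SELECT
    MIN(DIM.OrderId)
FROM
    dbo.Dim_Order AS DIM
WHERE
    DIM.OrderDate >= CAST('2016-06-01' AS DATE)
    AND DIM.OrderDate < CAST('2016-07-01' AS DATE)
OPTION (MAXDOP 1, QUERYTRACEON 4138);
```

Unlike the INDEX hint, this leaves the optimizer free to pick whichever access path it costs lowest once the row goal is out of the picture.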

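The plan has an ordered scan on PK_Dim_Order. Because it processes rows in OrderId order and is looking for MIN(DIM.OrderId), it can stop as soon as it finds the first row matching the predicate on OrderDate. It assumes that the 60,000 rows matching the month predicate are spread evenly throughout the index. In fact, they all sit in one contiguous range, with IDs from 304,001 to 364,000.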

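This assumption of no correlation is also why the estimated number of rows drops as the date range gets larger. If you double the number of rows matching the date predicate, and they really were evenly distributed through the index, you would only need to read half as many rows before hitting one that matches the predicate and stopping the scan.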
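
To see the gap between that assumption and the data, a quick check against the repro table can help. This is only a sketch: the variable @TotalRows and the column aliases are names I made up for illustration, and the INDEX hint is there just so the check itself avoids the slow plan.

```sql
-- Total rows in the table (1,462,000 for the repro script).
DECLARE @TotalRows BIGINT = (SELECT COUNT_BIG(*) FROM dbo.Dim_Order);

SELECT
    COUNT_BIG(*) AS MatchingRows
    , MIN(DIM.OrderId) AS FirstMatchingOrderId
    , MAX(DIM.OrderId) AS LastMatchingOrderId
    -- Rows an ordered scan would expect to read before its first hit
    -- if the matching rows were spread evenly through the index.
    , @TotalRows * 1.0 / NULLIF(COUNT_BIG(*), 0) AS RowsReadIfEvenlySpread
FROM
    dbo.Dim_Order AS DIM WITH(INDEX=CCI_Dim_Order)
WHERE
    DIM.OrderDate >= CAST('2016-06-01' AS DATE)
    AND DIM.OrderDate < CAST('2016-07-01' AS DATE);
```

For the repro data this should report 60,000 matching rows with OrderId values from 304,001 through 364,000, and an evenly-spread figure of about 24 (1,462,000 / 60,000), which is close to the 1 month estimate in the question. The ordered scan on PK_Dim_Order actually has to pass over 304,000 non-matching rows before its first hit, which lines up with the 304001 segment reads in the slow plan's output.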