Dynamic SQL :: Calculate percentage of NULLs per index

I have a query that lists all the indexes in a database, and it works well:

SELECT TableName = t.name, 
       IndexName = ind.name, 
       IndexId = ind.index_id, 
       ColumnId = ic.index_column_id, 
       ColumnName = col.name,
       --(SELECT SUM(CASE WHEN col.name IS NULL THEN 1 ELSE 0 END) * 100.0 / count(*) FROM t.name) as nulls_percent,
       ind.*, 
       ic.*, 
       col.*
FROM sys.indexes ind
     INNER JOIN sys.index_columns ic ON ind.object_id = ic.object_id
                                        AND ind.index_id = ic.index_id
     INNER JOIN sys.columns col ON ic.object_id = col.object_id
                                   AND ic.column_id = col.column_id
     INNER JOIN sys.tables t ON ind.object_id = t.object_id
WHERE ind.is_primary_key = 0
      AND ind.is_unique = 0
      AND ind.is_unique_constraint = 0
      AND t.is_ms_shipped = 0
ORDER BY t.name, 
         ind.name, 
         ind.index_id, 
         ic.is_included_column, 
         ic.key_ordinal;

Unfortunately, if I uncomment line 6, t.name gets a red underline, and when I run the query I get this error:

Invalid object name 'TableName'.

How can I make this subquery work?

The goal is to get the percentage of NULLs in each column.

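As I understand it, you have a list of columns from various tables, and for each of those columns you want to determine the percentage of NULLs.

There is a similar Q&A here: SQL Query that grabs number of nulls from each column for a DB

Regarding looping: since you will read from one table, then the next, then the next, and so on, some sort of loop is needed, as per @Gordon's comment (well, short of writing each command out separately).

The basic approach is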

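  • Create a list of the tables/columns you want to interrogate
  • Create a table to hold the results/output
  • Dynamically create SQL that reads the number of rows and the number of NULLs, and saves them in your results/output table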

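In my answer there, I suggested creating a wide table with many pairs of columns (one for the column name, one for the number of NOT NULL values), e.g., col1_name, col1_num, col2_name, col2_num, and so on.

The advantage of that approach is performance: you can do all the number crunching with a single full read of each table and then put the values into the results table.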
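
As a rough illustration of the single-read idea, the sketch below counts the rows and the non-NULL values of several columns in one scan. The table dbo.ProductList and its columns ProductName and Price are hypothetical names used only for this example.

```sql
-- Hypothetical table and columns, for illustration only.
-- COUNT(*) counts every row; COUNT(col) skips NULLs,
-- so a single scan yields all the figures at once.
SELECT 'ProductName'      AS col1_name,
       COUNT(ProductName) AS col1_num,
       'Price'            AS col2_name,
       COUNT(Price)       AS col2_num,
       COUNT(*)           AS num_rows
FROM dbo.ProductList;
```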

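The OP went with a slightly different approach, interrogating each column one at a time. While this performs worse (e.g., each table has to be read multiple times, once per column), it

  • gives cleaner output (e.g., table name, column name, number of rows, number of NULLs)
  • allows other statistics to be run, e.g., getting minimum and maximum values

If you only need to run it once to get a snapshot of the database, and can accept a temporary performance hit, then getting your answer as one row per table column is a fine solution.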

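Based on the 'good bits' of the previous question, I'd suggest

  • Create a temporary table (or even a normal table in a new database created specifically for this) with the fields Database_name, Schema_name, Table_name, Column_name. This is the input to the loop.
  • Populate it with the columns you want to look at
  • Make another table to store the output, with Database_name, Schema_name, Table_name, Column_name, Num_Rows, Num_NotNULL
  • Write a loop that puts the next column name, etc., into variables
  • Run a command similar to the following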
-- Assume @Database_Name, @Schema_Name, @Table_Name, @Column_Name have been taken from the loop
SET @CustomSQL = '
    INSERT INTO TableResults (Database_Name, Schema_name, Table_Name, Column_Name, Num_Rows, Num_NotNULL)
    SELECT  ''' + @Database_Name + ''', 
            ''' + @Schema_Name + ''', 
            ''' + @Table_Name + ''', 
            ''' + @Column_Name + ''',
            COUNT(*) AS Num_Rows, 
            COUNT(' + QUOTENAME(@Column_Name) + ') AS Num_NotNULL
    FROM    ' + QUOTENAME(@Database_Name) + '.' + QUOTENAME(@Schema_Name) + '.' + QUOTENAME(@Table_Name) + ';';

EXEC (@CustomSQL);

An example of a command generated this way is

INSERT INTO TableResults (Database_Name, Schema_name, Table_Name, Column_Name, Num_Rows, Num_NotNULL)
    SELECT  'SalesDB', 
            'dbo',
            'ProductList', 
            'ProductName',
            COUNT(*) AS Num_Rows, 
            COUNT([ProductName]) AS Num_NotNULL
    FROM    [SalesDB].[dbo].[ProductList];
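
The whole sequence of steps could be sketched roughly as follows. This is only one possible arrangement, not a tested implementation: the names #ColumnsToCheck, dbo.TableResults and column_cursor are assumptions made for this sketch, and the column filter (nullable columns of user tables in the current database) is just an example of how the input list might be filled.

```sql
-- Illustrative names: #ColumnsToCheck (input list) and dbo.TableResults (output).
CREATE TABLE #ColumnsToCheck (
    Database_Name SYSNAME NOT NULL,
    Schema_Name   SYSNAME NOT NULL,
    Table_Name    SYSNAME NOT NULL,
    Column_Name   SYSNAME NOT NULL
);

CREATE TABLE dbo.TableResults (
    Database_Name SYSNAME NOT NULL,
    Schema_Name   SYSNAME NOT NULL,
    Table_Name    SYSNAME NOT NULL,
    Column_Name   SYSNAME NOT NULL,
    Num_Rows      BIGINT  NOT NULL,
    Num_NotNULL   BIGINT  NOT NULL
);

-- Example input: every nullable column of every user table in the current database.
-- text, ntext and image columns are skipped because COUNT() does not accept them.
INSERT INTO #ColumnsToCheck (Database_Name, Schema_Name, Table_Name, Column_Name)
SELECT DB_NAME(), s.name, t.name, c.name
FROM sys.tables t
     INNER JOIN sys.schemas s ON s.schema_id = t.schema_id
     INNER JOIN sys.columns c ON c.object_id = t.object_id
WHERE t.is_ms_shipped = 0
      AND c.is_nullable = 1
      AND TYPE_NAME(c.system_type_id) NOT IN ('text', 'ntext', 'image');

DECLARE @Database_Name SYSNAME, @Schema_Name SYSNAME,
        @Table_Name SYSNAME, @Column_Name SYSNAME,
        @CustomSQL NVARCHAR(MAX);

DECLARE column_cursor CURSOR LOCAL FAST_FORWARD FOR
    SELECT Database_Name, Schema_Name, Table_Name, Column_Name
    FROM #ColumnsToCheck;

OPEN column_cursor;
FETCH NEXT FROM column_cursor
    INTO @Database_Name, @Schema_Name, @Table_Name, @Column_Name;

WHILE @@FETCH_STATUS = 0
BEGIN
    -- QUOTENAME(x, '''') wraps x as a string literal and escapes embedded quotes;
    -- QUOTENAME(x) wraps x in brackets for use as an identifier.
    SET @CustomSQL = N'
        INSERT INTO dbo.TableResults (Database_Name, Schema_Name, Table_Name, Column_Name, Num_Rows, Num_NotNULL)
        SELECT ' + QUOTENAME(@Database_Name, '''') + N', '
                 + QUOTENAME(@Schema_Name, '''') + N', '
                 + QUOTENAME(@Table_Name, '''') + N', '
                 + QUOTENAME(@Column_Name, '''') + N',
               COUNT_BIG(*), COUNT_BIG(' + QUOTENAME(@Column_Name) + N')
        FROM ' + QUOTENAME(@Database_Name) + N'.' + QUOTENAME(@Schema_Name) + N'.' + QUOTENAME(@Table_Name) + N';';

    EXEC (@CustomSQL);

    FETCH NEXT FROM column_cursor
        INTO @Database_Name, @Schema_Name, @Table_Name, @Column_Name;
END;

CLOSE column_cursor;
DEALLOCATE column_cursor;
```

A cursor is used here only because it maps directly onto "put the next column name into variables"; a WHILE loop over a numbered list would work just as well.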

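Of course, you then need to do some simple arithmetic to get the figures you want; e.g., to get the number of NULLs, subtract the number of non-NULL values from the number of rows.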
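
That arithmetic might look like the sketch below, assuming the output table is named TableResults and uses the column names listed earlier (Num_Rows, Num_NotNULL). The aliases Num_NULL and Nulls_Percent are made up for this example.

```sql
-- Assumes the TableResults layout described above.
-- NULLIF turns a zero row count into NULL, so empty tables
-- give a NULL percentage instead of a divide-by-zero error.
SELECT Database_Name,
       Schema_Name,
       Table_Name,
       Column_Name,
       Num_Rows,
       Num_Rows - Num_NotNULL AS Num_NULL,
       (Num_Rows - Num_NotNULL) * 100.0 / NULLIF(Num_Rows, 0) AS Nulls_Percent
FROM TableResults
ORDER BY Database_Name, Schema_Name, Table_Name, Column_Name;
```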

Something like this:

DROP TABLE IF EXISTS #TEST;

SELECT TableName = t.name, 
       IndexName = ind.name, 
       IndexId = ind.index_id, 
       ColumnId = ic.index_column_id, 
       ColumnName = col.name,
       --(SELECT SUM(CASE WHEN col.name IS NULL THEN 1 ELSE 0 END) * 100.0 / count(*) FROM t.name) as nulls_percent,
       CAST(-1 AS DECIMAL(9,2)) AS nulls_percent--,
       --ind.*, 
       --ic.*, 
       --col.*
INTO #TEST
FROM sys.indexes ind
     INNER JOIN sys.index_columns ic ON ind.object_id = ic.object_id
                                        AND ind.index_id = ic.index_id
     INNER JOIN sys.columns col ON ic.object_id = col.object_id
                                   AND ic.column_id = col.column_id
     INNER JOIN sys.tables t ON ind.object_id = t.object_id
WHERE ind.is_primary_key = 0
      AND ind.is_unique = 0
      AND ind.is_unique_constraint = 0
      AND t.is_ms_shipped = 0
ORDER BY t.name, 
         ind.name, 
         ind.index_id, 
         ic.is_included_column, 
         ic.key_ordinal;


DECLARE @DynamicTSQLSTatement NVARCHAR(MAX)
       ,@TableName SYSNAME
       ,@ColumnName SYSNAME
       ,@nulls_percent DECIMAL(9,2);

WHILE EXISTS(SELECT 1 FROM #TEST WHERE [nulls_percent] = -1)
BEGIN;

    -- Pick one pending row. QUOTENAME guards against unusual identifiers,
    -- and NULLIF avoids a divide-by-zero error on empty tables.
    SELECT TOP 1 @DynamicTSQLSTatement = 'SELECT @nulls_percent = SUM(CASE WHEN ' + QUOTENAME(ColumnName) + ' IS NULL THEN 1 ELSE 0 END) * 100.0 / NULLIF(COUNT(*), 0) FROM ' + QUOTENAME(TableName)
                ,@TableName = [TableName]
                ,@ColumnName = [ColumnName]
    FROM #TEST 
    WHERE [nulls_percent] = -1;

    --SELECT @DynamicTSQLSTatement

    EXEC sp_executesql @DynamicTSQLSTatement, N'@nulls_percent DECIMAL(9,2) OUTPUT', @nulls_percent = @nulls_percent OUTPUT;

    -- Match on the column name: index_column_id is only unique within one index,
    -- so it cannot identify a column across several indexes on the same table.
    UPDATE #TEST
    SET nulls_percent = ISNULL(@nulls_percent,0)
    WHERE [TableName] = @TableName
        AND [ColumnName] = @ColumnName;

END;

SELECT *
FROM #TEST;

Of course, you will need to refine it; for example, by also adding each table's schema name.
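
One possible way to pick up the schema name is sketched below. The aliases SchemaName and QualifiedName are made up for this example; the idea is that a schema-qualified, bracketed name could be stored alongside each row and used in the FROM clause of the dynamic statement.

```sql
-- SCHEMA_NAME() resolves the schema of each table; QUOTENAME() brackets both parts
-- so the result can be concatenated safely into a dynamic FROM clause.
SELECT SchemaName    = SCHEMA_NAME(t.schema_id),
       TableName     = t.name,
       QualifiedName = QUOTENAME(SCHEMA_NAME(t.schema_id)) + N'.' + QUOTENAME(t.name)
FROM sys.tables t
WHERE t.is_ms_shipped = 0
ORDER BY SchemaName, TableName;
```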

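Your question and query are very interesting, and the problem can be solved with the help of statistics, which always exist for columns that participate in an index, such as the ones your query selects.

To get accurate results, it is best to update the statistics before running the query I provide below.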
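
Refreshing the statistics beforehand could be done along these lines. The table name dbo.ProductList is hypothetical, and on a large database either command may take a while.

```sql
-- Update out-of-date statistics for the tables in the current database.
EXEC sp_updatestats;

-- Or refresh the statistics of a single table by scanning all of its rows.
-- dbo.ProductList is a hypothetical table name.
UPDATE STATISTICS dbo.ProductList WITH FULLSCAN;
```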

;WITH cteColumnAllStats AS
(
    SELECT
        ST_COL.object_id,
        ST_COL.column_id,
        ST_COL.stats_id,

        -- NOTE: order no among stats of the same column
        ROW_NUMBER()
            OVER(
                PARTITION BY
                    ST_COL.object_id,
                    ST_COL.column_id
                ORDER BY
                    ST_COL.stats_id
            ) AS StatsOrderNo
    FROM sys.stats ST
    INNER JOIN sys.stats_columns ST_COL
        ON  ST_COL.stats_id = ST.stats_id
        AND ST_COL.object_id = ST.object_id
)
,cteColumnFirstStats AS
(
    SELECT
        ST_COL.object_id,
        ST_COL.column_id,

        -- NOTES:
        -- =====
        -- this would be null if there were no statistics for the column
        -- however not in this case because we are only considering columns
        -- participating in an index and all indices have statistics behind
        -- the scenes.
        --
        -- Also consider whether the statistics have been updated:
        -- If they have, the result will be a whole number (without decimals)
        -- and the result is exact.
        -- If they have not, the result is an estimate and in most of the cases
        -- there will be decimals or even produce a negative result.
        --
        -- If you want accurate results, you need to update the statistics:
        -- EXEC sp_updatestats
        --
        SUM(ST_HIST.range_rows) + SUM(ST_HIST.equal_rows) AS NonNullsRowCount
    FROM cteColumnAllStats ST_COL

    -- NOTE: this is the important bit
    CROSS APPLY sys.dm_db_stats_histogram(
        ST_COL.object_id,
        ST_COL.stats_id
    ) ST_HIST

    WHERE   ST_COL.StatsOrderNo = 1 -- take only the first stats for the column
    GROUP BY
        ST_COL.object_id,
        ST_COL.column_id
)
    SELECT TableName = t.name, 
           IndexName = ind.name, 
           IndexId = ind.index_id, 
           ColumnId = ic.index_column_id, 
           ColumnName = col.name,

           -- NOTE: included these columns for reference purposes (PLEASE REMOVE)
           SIND.rowcnt AS [RowCount],
           ST_COL.NonNullsRowCount,
           SIND.rowcnt - ST_COL.NonNullsRowCount AS NullsRowCount,

           --(SELECT SUM(CASE WHEN col.name IS NULL THEN 1 ELSE 0 END) * 100.0 / count(*) FROM t.name) as nulls_percent,
           CASE
                -- NOTE: stats are definitely out of date
                WHEN SIND.rowcnt < ST_COL.NonNullsRowCount THEN NULL

                -- NOTE: stats could be out of date (good to update them first)
                -- Also we don't want a divide by 0 hence the NULLIF
                ELSE (SIND.rowcnt - ST_COL.NonNullsRowCount) * 100.0 / NULLIF(SIND.rowcnt, 0)
           END as nulls_percent,

           ind.*, 
           ic.*, 
           col.*
    FROM sys.indexes ind
         INNER JOIN sys.index_columns ic ON ind.object_id = ic.object_id
                                            AND ind.index_id = ic.index_id
         INNER JOIN sys.columns col ON ic.object_id = col.object_id
                                       AND ic.column_id = col.column_id
         INNER JOIN sys.tables t ON ind.object_id = t.object_id

    -- NOTE: this gives you the COUNT(*) without querying the table
    INNER JOIN sys.sysindexes SIND
        ON  SIND.id = t.object_id

        -- NOTE:
        -- 0 means Heap
        -- 1 means Clustered Index
        -- Only these are reliable to use their rowcnt.
        -- There's always 1 of these and not the other.
        AND SIND.indid < 2

    -- NOTE: inner join is OK here because all columns participating in a index
    -- have associated statistics
    INNER JOIN cteColumnFirstStats ST_COL
        ON  ST_COL.object_id = t.object_id
        AND ST_COL.column_id = col.column_id

    WHERE ind.is_primary_key = 0
          AND ind.is_unique = 0
          AND ind.is_unique_constraint = 0
          AND t.is_ms_shipped = 0
    ORDER BY t.name, 
             ind.name, 
             ind.index_id, 
             ic.is_included_column, 
             ic.key_ordinal;