Validation for columns works very slowly (SQL Server)

I want to profile the data in a table's columns. In this particular case, I want the percentage of values that are date/integer/numeric/bit. The query I am using:

SELECT 
CAST(SUM(CASE WHEN TRY_CAST([column1] AS date) IS NOT NULL AND TRY_CAST(TRY_CAST([column1] AS VARCHAR(8000)) AS date) between '1950-01-01' AND '2049-12-31' AND LEN(RTRIM(CAST([column1] AS VARCHAR(MAX)))) > 0 THEN 1 ELSE 0 END) AS NUMERIC(25,2))/(CAST(SUM(CASE WHEN [column1] IS NOT NULL AND LOWER(LTRIM(RTRIM(CAST([column1] AS VARCHAR(MAX))))) not IN ('null', 'n/a') AND LEN(LTRIM(RTRIM(CAST([column1] AS VARCHAR(MAX))))) > 0 THEN 1 ELSE 0 END) AS NUMERIC(25,2))+0.00000001) AS PercentDate,
    CAST(SUM(CASE WHEN TRY_CAST([column1] AS FLOAT) IS NOT NULL AND LEN(RTRIM(CAST([column1] AS VARCHAR(MAX)))) > 0 THEN 1 ELSE 0 END) AS NUMERIC(25,2))/(CAST(SUM(CASE WHEN [column1] IS NOT NULL AND LOWER(LTRIM(RTRIM(CAST([column1] AS VARCHAR(MAX))))) not IN ('null', 'n/a') AND LEN(LTRIM(RTRIM(CAST([column1] AS VARCHAR(MAX))))) > 0 THEN 1 ELSE 0 END) AS NUMERIC(25,2))+0.00000001) AS PercentNumeric,
    CAST(SUM(CASE WHEN TRY_CAST([column1] AS BIGINT) IS NOT NULL AND LEN(RTRIM(CAST([column1] AS VARCHAR(MAX)))) > 0 THEN 1 ELSE 0 END) AS NUMERIC(25,2))/(CAST(SUM(CASE WHEN [column1] IS NOT NULL AND LOWER(LTRIM(RTRIM(CAST([column1] AS VARCHAR(MAX))))) not IN ('null', 'n/a') AND LEN(LTRIM(RTRIM(CAST([column1] AS VARCHAR(MAX))))) > 0 THEN 1 ELSE 0 END) AS NUMERIC(25,2))+0.00000001) AS PercentInteger,
    CAST(SUM(CASE WHEN LOWER(TRY_CAST([column1] AS VARCHAR(MAX))) IN ('1', '0', 't', 'f', 'y', 'n', 'true', 'false', 'yes', 'no') THEN 1 ELSE 0 END) AS NUMERIC(25,2))/(CAST(SUM(CASE WHEN [column1] IS NOT NULL AND LOWER(LTRIM(RTRIM(CAST([column1] AS VARCHAR(MAX))))) not IN ('null', 'n/a') AND LEN(LTRIM(RTRIM(CAST([column1] AS VARCHAR(MAX))))) > 0 THEN 1 ELSE 0 END) AS NUMERIC(25,2))+0.00000001) AS PercentBit
    FROM tbl

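This query runs extremely slowly even when I select only the top 1 row. In fact I never get a result at all, or at least I cannot wait long enough for one. In case it matters, the column I am checking is of type decimal.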

The table has 37,431,866 rows. That is why I select only the top 1000, but after more than 40 minutes it still has not returned any results.

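If you want it to run faster, limiting the rows in the query you are using will not help. After all, an aggregation query with no GROUP BY returns only one row, so TOP is applied to that single result row after the whole table has already been read.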

Use a subquery instead:

SELECT . . .
FROM (SELECT TOP (1000) t.*
      FROM tbl t
     ) t
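
As a concrete sketch of the pattern above, the following fills in the elided SELECT list with just one of the original expressions (PercentInteger); the other three columns would be added in the same way. Selecting only [column1] in the subquery instead of t.* is an assumption made here to keep the rows narrow, not something the pattern requires.

```sql
-- TOP is applied inside the derived table, so the aggregate sees at most 1000 rows.
SELECT
    CAST(SUM(CASE WHEN TRY_CAST(t.[column1] AS BIGINT) IS NOT NULL
                   AND LEN(RTRIM(CAST(t.[column1] AS VARCHAR(MAX)))) > 0
              THEN 1 ELSE 0 END) AS NUMERIC(25,2))
    / (CAST(SUM(CASE WHEN t.[column1] IS NOT NULL
                      AND LOWER(LTRIM(RTRIM(CAST(t.[column1] AS VARCHAR(MAX))))) NOT IN ('null', 'n/a')
                      AND LEN(LTRIM(RTRIM(CAST(t.[column1] AS VARCHAR(MAX))))) > 0
                 THEN 1 ELSE 0 END) AS NUMERIC(25,2)) + 0.00000001) AS PercentInteger
FROM (SELECT TOP (1000) [column1]   -- assumption: only the profiled column is needed
      FROM tbl
     ) t;
```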

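Note that this is not a random sample. If you try ORDER BY newid(), you will kill performance. One alternative for getting an approximate n% sample is to use logic like this: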

SELECT . . .
FROM (SELECT TOP (1000) t.*
      FROM tbl t
      WHERE RAND(CHECKSUM(NEWID())) < 0.001
     ) t

The 0.001 corresponds to roughly a 0.1% sample.

Your query can be simplified. This part:

CAST(SUM(CASE WHEN TRY_CAST([column1] AS date) IS NOT NULL AND TRY_CAST(TRY_CAST([column1] AS VARCHAR(8000)) AS date) between '1950-01-01' AND '2049-12-31' AND LEN(RTRIM(CAST([column1] AS VARCHAR(MAX)))) > 0 THEN 1 ELSE 0 END) AS NUMERIC(25,2))

can also be written as:

CAST(SUM(CASE WHEN TRY_CAST(TRY_CAST([column1] AS VARCHAR(8000)) AS date) between '1950-01-01' AND '2049-12-31' THEN 1 ELSE 0 END) AS NUMERIC(25,2))

The second is faster than the first and gives the same result (as far as I know).

The same may apply to the other parts of the query.
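
One possible way to carry the same idea through the whole query is sketched below: convert and trim each value once with CROSS APPLY, compute one flag per test, and aggregate the flags. The names v.s and the is_* flags are introduced here purely for illustration. The sketch also makes two assumptions that differ slightly from the original query: it casts to VARCHAR(8000) rather than VARCHAR(MAX), and it trims the value before the bit test. It has not been timed against the original, so treat it as a starting point rather than a guaranteed speed-up.

```sql
SELECT
    CAST(SUM(f.is_date)    AS NUMERIC(25,2)) / (CAST(SUM(f.is_valid) AS NUMERIC(25,2)) + 0.00000001) AS PercentDate,
    CAST(SUM(f.is_numeric) AS NUMERIC(25,2)) / (CAST(SUM(f.is_valid) AS NUMERIC(25,2)) + 0.00000001) AS PercentNumeric,
    CAST(SUM(f.is_integer) AS NUMERIC(25,2)) / (CAST(SUM(f.is_valid) AS NUMERIC(25,2)) + 0.00000001) AS PercentInteger,
    CAST(SUM(f.is_bit)     AS NUMERIC(25,2)) / (CAST(SUM(f.is_valid) AS NUMERIC(25,2)) + 0.00000001) AS PercentBit
FROM (SELECT TOP (1000) [column1]
      FROM tbl
     ) t
-- Convert and trim each value once instead of once per condition.
CROSS APPLY (SELECT LTRIM(RTRIM(CAST(t.[column1] AS VARCHAR(8000)))) AS s) v
-- Evaluate each test once per row; a NULL value fails every test and yields 0.
CROSS APPLY (SELECT
    CASE WHEN TRY_CAST(v.s AS date) BETWEEN '1950-01-01' AND '2049-12-31' THEN 1 ELSE 0 END AS is_date,
    -- An empty string converts to 0 for FLOAT and BIGINT, so it is excluded explicitly.
    CASE WHEN v.s <> '' AND TRY_CAST(v.s AS FLOAT)  IS NOT NULL THEN 1 ELSE 0 END AS is_numeric,
    CASE WHEN v.s <> '' AND TRY_CAST(v.s AS BIGINT) IS NOT NULL THEN 1 ELSE 0 END AS is_integer,
    CASE WHEN LOWER(v.s) IN ('1', '0', 't', 'f', 'y', 'n', 'true', 'false', 'yes', 'no') THEN 1 ELSE 0 END AS is_bit,
    CASE WHEN v.s <> '' AND LOWER(v.s) NOT IN ('null', 'n/a') THEN 1 ELSE 0 END AS is_valid
) f;
```

The sums are cast to NUMERIC(25,2) before the division, exactly as in the original, so that adding 0.00000001 to the denominator keeps its scale and still guards against division by zero.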