Using a Recursive CTE function to check every row against every other row
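I'm trying to update a table so that I flag any entries that have a duplicate name. I do some processing to strip common prefixes and suffixes, and can then run the two names through a fuzzy-matching CLR function to compare them. I wrote this as a nested cursor, and it currently takes about 4 hours to run through all the records, because I have to check every row against every other row. I've read that a recursive CTE can improve performance significantly, but I'm a SQL noob and can't quite get it to work. I think I need to nest one recursive CTE inside another, but I'm not sure how.
Currently I have something like this: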
;WITH AllOrgs (CompanyId, CompanyRoleId, Name, Recognized, Level)
AS
(
SELECT C.CompanyId, C.CompanyRoleId, C.Name, C.Recognized, 1
FROM Company C
WHERE DuplicateOfCompanyId IS NULL
UNION ALL
SELECT C.CompanyId, C.CompanyRoleId, C.Name, R.Recognized, R.Level + 1
FROM AllOrgs R INNER JOIN Company C
ON C.CompanyId = R.CompanyId
),
DuplicateOrgs (CompanyId, CompanyRoleId, Name, Recognized, Level)
As
(
SELECT * FROM AllOrgs
WHERE Recognized = 0 -- Recognized is what the companies are marked when we are satisfied they aren't incorrect
)
UPDATE C
SET C.DuplicateOfCompanyId = A.CompanyId
FROM Company C JOIN DuplicateOrgs A
ON C.CompanyId = A.CompanyID
WHERE master.dbo.fnClrFuzzyMatch(dbo.fnCleanUpCompanyName(A.Name), dbo.fnCleanUpCompanyName(C.Name))
> @CompanyNameMatchValueThreshold
AND A.CompanyRoleID = C.CompanyRoleId -- Role ID must match as duplicates who provide a different function are fine
But whenever I try to run it I get "The statement terminated. The maximum recursion 100 has been exhausted before statement completion." So I'm obviously doing something stupid.
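Your recursion does not terminate because the recursive member always inserts the anchor rows themselves again at the new level. Example with only one row in Company:
AllOrgs after the anchor executes:
CompanyId1, CompanyRoleId1, name1, Recognized1, 1
AllOrgs after recursion 1:
CompanyId1, CompanyRoleId1, name1, Recognized1, 1
CompanyId1, CompanyRoleId1, name1, Recognized1, 2
AllOrgs after recursion 2: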
CompanyId1, CompanyRoleId1, name1, Recognized1, 1
CompanyId1, CompanyRoleId1, name1, Recognized1, 2
CompanyId1, CompanyRoleId1, name1, Recognized1, 3
...
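For contrast, here is a minimal sketch of a recursive CTE that does stop. It is unrelated to the Company schema; the names Numbers and n are made up purely for illustration. The point is that the recursive member needs a predicate that eventually produces no rows.

```sql
-- Hypothetical example: Numbers and n are illustrative names only.
WITH Numbers (n) AS
(
    -- Anchor member: runs once and seeds the recursion.
    SELECT 1
    UNION ALL
    -- Recursive member: sees only the rows produced by the previous step.
    -- Once n reaches 5 the WHERE clause filters everything out,
    -- the step returns no rows, and the recursion ends.
    SELECT n + 1
    FROM Numbers
    WHERE n < 5
)
SELECT n
FROM Numbers
ORDER BY n;  -- five rows: 1 through 5
```

In the question's CTE the join C.CompanyId = R.CompanyId matches every row from the previous step again, so no step ever comes back empty. Raising the limit with OPTION (MAXRECURSION n) would only postpone the error, not fix it.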
Try a self join instead:
UPDATE C
SET DuplicateOfCompanyId = Dup.CompanyId
FROM Company C
JOIN Company Dup ON C.CompanyId <> Dup.CompanyID
AND master.dbo.fnClrFuzzyMatch(dbo.fnCleanUpCompanyName(C.Name), dbo.fnCleanUpCompanyName(Dup.Name)) > @CompanyNameMatchValueThreshold
AND C.CompanyRoleID = Dup.CompanyRoleId
Note: if a company has more than one duplicate, the resulting DuplicateOfCompanyId may be arbitrary and inconsistent.
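One way to make the result deterministic is to always point a duplicate at the lowest matching CompanyId. The following is only a sketch under assumptions: it uses a table variable @Company with sample rows so it can run on its own, and a trimmed, upper-cased equality test stands in for master.dbo.fnClrFuzzyMatch(...) > @CompanyNameMatchValueThreshold. The alias names Later, Earlier, M and the column MasterCompanyId are illustrative.

```sql
-- Run as a single batch: table variables only live for the batch.
-- Assumption: @Company mimics the relevant columns of the Company table.
DECLARE @Company TABLE (
    CompanyId            INT           PRIMARY KEY,
    CompanyRoleId        INT           NOT NULL,
    Name                 NVARCHAR(100) NOT NULL,
    DuplicateOfCompanyId INT           NULL
);

INSERT INTO @Company (CompanyId, CompanyRoleId, Name) VALUES
    (1, 10, N'Acme'),
    (2, 10, N'ACME '),
    (3, 10, N'acme'),
    (4, 20, N'Acme'),    -- same name, different role: not a duplicate
    (5, 10, N'Globex');

UPDATE C
SET DuplicateOfCompanyId = M.MasterCompanyId
FROM @Company AS C
JOIN (
    -- For each row, find the lowest-numbered earlier row that matches it.
    SELECT Later.CompanyId,
           MIN(Earlier.CompanyId) AS MasterCompanyId
    FROM @Company AS Later
    JOIN @Company AS Earlier
      ON Earlier.CompanyId < Later.CompanyId          -- each pair is compared once
     AND Earlier.CompanyRoleId = Later.CompanyRoleId  -- role must match
     -- Stand-in for the fuzzy match; swap in the CLR call and threshold here.
     AND UPPER(LTRIM(RTRIM(Earlier.Name))) = UPPER(LTRIM(RTRIM(Later.Name)))
    GROUP BY Later.CompanyId
) AS M
  ON M.CompanyId = C.CompanyId;

SELECT CompanyId, Name, DuplicateOfCompanyId
FROM @Company
ORDER BY CompanyId;
-- Rows 2 and 3 point at CompanyId 1; rows 1, 4 and 5 stay NULL.
```

Because only the higher-numbered row of a pair is updated, the lowest CompanyId in a group is never flagged itself. With a real fuzzy match, which need not be transitive, the chosen master can still itself be flagged as a duplicate of an even earlier row, so chains may need a follow-up pass.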