Identifying duplicates within a table: looking for query advice
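I'm trying to identify duplicate contact records within an account and am looking for the best way to do it. There is an account table and a contact table. The query below gives me what I need, but I suspect there is a better or more efficient way to do this, so I'm looking for any feedback or advice. Thanks in advance!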
SELECT *
FROM sysdba.CONTACT a WITH (NOLOCK)
WHERE EXISTS
(
    SELECT ACCOUNTID, FIRSTNAME, LASTNAME, EMAIL
    FROM sysdba.CONTACT b WITH (NOLOCK)
    GROUP BY ACCOUNTID, FIRSTNAME, LASTNAME, EMAIL
    HAVING COUNT(*) > 1
       AND a.ACCOUNTID = b.ACCOUNTID AND a.FIRSTNAME = b.FIRSTNAME
       AND a.LASTNAME = b.LASTNAME AND a.EMAIL = b.EMAIL
)
ORDER BY ACCOUNTID, FIRSTNAME, LASTNAME, EMAIL
Here is another way I could do this, but having to use DISTINCT seems ugly.
SELECT DISTINCT a.CONTACTID, a.FIRSTNAME, a.LASTNAME, a.EMAIL
FROM sysdba.CONTACT a WITH (NOLOCK)
JOIN sysdba.CONTACT b WITH (NOLOCK)
    ON a.ACCOUNTID = b.ACCOUNTID AND a.FIRSTNAME = b.FIRSTNAME
   AND a.LASTNAME = b.LASTNAME AND a.EMAIL = b.EMAIL
   AND a.CONTACTID != b.CONTACTID
ORDER BY a.CONTACTID, a.FIRSTNAME, a.LASTNAME, a.EMAIL
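Looking at the execution plans for both, the first query accounts for 37% and the second for 63%, which surprised me because I had always assumed (apparently wrongly) that a join would be faster than relying on a WHERE clause.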
When you are trying to identify duplicates, it is very common to use window functions such as COUNT() OVER (...) and ROW_NUMBER() OVER (...).
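The query below should return the groups of records in which more than one CONTACTID shares the same ACCOUNTID, FIRSTNAME, LASTNAME, EMAIL combination. In other words, it returns the records that have duplicates, together with their duplicates: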
;WITH cteCONTACT
AS (
    SELECT ACCOUNTID, FIRSTNAME, LASTNAME, EMAIL, CONTACTID,
        -- number of rows that share the same account, name, and email
        CNT = COUNT(*) OVER (PARTITION BY ACCOUNTID, FIRSTNAME, LASTNAME, EMAIL)
    FROM sysdba.CONTACT
)
SELECT ACCOUNTID, FIRSTNAME, LASTNAME, EMAIL, CONTACTID
FROM cteCONTACT
WHERE CNT > 1;
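As a small illustration of the COUNT(*) OVER pattern, the sketch below runs the same query shape against a table variable holding made-up rows. The @CONTACT table variable, its column types, and the sample values are assumptions for demonstration only; they are not the real sysdba.CONTACT schema.

```sql
-- Hypothetical stand-in for sysdba.CONTACT; column types are assumed.
DECLARE @CONTACT TABLE (
    CONTACTID INT PRIMARY KEY,
    ACCOUNTID INT,
    FIRSTNAME VARCHAR(50),
    LASTNAME  VARCHAR(50),
    EMAIL     VARCHAR(100)
);

INSERT INTO @CONTACT (CONTACTID, ACCOUNTID, FIRSTNAME, LASTNAME, EMAIL)
VALUES (1, 10, 'Ann', 'Lee', 'ann@example.com'),
       (2, 10, 'Ann', 'Lee', 'ann@example.com'),  -- same account, same person
       (3, 10, 'Bob', 'Ray', 'bob@example.com'),  -- unique within account 10
       (4, 20, 'Ann', 'Lee', 'ann@example.com');  -- same person, different account

WITH cteCONTACT AS (
    SELECT CONTACTID, ACCOUNTID, FIRSTNAME, LASTNAME, EMAIL,
           -- size of the group this row belongs to
           CNT = COUNT(*) OVER (PARTITION BY ACCOUNTID, FIRSTNAME, LASTNAME, EMAIL)
    FROM @CONTACT
)
SELECT CONTACTID, ACCOUNTID, CNT
FROM cteCONTACT
WHERE CNT > 1
ORDER BY CONTACTID;
-- Returns CONTACTID 1 and 2 (each with CNT = 2);
-- CONTACTID 3 and 4 are alone in their partitions and are filtered out.
```

Because ACCOUNTID is part of the PARTITION BY list, the same person appearing under two different accounts is not reported as a duplicate.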
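The query below should return only the duplicates, without the original record each one duplicates (the row with the lowest CONTACTID in each group is treated as the original):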
;WITH cteCONTACT
AS (
    SELECT ACCOUNTID, FIRSTNAME, LASTNAME, EMAIL, CONTACTID,
        -- 1 for the first row of each group, 2, 3, ... for the extra copies
        NUM = ROW_NUMBER() OVER (
            PARTITION BY ACCOUNTID, FIRSTNAME, LASTNAME, EMAIL
            ORDER BY CONTACTID)
    FROM sysdba.CONTACT
)
SELECT ACCOUNTID, FIRSTNAME, LASTNAME, EMAIL, CONTACTID
FROM cteCONTACT
WHERE NUM > 1;
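To make the difference from the COUNT(*) version concrete, here is a minimal sketch of the ROW_NUMBER() pattern on made-up data. As before, the @CONTACT table variable, its column types, and the rows are hypothetical and only stand in for sysdba.CONTACT.

```sql
-- Hypothetical stand-in for sysdba.CONTACT; column types are assumed.
DECLARE @CONTACT TABLE (
    CONTACTID INT PRIMARY KEY,
    ACCOUNTID INT,
    FIRSTNAME VARCHAR(50),
    LASTNAME  VARCHAR(50),
    EMAIL     VARCHAR(100)
);

INSERT INTO @CONTACT (CONTACTID, ACCOUNTID, FIRSTNAME, LASTNAME, EMAIL)
VALUES (1, 10, 'Ann', 'Lee', 'ann@example.com'),  -- treated as the original
       (2, 10, 'Ann', 'Lee', 'ann@example.com'),  -- first extra copy
       (3, 10, 'Ann', 'Lee', 'ann@example.com'),  -- second extra copy
       (4, 10, 'Bob', 'Ray', 'bob@example.com');  -- no duplicates

WITH cteCONTACT AS (
    SELECT CONTACTID, ACCOUNTID, FIRSTNAME, LASTNAME, EMAIL,
           -- numbers the rows inside each group, lowest CONTACTID first
           NUM = ROW_NUMBER() OVER (
               PARTITION BY ACCOUNTID, FIRSTNAME, LASTNAME, EMAIL
               ORDER BY CONTACTID)
    FROM @CONTACT
)
SELECT CONTACTID, NUM
FROM cteCONTACT
WHERE NUM > 1
ORDER BY CONTACTID;
-- Returns CONTACTID 2 (NUM = 2) and CONTACTID 3 (NUM = 3);
-- CONTACTID 1 and 4 both have NUM = 1 and are filtered out.
```

Which row counts as the original depends entirely on the ORDER BY inside the OVER clause; ordering by a different column would keep a different row out of the result.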