How to find duplicate entries from a SQL Server table based on some column value?
This is my SQL Server table. I need to find duplicate customer codes based on the Customer Name, Date of birth, and Father Name columns:
Customer code   Customer Name   Date of birth   Father Name
-------------------------------------------------------------
0001            Md. Alam        1991-10-20      Sr. Alam
0002            Alam            1991-10-20      Sr. alam
0004            Hasan           1990-01-01      Sr. Hasan
0005            Karim           1988-01-01      Sr. Karim
0006            Karim           1988-01-01      S Karim
0007            Kalam           1985-01-01      Sr. Kalam
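The output should look like this:
Customers 0001, 0002, 0005, and 0006 are duplicates because their Customer Name, Date of birth, and Father Name values are similar.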
Customer code   Customer Name   Date of birth   Father Name
------------------------------------------------------------
0001            Md. Alam        1991-10-20      Sr. Alam
0002            Alam            1991-10-20      Sr. alam
0005            Karim           1988-01-01      Sr. Karim
0006            Karim           1988-01-01      S Karim
Please help me find an efficient way to do this.
select * from customer where code in (
    select c1.code
    from customer c1
    inner join customer c2 on c1.dob = c2.dob
        and (c1.father = c2.father or c1.name = c2.name)
    group by c1.code having count(c1.code) > 1
)
SELECT t.Customer_Code
     , t.Customer_Name
FROM tb2 AS t
WHERE t.Customer_Name IN (
    SELECT t.Customer_Name
    FROM tb2 AS t
    GROUP BY t.Customer_Name
    HAVING COUNT(t.Customer_Name) > 1
)
ORDER BY t.Customer_Code ASC
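Please try the following solution.
- The 1st CTE cleans up two columns, CustomerName and FatherName, by dropping everything up to the first space.
- The 2nd CTE creates buckets based on the combination of 3 columns: CustomerName, DOB, and FatherName.
- The 3rd CTE counts the rows in each bucket, i.e. it finds the duplicates.
- The final SELECT joins back to the original table and filters out the non-duplicate rows.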
SQL
-- DDL and sample data population, start
DECLARE @tbl TABLE (CustomerCode CHAR(4), CustomerName VARCHAR(20), DOB DATE, FatherName VARCHAR(20));
INSERT INTO @tbl (CustomerCode, CustomerName, DOB, FatherName) VALUES
('0001', 'Md. Alam', '1991-10-20', 'Sr. Alam'),
('0002', 'Alam', '1991-10-20', 'Sr. alam'),
('0004', 'Hasan', '1990-01-01', 'Sr. Hasan'),
('0005', 'Karim', '1988-01-01', 'Sr. Karim'),
('0006', 'Karim', '1988-01-01', 'S Karim'),
('0007', 'Kalam', '1985-01-01', 'Sr. Kalam');
-- DDL and sample data population, end
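-- 1st CTE: strip everything up to and including the first space (e.g. 'Md. ', 'Sr. ').
-- When there is no space, CHARINDEX returns 0 and the value is kept whole.
WITH rs AS
(
    SELECT CustomerCode, DOB
        , RIGHT(CustomerName, LEN(CustomerName) - c.pos) AS CustomerName
        , RIGHT(FatherName, LEN(FatherName) - f.pos) AS FatherName
    FROM @tbl
    CROSS APPLY (SELECT CHARINDEX(SPACE(1), CustomerName)) AS c(pos)
    CROSS APPLY (SELECT CHARINDEX(SPACE(1), FatherName)) AS f(pos)
), cte AS
(
    -- 2nd CTE: the difference of the two row numbers stays constant across a run of
    -- consecutive CustomerCode rows that share the same cleaned name, DOB, and father name.
    -- Note: this relies on duplicate rows being adjacent in CustomerCode order.
    SELECT *
        , ROW_NUMBER() OVER (ORDER BY CustomerCode) -
          ROW_NUMBER() OVER (PARTITION BY rs.CustomerName, rs.DOB, rs.FatherName ORDER BY rs.CustomerCode) AS bucket
    FROM rs
), cte2 AS
(
    -- 3rd CTE: count the rows in each bucket.
    SELECT CustomerCode, bucket, COUNT(bucket) OVER (PARTITION BY cte.bucket) AS [counter]
    FROM cte
    GROUP BY CustomerCode, bucket
)
-- Final SELECT: keep only the original rows whose bucket holds more than one row.
SELECT t.* FROM @tbl AS t
INNER JOIN cte2 ON cte2.CustomerCode = t.CustomerCode
WHERE cte2.counter > 1;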
Output
+--------------+--------------+------------+------------+
| CustomerCode | CustomerName | DOB | FatherName |
+--------------+--------------+------------+------------+
| 0001 | Md. Alam | 1991-10-20 | Sr. Alam |
| 0002 | Alam | 1991-10-20 | Sr. alam |
| 0005 | Karim | 1988-01-01 | Sr. Karim |
| 0006 | Karim | 1988-01-01 | S Karim |
+--------------+--------------+------------+------------+
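The bucket arithmetic above only groups duplicates that sit next to each other in CustomerCode order. As a possible alternative, here is a minimal sketch that counts rows per cleaned key directly with a windowed COUNT(*). The column aliases NameKey and FatherKey and the CTE names are made up for illustration, and UPPER() is an added assumption so the comparison does not depend on the database collation.

```sql
-- Sample data, same as in the question
DECLARE @tbl TABLE (CustomerCode CHAR(4), CustomerName VARCHAR(20), DOB DATE, FatherName VARCHAR(20));
INSERT INTO @tbl (CustomerCode, CustomerName, DOB, FatherName) VALUES
('0001', 'Md. Alam', '1991-10-20', 'Sr. Alam'),
('0002', 'Alam', '1991-10-20', 'Sr. alam'),
('0004', 'Hasan', '1990-01-01', 'Sr. Hasan'),
('0005', 'Karim', '1988-01-01', 'Sr. Karim'),
('0006', 'Karim', '1988-01-01', 'S Karim'),
('0007', 'Kalam', '1985-01-01', 'Sr. Kalam');

WITH cleaned AS
(
    -- Keep the text after the first space; with no space, CHARINDEX is 0 and the whole value is kept.
    -- NameKey and FatherKey are hypothetical names used only in this sketch.
    SELECT CustomerCode, CustomerName, DOB, FatherName
        , UPPER(SUBSTRING(CustomerName, CHARINDEX(' ', CustomerName) + 1, LEN(CustomerName))) AS NameKey
        , UPPER(SUBSTRING(FatherName, CHARINDEX(' ', FatherName) + 1, LEN(FatherName))) AS FatherKey
    FROM @tbl
), counted AS
(
    -- Count how many rows share the same cleaned name, DOB, and cleaned father name.
    SELECT *
        , COUNT(*) OVER (PARTITION BY NameKey, DOB, FatherKey) AS cnt
    FROM cleaned
)
SELECT CustomerCode, CustomerName, DOB, FatherName
FROM counted
WHERE cnt > 1
ORDER BY CustomerCode;
```

On the sample data this returns the same four rows (0001, 0002, 0005, 0006). Because the count is taken per key rather than per run of adjacent rows, it should also catch duplicates whose codes are not consecutive.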
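If I understand correctly, you consider customers 0001 and 0002 to be duplicates even though their customer names clearly do not match as whole strings.
I believe you need some kind of fuzzy search here, one that compares two strings and returns a match percentage from which you can decide whether two entries are duplicates.
A similar question was answered here:
SQL Server Fuzzy Search with Percentage of match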
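To give a rough idea of what a fuzzy comparison could look like, the sketch below uses SQL Server's built-in DIFFERENCE function, which compares the SOUNDEX codes of two strings and returns an integer from 0 (least similar) to 4 (most similar). This is not the percentage-based method from the linked answer, only a simple stand-in. The aliases CodeA, CodeB, NameScore, and FatherScore are made up for illustration, and any cut-off on the scores is left to you.

```sql
-- Sample data, same as in the question
DECLARE @tbl TABLE (CustomerCode CHAR(4), CustomerName VARCHAR(20), DOB DATE, FatherName VARCHAR(20));
INSERT INTO @tbl (CustomerCode, CustomerName, DOB, FatherName) VALUES
('0001', 'Md. Alam', '1991-10-20', 'Sr. Alam'),
('0002', 'Alam', '1991-10-20', 'Sr. alam'),
('0004', 'Hasan', '1990-01-01', 'Sr. Hasan'),
('0005', 'Karim', '1988-01-01', 'Sr. Karim'),
('0006', 'Karim', '1988-01-01', 'S Karim'),
('0007', 'Kalam', '1985-01-01', 'Sr. Kalam');

-- List every pair of customers with the same DOB, together with phonetic similarity scores.
-- a.CustomerCode < b.CustomerCode keeps each pair once and skips self-matches.
SELECT a.CustomerCode AS CodeA
    , b.CustomerCode AS CodeB
    , DIFFERENCE(a.CustomerName, b.CustomerName) AS NameScore     -- 0..4
    , DIFFERENCE(a.FatherName, b.FatherName) AS FatherScore       -- 0..4
FROM @tbl AS a
INNER JOIN @tbl AS b
    ON a.DOB = b.DOB
   AND a.CustomerCode < b.CustomerCode
ORDER BY a.CustomerCode, b.CustomerCode;
```

On the sample data only the pairs (0001, 0002) and (0005, 0006) share a DOB, so those are the candidate pairs listed. Stripping prefixes such as 'Md.' or 'Sr.' before scoring will likely give more meaningful scores, since SOUNDEX is driven by the leading characters of a string.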