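Optimal search string in the WHERE clause
I want to search for a string in the WHERE clause using PATINDEX and SOUNDEX, or whatever approach works best.
I have the following table with some sample data, and I want to search it for a given string using PATINDEX and SOUNDEX.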
create table tbl_pat_soundex
(
col_str varchar(max)
);
insert into tbl_pat_soundex values('Smith A Steve');
insert into tbl_pat_soundex values('Steve A Smyth');
insert into tbl_pat_soundex values('A Smeeth Stive');
insert into tbl_pat_soundex values('Steve Smith A');
insert into tbl_pat_soundex values('Smit Steve A');
Note: the table I need to search contains 100 million records.
String to search for: 'Smith A Steve'
SELECT col_str
FROM tbl_pat_soundex
WHERE PATINDEX('%Smith%',col_str) >= 1 AND PATINDEX('%A%',col_str) >= 1 AND PATINDEX('%Steve%',col_str) >= 1
Actual output:
col_str
--------------
Smith A Steve
Steve Smith A
Expected output:
col_str
----------------
Smith A Steve
Steve A Smyth
A Smeeth Stive
Steve Smith A
Smit Steve A
What I have tried:
1:
SELECT col_str
FROM tbl_pat_soundex
WHERE PATINDEX('%Smith%',col_str) >= 1 AND
PATINDEX('%A%',col_str) >= 1 AND
PATINDEX('%Steve%',col_str) >= 1
2:
SELECT col_str
FROM tbl_pat_soundex
WHERE PATINDEX('%'+SOUNDEX('Smith')+'%',SOUNDEX(col_str)) >= 1 AND
PATINDEX('%'+SOUNDEX('A')+'%',SOUNDEX(col_str)) >= 1 AND
PATINDEX('%'+SOUNDEX('Steve')+'%',SOUNDEX(col_str)) >= 1
3:
SELECT col_str
FROM tbl_pat_soundex
WHERE DIFFERENCE('Smith',col_str) = 4 AND
DIFFERENCE('A',col_str) =4 AND
DIFFERENCE('Steve',col_str) = 4
4:
--The following took a very long time to execute (it was still running after more than 20 minutes).
SELECT DISTINCT col_str
FROM tbl_pat_soundex [a]
CROSS APPLY SplitString([a].[col_str], ' ') [b]
WHERE DIFFERENCE([b].Item,'Smith') >= 1 AND
DIFFERENCE([b].Item,'A') >= 1 AND
DIFFERENCE([b].Item,'Steve') >= 1
In my opinion, you should try dynamic SQL.
For example, say you have this table:
create table tbl_pat_soundex
(
id int,
col_str varchar(max)
)
And you have the following index, or any other index (a table with more than 100 million rows should have some index):
CREATE NONCLUSTERED INDEX myIndex ON dbo.tbl_pat_soundex(id) INCLUDE (col_str)
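So try building the following dynamic SQL query according to your logic and then execute it. The desired result should look like this:
DECLARE @statement NVARCHAR(4000)
--Single quotes inside the string literal are doubled so that the statement parses
SET @statement = N'
SELECT col_str
FROM tbl_pat_soundex
WHERE col_str like ''%Smith%'' AND id > 0
UNION ALL
SELECT col_str
FROM tbl_pat_soundex
WHERE col_str like ''%Steve%'' AND id > 0
UNION ALL
SELECT col_str
FROM tbl_pat_soundex
WHERE
PATINDEX(''%Smith%'',col_str) >= 1 AND PATINDEX(''%A%'',col_str) >= 1 AND
PATINDEX(''%Steve%'',col_str) >= 1
AND id > 0'
EXEC sp_executesql @statement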
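Basically, what we do is build individual search queries, each of which performs an index seek, and then combine all of their results.
The following query will perform an index seek because we use the predicate id > 0 (assuming all ids are greater than 0; otherwise you can substitute your own negative lower bound):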
SELECT col_str
FROM tbl_pat_soundex
WHERE col_str like '%Smith%' AND id > 0
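When the search term comes from user input, one option is to pass it to the dynamic statement as a parameter instead of concatenating it into the string. The following is only a minimal sketch under that assumption: the parameter name @term and its length are made up for illustration, and the table and the id > 0 predicate are taken from the example above.

```sql
-- Sketch: one of the single-term searches, with the term passed as a parameter.
-- @term is a hypothetical parameter name; VARCHAR(100) is an assumed length.
DECLARE @statement NVARCHAR(4000) = N'
SELECT col_str
FROM dbo.tbl_pat_soundex
WHERE col_str LIKE ''%'' + @term + ''%'' AND id > 0';

EXEC sys.sp_executesql
     @statement,
     N'@term VARCHAR(100)',   -- parameter definition list
     @term = 'Smith';         -- value supplied at execution time
```

The statement text stays the same for every term, so only the parameter value changes between calls.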
With this many rows, the only hint I can give you is: change the design. Each part of the name should live in its own column...
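As a rough illustration of what such a design change could look like, here is a sketch that stores one name part per row (a variation on "one part per column") together with its SOUNDEX code, so the codes can be indexed. The table name tbl_name_part, its columns, and the index name are all hypothetical; row_id is assumed to point back at the original row.

```sql
-- Hypothetical child table: one row per name part (all names are illustrative).
CREATE TABLE dbo.tbl_name_part
(
    row_id  INT          NOT NULL,      -- assumed to reference the original row
    part    VARCHAR(100) NOT NULL,      -- a single name part, e.g. 'Smith'
    part_sx AS SOUNDEX(part) PERSISTED  -- SOUNDEX code stored alongside the part
);
GO
-- Index on the code so that the search below can seek instead of scanning.
CREATE NONCLUSTERED INDEX IX_tbl_name_part_sx
    ON dbo.tbl_name_part (part_sx, row_id);
GO
INSERT INTO dbo.tbl_name_part (row_id, part) VALUES
 (1, 'Smith'), (1, 'A'), (1, 'Steve')
,(2, 'Steve'), (2, 'A'), (2, 'Smyth')
,(3, 'Smit'),  (3, 'A');
GO
-- Find rows that contain a sound-alike for every one of the three search parts.
-- The literal 3 assumes the three search parts have three distinct SOUNDEX codes.
SELECT p.row_id
FROM dbo.tbl_name_part AS p
WHERE p.part_sx IN (SOUNDEX('Smith'), SOUNDEX('A'), SOUNDEX('Steve'))
GROUP BY p.row_id
HAVING COUNT(DISTINCT p.part_sx) = 3;
```

The point of the sketch is that the comparison becomes an equality test on a short indexed column rather than a LIKE with a leading wildcard against the full string.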
The following will work, but I guarantee it will be slow...
--Set up a test database
USE master;
GO
CREATE DATABASE shnugo;
GO
USE shnugo;
GO
--Your table; I added an ID column
create table tbl_pat_soundex
(
ID INT IDENTITY --needed to distinguish rows
,col_str varchar(max)
);
GO
--A function that takes a space-separated string and returns a list of its distinct SOUNDEX values, sorted alphabetically and delimited by /:
--"Smith A Steve" is returned as /A000/S310/S530/
CREATE FUNCTION dbo.ComputeSoundex(@str VARCHAR(MAX))
RETURNS VARCHAR(MAX)
AS
BEGIN
DECLARE @tmpXML XML=CAST('<x>' + REPLACE((SELECT @str AS [*] FOR XML PATH('')),' ','</x><x>') + '</x>' AS XML);
RETURN (SELECT DISTINCT '/' + SOUNDEX(x.value('text()[1]','varchar(max)')) AS [se]
FROM @tmpXML.nodes('/x[text()]') A(x)
ORDER BY se
FOR XML PATH(''),TYPE).value('.','nvarchar(max)') + '/';
END
GO
--Add a column to permanently store the computed SOUNDEX chain
ALTER TABLE tbl_pat_soundex ADD SortedSoundExPattern VARCHAR(MAX);
GO
--We need a trigger to maintain the computed SOUNDEX chain on any insert or update
CREATE TRIGGER RefreshComputeSoundex ON tbl_pat_soundex
FOR INSERT,UPDATE
AS
BEGIN
UPDATE s SET SortedSoundExPattern=dbo.ComputeSoundex(i.col_str)
FROM tbl_pat_soundex s
INNER JOIN inserted i ON s.ID=i.ID;
END
GO
--Test data
insert into tbl_pat_soundex(col_str) values
('Smith A Steve')
,('Steve A Smyth')
,('A Smeeth Stive')
,('Steve Smith A')
,('Smit Steve A')
,('Smit Steve') --no A
,('Smit A') --no Steve
,('Smit Smith Robert Peter A') --add noise
,('Shnugo'); --something else entirely
--Check the intermediate result
SELECT *
FROM tbl_pat_soundex
/*
+----+---------------------------+-----------------------+
| ID | col_str | SortedSoundExPattern |
+----+---------------------------+-----------------------+
| 1 | Smith A Steve | /A000/S310/S530/ |
+----+---------------------------+-----------------------+
| 2 | Steve A Smyth | /A000/S310/S530/ |
+----+---------------------------+-----------------------+
| 3 | A Smeeth Stive | /A000/S310/S530/ |
+----+---------------------------+-----------------------+
| 4 | Steve Smith A | /A000/S310/S530/ |
+----+---------------------------+-----------------------+
| 5 | Smit Steve A | /A000/S310/S530/ |
+----+---------------------------+-----------------------+
| 6 | Smit Steve | /S310/S530/ |
+----+---------------------------+-----------------------+
| 7 | Smit A | /A000/S530/ |
+----+---------------------------+-----------------------+
| 8 | Smit Smith Robert Peter A | /A000/P360/R163/S530/ |
+----+---------------------------+-----------------------+
| 9 | Shnugo | /S520/ |
+----+---------------------------+-----------------------+
*/
--Now we can start searching:
DECLARE @StringToSearch VARCHAR(MAX)=' A Steve';
WITH SplittedSearchString AS
(
SELECT soundexCode.value('text()[1]','nvarchar(max)') AS SoundExCode
FROM (SELECT CAST('<x>' + REPLACE(dbo.ComputeSoundex(@StringToSearch),'/','</x><x>') + '</x>' AS XML)) A(x)
CROSS APPLY x.nodes('/x[text()]') B(soundexCode)
)
SELECT a.ID,col_str
FROM tbl_pat_soundex a
INNER JOIN SplittedSearchString s On SortedSoundExPattern LIKE '%/' + s.SoundExCode + '/%'
GROUP BY ID,col_str
HAVING COUNT(ID)=(SELECT COUNT(*) FROM SplittedSearchString)
ORDER BY ID
GO
--Clean up
USE master;
GO
DROP DATABASE shnugo;
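Short explanation
This is how it works:
- The CTE uses the same function to return a SOUNDEX chain for all fragments of the input
- The query then does an INNER JOIN via a LIKE test -- this will be sloooooow...
- The final check is whether the number of hits equals the number of fragments.
One last hint: if you want to search for an exact match but still include different spellings, you can compare the two strings directly. You could even put an index on the new column SortedSoundExPattern. Because of the way it is built, variants such as "Steven A Smith" and "Steeven a Smit", and even a different order such as "Smith Steven A", all produce exactly the same pattern.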
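The direct comparison from the last hint could look roughly like the sketch below. It reuses tbl_pat_soundex, SortedSoundExPattern, and dbo.ComputeSoundex from the code above. One caveat: SQL Server does not allow a VARCHAR(MAX) column as an index key, so the sketch first shrinks the column. The length of 400 and the index name are assumptions, and longer patterns would no longer fit.

```sql
-- Assumption: every pattern fits in 400 characters, so the column can be an index key
-- (a VARCHAR(MAX) column cannot be used as an index key column).
ALTER TABLE tbl_pat_soundex ALTER COLUMN SortedSoundExPattern VARCHAR(400);
GO
CREATE NONCLUSTERED INDEX IX_SortedSoundExPattern   -- hypothetical index name
    ON tbl_pat_soundex (SortedSoundExPattern);
GO
-- Compute the pattern of the search string once, then compare with plain equality.
DECLARE @StringToSearch VARCHAR(MAX) = 'Smyth Steve A';
DECLARE @Pattern VARCHAR(400) = dbo.ComputeSoundex(@StringToSearch);

SELECT ID, col_str
FROM tbl_pat_soundex
WHERE SortedSoundExPattern = @Pattern
ORDER BY ID;
```

Because the pattern is sorted and de-duplicated, the order and spelling of the search parts should not matter. This variant only finds rows whose full set of codes matches, not rows that merely contain the searched parts.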