How to find similar results in SQL Server?

Question

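I have these items in my database:

Harry Potter and the Chamber of Secrets
Harry Potter and the Deathly Hallows: Part 1
Harry Potter and the Deathly Hallows: Part 2
Harry Potter and the Goblet of Fire
Harry Potter and the Half-Blood Prince
Harry Potter and the Order of the Phoenix
Harry Potter and the Prisoner of Azkaban
Harry Potter and the Sorcerer's Stone
and other movies...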

When a user searches for 'hary poter' (instead of 'harry potter'), how can I return the items above?

Answer 1

You can use this query:

create table #test (v varchar(50) )

insert into #test (v) values
 ('Harry Potter and the Chamber of Secrets'       )
,('Harry Potter and the Deathly Hallows: Part 1'  )
,('Harry Potter and the Deathly Hallows: Part 2'  )
,('Harry Potter and the Goblet of Fire'           )
,('Harry Potter and the Half-Blood Prince'        )
,('Harry Potter and the Order of the Phoenix'     )
,('Harry Potter and the Prisoner of Azkaban'      )
,('Harry Potter and the Sorcerer''s Stone'        )


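-- PATINDEX returns the 1-based position of the first match, or 0 if there is none.
-- Note that the pattern is hard-coded for this title; it is not built from the user's input.
select * from #test
where PATINDEX('%[Hh]%ar[r]%y [pP]%ot[t]%er%', v)>0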
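The pattern in the query above is written by hand for one title. As a rough sketch of how a similar wildcard pattern might be derived from whatever the user typed, the snippet below puts a % between every character of the search term, so any title containing those characters in that order matches. The names @titles, @search, @pattern and @i, and the row 'Some Other Movie', are illustrative assumptions rather than part of the answer. It also assumes a case-insensitive collation and does not escape wildcard characters (%, _, [) that might appear in the input.

```sql
-- Illustrative data: two titles from the question plus one made-up non-matching row.
DECLARE @titles TABLE (v varchar(50));
INSERT INTO @titles (v) VALUES
 ('Harry Potter and the Goblet of Fire')
,('Harry Potter and the Half-Blood Prince')
,('Some Other Movie');  -- hypothetical row, not from the question

DECLARE @search  varchar(50)  = 'hary poter';  -- the user's misspelled input
DECLARE @pattern varchar(200) = '%';
DECLARE @i int = 1;

-- Build '%h%a%r%y% %p%o%t%e%r%': each typed character, in order, with anything in between.
WHILE @i <= LEN(@search)
BEGIN
    SET @pattern = @pattern + SUBSTRING(@search, @i, 1) + '%';
    SET @i = @i + 1;
END

SELECT v
FROM @titles
WHERE v LIKE @pattern;
```

This only tolerates missing characters. A search term with extra or transposed letters would not match, which is one reason a phonetic approach can be worth considering.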

Answer 2

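It's hard to find anything in SQL Server that is really well suited to this kind of thing. Fuzzy matching is difficult to get right when you need to catch misspellings while also trying to avoid bad matches.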

For example, here is one way you could try to do it:

DECLARE @ TABLE (id INT IDENTITY(1, 1), blah NVARCHAR(255));

INSERT @ VALUES ('Harry Potter and the Chamber of Secrets')
,('Harry Potter and the Deathly Hallows: Part 1')
,('Harry Potter and the Deathly Hallows: Part 2')
,('Harry Potter and the Goblet of Fire')
,('Harry Potter and the Half-Blood Prince')
,('Harry Potter and the Order of the Phoenix')
,('Harry Potter and the Prisoner of Azkaban')
,('Harry Potter and the Sorcerer''s Stone');

DECLARE @myVar NVARCHAR(255) = 'deadly halow'; -- returns 2 matches (both parts of Deathly Hallows)
-- SET @myVar = 'hary poter'; -- returns 8 matches, all of them
-- SET @myVar = 'order'; -- returns 1 match (Order of the Phoenix)
-- SET @myVar = 'phoneix'; -- returns 2 matches (Order of the Phoenix and Half-blood Prince, the latter due to a fuzzy match on 'prince')

-- CTE: every title plus the search term, which is tagged with id 0.
WITH CTE AS (
    SELECT id, blah
    FROM @
    UNION ALL
    SELECT 0, @myVar
    )
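-- CTE2: recursively split each string into words at the spaces.
-- cIndex holds the position of the space after the current word (NULL for the last word); L is the word number.
, CTE2 AS (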
    SELECT id
         , blah
         , SUBSTRING(blah, 1, ISNULL(NULLIF(CHARINDEX(' ', blah), 0) - 1, LEN(blah))) individualWord
         , NULLIF(CHARINDEX(' ', blah), 0) cIndex
         , 1 L
    FROM CTE
    UNION ALL 
    SELECT CTE.id
         , CTE.blah
         , SUBSTRING(CTE.blah, cIndex + 1, ISNULL(NULLIF(CHARINDEX(' ', CTE.blah, cIndex + 1), 0) - 1 - cIndex, LEN(CTE.blah)))
         , NULLIF(CHARINDEX(' ', CTE.blah, cIndex + 1), 0)
         , L + 1
    FROM CTE2
    JOIN CTE ON CTE.id = CTE2.id
    WHERE cIndex IS NOT NULL
    )
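-- Compare every title word with every search word using DIFFERENCE (0 to 4),
-- then keep the titles in which each search word has at least one close match (>= 3).
SELECT blah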
FROM (
    SELECT X.blah, ROW_NUMBER() OVER (PARTITION BY X.ID, Y.L ORDER BY (SELECT NULL)) RN, Y.wordCount
    FROM CTE2 X
    JOIN (SELECT *, COUNT(*) OVER() wordCount FROM CTE2 WHERE id = 0) Y ON DIFFERENCE(X.individualWord, Y.individualWord) >= 3 AND X.id <> 0) T
WHERE RN = 1
GROUP BY blah
HAVING COUNT(*) = MAX(wordCount);

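This splits out each word in the search term and each word in the titles, then uses the DIFFERENCE() function to compare the SOUNDEX() values of the words and tell you how far apart they are. For example, SOUNDEX('Halow') is 'H400' and SOUNDEX('Hallows') is 'H420'; the difference here is 3 (because the H, the 4 and one of the zeros match). A perfect match has a difference of 4, and close matches generally have a difference of 3 or higher.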
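The values quoted above can be checked directly. This is a minimal sketch that uses only the built-in SOUNDEX() and DIFFERENCE() functions; the column aliases are illustrative, and the exact SOUNDEX output may depend on the database compatibility level.

```sql
-- SOUNDEX returns a four-character phonetic code for a string.
-- DIFFERENCE compares the codes of its two arguments and returns
-- an integer from 0 (least similar) to 4 (most similar).
SELECT SOUNDEX('Halow')               AS halow_code,    -- 'H400' according to the answer
       SOUNDEX('Hallows')             AS hallows_code,  -- 'H420' according to the answer
       DIFFERENCE('Halow', 'Hallows') AS score;         -- 3 according to the answer
```

Because the query joins on DIFFERENCE(...) >= 3, a score of 3 is enough for 'halow' to count as a match for 'Hallows'.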

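Unfortunately, because you need to check for close matches, you will sometimes get false positives. For example, I tested it with 'phoneix' as input and got a match on 'Half-Blood Prince' because of the fuzzy match between 'prince' and 'phoenix'. I'm sure there are ways to improve on this, but something like this should serve as a basis for what you're trying to achieve.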
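To see where a false positive like this comes from, it can help to put the codes for the search word next to both candidate words. The following is only a diagnostic sketch with illustrative column aliases; it makes no claim about the exact scores beyond what the answer reports.

```sql
-- Compare the misspelled search word with the intended word and with the accidental match.
SELECT SOUNDEX('phoneix')               AS search_code,
       SOUNDEX('Phoenix')               AS intended_code,
       SOUNDEX('Prince')                AS unintended_code,
       DIFFERENCE('phoneix', 'Phoenix') AS intended_score,
       DIFFERENCE('phoneix', 'Prince')  AS unintended_score;  -- at least 3, per the answer's test
```

Tightening the join condition to require a score of 4 would probably drop the 'Prince' match, but it would also reject 'halow' against 'Hallows', which scores 3 in the example above. That is the trade-off between catching misspellings and avoiding bad matches.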
