Fuzzy string matching optimization (not checking certain words) - Excel VBA function

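I have a function in Excel that calculates the Levenshtein distance between two strings (the number of insertions, deletions, and/or substitutions needed to transform one string into the other). I am using it as part of a project I am working on that involves "fuzzy string matching."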

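Below you will see the code for the LevenshteinDistance function and a valuePhrase function. The latter exists so that I can call the function from my spreadsheet. I took this from what I read in this thread.
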
'Calculate the Levenshtein Distance between two strings (the number of insertions,
'deletions, and substitutions needed to transform the first string into the second)

Public Function LevenshteinDistance(ByRef S1 As String, ByVal S2 As String) As Long
    Dim L1 As Long, L2 As Long, D() As Long 'Length of input strings and distance matrix
    Dim i As Long, j As Long, cost As Long 'loop counters and cost of 
        'substitution for current letter
    Dim cI As Long, cD As Long, cS As Long 'cost of next Insertion, Deletion and
        'Substitution

    L1 = Len(S1): L2 = Len(S2)
    ReDim D(0 To L1, 0 To L2)
    For i = 0 To L1: D(i, 0) = i: Next i
    For j = 0 To L2: D(0, j) = j: Next j

    For j = 1 To L2
        For i = 1 To L1
            cost = Abs(StrComp(Mid$(S1, i, 1), Mid$(S2, j, 1), vbTextCompare))
            cI = D(i - 1, j) + 1
            cD = D(i, j - 1) + 1
            cS = D(i - 1, j - 1) + cost
            If cI <= cD Then 'Insertion or Substitution
                If cI <= cS Then D(i, j) = cI Else D(i, j) = cS
            Else 'Deletion or Substitution
                If cD <= cS Then D(i, j) = cD Else D(i, j) = cS
            End If
        Next i
    Next j
    LevenshteinDistance = D(L1, L2)

End Function

Public Function valuePhrase#(ByRef S1$, ByRef S2$)

    valuePhrase = LevenshteinDistance(S1, S2)

End Function

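I execute this valuePhrase function in a table on one of my worksheets, where the column and row headers are names of insurance companies. Ideally, the smallest number in any given row (the shortest Levenshtein distance) should correspond to the column header whose insurance company name most closely matches the name in that row's header.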

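My problem is that the strings I am comparing are names of insurance companies. The code above strictly calculates the Levenshtein distance and is not tailored to this case. A simple example of why this might be a problem: the Levenshtein distance between two insurance company names can be quite small if both names share the words "insurance" and "company" (which, as you might expect, is common), even if the companies have completely different names in terms of their unique words. So I may want the function to ignore those words when comparing two strings.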

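I am new to VBA. Is there a way I can implement this fix in the code? As a secondary question, are there other unique problems that could arise from comparing the names of insurance companies? Thank you for your help!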

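Your whole question can be replaced by "How do I use the replace function in VBA?". In general, the algorithm in the question looks interesting, so I have done this for you. Simply add anything to the Array() in the function and it will work (just write the values in the array in lowercase):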

Public Function removeSpecificWords(s As String) As String

 Dim arr     As Variant
 Dim cnt     As Long

 arr = Array("insurance", "company", "firma", "firm", "holding")
 removeSpecificWords = s

 For cnt = LBound(arr) To UBound(arr)
  removeSpecificWords = Replace(LCase(removeSpecificWords), LCase(arr(cnt)), vbNullString)
 Next cnt

End Function

Public Sub TestMe()

    Debug.Print removeSpecificWords("InsHolding")              'prints "ins"
    Debug.Print removeSpecificWords("InsuranceInsHoldingStar") 'prints "insstar"

End Sub

In your case:

    S1 = removeSpecificWords(S1)
    S2 = removeSpecificWords(S2)
    valuePhrase = LevenshteinDistance(S1, S2)
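
One caveat: Replace strips the text wherever it occurs, so "firm" would also be removed from the middle of "Confirm". If that matters for your data, here is a minimal sketch of a whole-word variant. It assumes the words in a name are separated by single or repeated spaces; the function name RemoveStopWords and its stop list are hypothetical and not part of the code above.

```vba
' Hypothetical helper: drops whole words found in a stop list and leaves
' words that merely contain one of them (e.g. "confirm") untouched.
Public Function RemoveStopWords(ByVal s As String) As String
    Dim stopWords As Variant
    Dim words() As String
    Dim kept As String
    Dim i As Long, k As Long
    Dim isStop As Boolean

    ' Assumed stop list; extend it to suit your data (lowercase entries).
    stopWords = Array("insurance", "company", "firm", "holding")

    ' Lowercase the name and split it into space-separated words.
    words = Split(LCase$(s), " ")

    For i = LBound(words) To UBound(words)
        ' Empty tokens come from repeated spaces; treat them as removable.
        isStop = (Len(words(i)) = 0)
        For k = LBound(stopWords) To UBound(stopWords)
            If words(i) = stopWords(k) Then isStop = True: Exit For
        Next k
        If Not isStop Then kept = kept & " " & words(i)
    Next i

    ' Drop the leading space added by the concatenation above.
    RemoveStopWords = Mid$(kept, 2)
End Function
```

Punctuation is not handled here, so "Insurance," with a trailing comma would be kept; you would need to strip punctuation first if your names contain it.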

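When I had a similar problem while trying to remove duplicate addresses, I approached it from the other direction and used the Longest Common Substring (strictly speaking, the function below computes the longest common subsequence, since the matched characters do not have to be adjacent).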

Function DetermineLCS(source As String, target As String) As Double
    Dim results() As Long
    Dim sourceLen As Long
    Dim targetLen As Long
    Dim counter1 As Long
    Dim counter2 As Long

    sourceLen = Len(source)
    targetLen = Len(target)

    ReDim results(0 To sourceLen, 0 To targetLen)

    For counter1 = 1 To sourceLen
        For counter2 = 1 To targetLen
            If Mid$(source, counter1, 1) = Mid$(target, counter2, 1) Then
                results(counter1, counter2) = results(counter1 - 1, counter2 - 1) + 1
            Else
                results(counter1, counter2) = WorksheetFunction.Max(results(counter1, _
                        counter2 - 1), results(counter1 - 1, counter2))
            End If
        Next counter2
    Next counter1

    'return the ratio of the LCS length to the length of the source string
    '(0 when source is empty, which also avoids a division by zero)
    If sourceLen > 0 Then DetermineLCS = results(sourceLen, targetLen) / sourceLen
End Function

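For addresses, I have found that a match of around 80% gets me close to a hundred percent correct matches. With insurance agency names (I used to work in the industry, so I know the problem you face), I might suggest a target of 90%, or even a mix of the Levenshtein distance and LCS, minimizing the former while maximizing the latter.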
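
A rough sketch of what such a mix could look like, in case it helps. It assumes LevenshteinDistance and DetermineLCS from this thread sit in the same module. The name HybridScore and the equal weighting of the two measures are my own assumptions, so treat the result as a starting point to tune rather than a proven formula.

```vba
' Hypothetical sketch: averages a normalised Levenshtein similarity and the
' LCS ratio into one score between 0 (nothing in common) and 1 (identical).
' Relies on LevenshteinDistance and DetermineLCS defined earlier in the thread.
Public Function HybridScore(ByVal a As String, ByVal b As String) As Double
    Dim longer As String, shorter As String
    Dim levSim As Double, lcsRatio As Double

    ' Lowercase both inputs so the two measures treat case the same way,
    ' and put the longer string first so both ratios share one denominator.
    If Len(a) >= Len(b) Then
        longer = LCase$(a): shorter = LCase$(b)
    Else
        longer = LCase$(b): shorter = LCase$(a)
    End If

    ' Handle empty input up front so nothing divides by zero.
    If Len(shorter) = 0 Then
        If Len(longer) = 0 Then HybridScore = 1 Else HybridScore = 0
        Exit Function
    End If

    ' The distance can never exceed the longer length, so this stays in 0..1.
    levSim = 1 - LevenshteinDistance(longer, shorter) / Len(longer)

    ' DetermineLCS divides by the length of its first argument.
    lcsRatio = DetermineLCS(longer, shorter)

    ' Equal weights are an arbitrary assumption; adjust to taste.
    HybridScore = (levSim + lcsRatio) / 2
End Function
```

On a worksheet you could then call it as =HybridScore($A2, B$1) and only accept the best match in a row if it clears whatever threshold you settle on, for example the 90% mentioned above.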