Damerau-Levenshtein 算法不适用于短字符串

Damerau-Levenshtein algorithm isn't working on short strings

我有一个 for 循环,它接受用户的输入和我字典中的一个键,并将它们传递给 Damerau-Levenshtein 函数,并根据距离,用字典键覆盖用户的输入(for 循环是循环遍历每个字典键)。这对于大于三个字符的字符串来说工作得很好,但如果字符串是三个或更少的字符,算法 returns 使用错误的键。这是 for 循环:

    1950        For j = 0 To dict.Count - 1
    1960            distance = DamerauLevenshtein(SplitStr(i), dict.Keys(j))
    1970            'MsgBox dict.Keys(j) & vbCrLf & distance ' used for debugging
    1980            If distance < 4 Then
    1990                If distance < leastDist Then
    2000                    leastDist = distance
    2010                    SplitStr(i) = dict.Keys(j)
    2020                End If
    2030            End If
    2040        Next
    2050        MsgBox "The distance is: " & leastDist & vbCrLf & "The entered text was " & tempStr & vbCrLf & "The replaced word is " & SplitStr(i)

SplitStr(i) 保存来自拆分函数的用户输入。我随意选了4个好距离

我从a bytes.com forum post那里偷了算法。算法如下:

Function DamerauLevenshtein(str1, str2, Optional intSize = 256)
  Dim intTotalLen, arrDistance, intLen1, intLen2, i, j, arrStr1, arrStr2, arrDA, intMini
  Dim intDB, intI1, intJ1, intD

  str1 = UCase(str1)
  str2 = UCase(str2)
  intLen1 = Len(str1)
  intLen2 = Len(str2)
  intTotalLen = intLen1 + intLen2
  ReDim arrStr1(intLen1)
  ReDim arrStr2(intLen2)
  ReDim arrDA(intSize)
  ReDim arrDistance(intLen1 + 2, intLen2 + 2)
  arrDistance(0, 0) = intTotalLen

  For i = 0 To intSize - 1
      arrDA(i) = 0
  Next

  For i = 0 To intLen1
      arrDistance(i + 1, 1) = i
      arrDistance(i + 1, 0) = intTotalLen
  Next

  For i = 1 To intLen1
      arrStr1(i - 1) = Asc(Mid(str1, i, 1))
  Next

  For j = 0 To intLen2
      arrDistance(1, j + 1) = j
      arrDistance(0, j + 1) = intTotalLen
  Next

  For j = 1 To intLen2
      arrStr2(j - 1) = Asc(Mid(str2, j, 1))
  Next

  For i = 1 To intLen1
      intDB = 0

      For j = 1 To intLen2
          intI1 = arrDA(arrStr2(j - 1))
          intJ1 = intDB

          If arrStr1(i - 1) = arrStr2(j - 1) Then
              intD = 0
          Else
              intD = 1
          End If

          If intD = 0 Then intDB = j

          intMini = arrDistance(i, j) + intD
          If intMini > arrDistance(i + 1, j) + 1 Then intMini = arrDistance(i + 1, j) + 1
          If intMini > arrDistance(i, j + 1) + 1 Then intMini = arrDistance(i, j + 1) + 1
          If intMini > arrDistance(intI1, intJ1) + i - intI1 + j - intJ1 - 1 Then intMini = arrDistance(intI1, intJ1) + i - intI1 + j - intJ1 - 1

          arrDistance(i + 1, j + 1) = intMini
      Next

      arrDA(arrStr1(i - 1)) = i
  Next

  DamerauLevenshtein = arrDistance(intLen1 + 1, intLen2 + 1)
End Function

如果我输入 "Cire" 算法正确 returns "CORE".

"Raman" returns "REMAN" "Cosnigned" returns "已发货

但是,"Now"应该return"New"但是return"OCM".

"New"也returns "OCM"(所以距离应该是0,但是是2.)

"FP"应该是"FP"但是return是"OCM",距离是2

"DPF"应该是"DPF"但是return是"OCM",距离是2

我刚刚了解了该算法,所以我确信我遗漏了一些重要的东西,但我就是看不到它。想法?

我明白了。经过大量搜索,我发现 post 说编辑距离通常为 2。(他们没有具体说明为什么 2 很常见)

我将 if 语句从 4 切换为 2,现在所有问题术语都得到了更正。