Damerau-Levenshtein 算法不适用于短字符串
Damerau-Levenshtein algorithm isn't working on short strings
我有一个 for 循环,它接受用户的输入和我字典中的一个键,并将它们传递给 Damerau-Levenshtein 函数,并根据距离,用字典键覆盖用户的输入(for 循环是循环遍历每个字典键)。这对于大于三个字符的字符串来说工作得很好,但如果字符串是三个或更少的字符,算法 returns 使用错误的键。这是 for 循环:
1950 For j = 0 To dict.Count - 1
1960 distance = DamerauLevenshtein(SplitStr(i), dict.Keys(j))
1970 'MsgBox dict.Keys(j) & vbCrLf & distance ' used for debugging
1980 If distance < 4 Then
1990 If distance < leastDist Then
2000 leastDist = distance
2010 SplitStr(i) = dict.Keys(j)
2020 End If
2030 End If
2040 Next
2050 MsgBox "The distance is: " & leastDist & vbCrLf & "The entered text was " & tempStr & vbCrLf & "The replaced word is " & SplitStr(i)
SplitStr(i) 保存来自拆分函数的用户输入。我随意选了4个好距离
我从a bytes.com forum post那里偷了算法。算法如下:
Function DamerauLevenshtein(str1, str2, Optional intSize = 256)
Dim intTotalLen, arrDistance, intLen1, intLen2, i, j, arrStr1, arrStr2, arrDA, intMini
Dim intDB, intI1, intJ1, intD
str1 = UCase(str1)
str2 = UCase(str2)
intLen1 = Len(str1)
intLen2 = Len(str2)
intTotalLen = intLen1 + intLen2
ReDim arrStr1(intLen1)
ReDim arrStr2(intLen2)
ReDim arrDA(intSize)
ReDim arrDistance(intLen1 + 2, intLen2 + 2)
arrDistance(0, 0) = intTotalLen
For i = 0 To intSize - 1
arrDA(i) = 0
Next
For i = 0 To intLen1
arrDistance(i + 1, 1) = i
arrDistance(i + 1, 0) = intTotalLen
Next
For i = 1 To intLen1
arrStr1(i - 1) = Asc(Mid(str1, i, 1))
Next
For j = 0 To intLen2
arrDistance(1, j + 1) = j
arrDistance(0, j + 1) = intTotalLen
Next
For j = 1 To intLen2
arrStr2(j - 1) = Asc(Mid(str2, j, 1))
Next
For i = 1 To intLen1
intDB = 0
For j = 1 To intLen2
intI1 = arrDA(arrStr2(j - 1))
intJ1 = intDB
If arrStr1(i - 1) = arrStr2(j - 1) Then
intD = 0
Else
intD = 1
End If
If intD = 0 Then intDB = j
intMini = arrDistance(i, j) + intD
If intMini > arrDistance(i + 1, j) + 1 Then intMini = arrDistance(i + 1, j) + 1
If intMini > arrDistance(i, j + 1) + 1 Then intMini = arrDistance(i, j + 1) + 1
If intMini > arrDistance(intI1, intJ1) + i - intI1 + j - intJ1 - 1 Then intMini = arrDistance(intI1, intJ1) + i - intI1 + j - intJ1 - 1
arrDistance(i + 1, j + 1) = intMini
Next
arrDA(arrStr1(i - 1)) = i
Next
DamerauLevenshtein = arrDistance(intLen1 + 1, intLen2 + 1)
End Function
如果我输入 "Cire" 算法正确 returns "CORE".
"Raman" returns "REMAN"
"Cosnigned" returns "已发货
但是,"Now"应该return"New"但是return"OCM".
"New"也returns "OCM"(所以距离应该是0,但是是2.)
"FP"应该是"FP"但是return是"OCM",距离是2
"DPF"应该是"DPF"但是return是"OCM",距离是2
我刚刚了解了该算法,所以我确信我遗漏了一些重要的东西,但我就是看不到它。想法?
我明白了。经过大量搜索,我发现 post 说编辑距离通常为 2。(他们没有具体说明为什么 2 很常见)
我将 if 语句从 4 切换为 2,现在所有问题术语都得到了更正。
我有一个 for 循环,它接受用户的输入和我字典中的一个键,并将它们传递给 Damerau-Levenshtein 函数,并根据距离,用字典键覆盖用户的输入(for 循环是循环遍历每个字典键)。这对于大于三个字符的字符串来说工作得很好,但如果字符串是三个或更少的字符,算法 returns 使用错误的键。这是 for 循环:
1950 For j = 0 To dict.Count - 1
1960 distance = DamerauLevenshtein(SplitStr(i), dict.Keys(j))
1970 'MsgBox dict.Keys(j) & vbCrLf & distance ' used for debugging
1980 If distance < 4 Then
1990 If distance < leastDist Then
2000 leastDist = distance
2010 SplitStr(i) = dict.Keys(j)
2020 End If
2030 End If
2040 Next
2050 MsgBox "The distance is: " & leastDist & vbCrLf & "The entered text was " & tempStr & vbCrLf & "The replaced word is " & SplitStr(i)
SplitStr(i) 保存来自拆分函数的用户输入。我随意选了4个好距离
我从a bytes.com forum post那里偷了算法。算法如下:
Function DamerauLevenshtein(str1, str2, Optional intSize = 256)
Dim intTotalLen, arrDistance, intLen1, intLen2, i, j, arrStr1, arrStr2, arrDA, intMini
Dim intDB, intI1, intJ1, intD
str1 = UCase(str1)
str2 = UCase(str2)
intLen1 = Len(str1)
intLen2 = Len(str2)
intTotalLen = intLen1 + intLen2
ReDim arrStr1(intLen1)
ReDim arrStr2(intLen2)
ReDim arrDA(intSize)
ReDim arrDistance(intLen1 + 2, intLen2 + 2)
arrDistance(0, 0) = intTotalLen
For i = 0 To intSize - 1
arrDA(i) = 0
Next
For i = 0 To intLen1
arrDistance(i + 1, 1) = i
arrDistance(i + 1, 0) = intTotalLen
Next
For i = 1 To intLen1
arrStr1(i - 1) = Asc(Mid(str1, i, 1))
Next
For j = 0 To intLen2
arrDistance(1, j + 1) = j
arrDistance(0, j + 1) = intTotalLen
Next
For j = 1 To intLen2
arrStr2(j - 1) = Asc(Mid(str2, j, 1))
Next
For i = 1 To intLen1
intDB = 0
For j = 1 To intLen2
intI1 = arrDA(arrStr2(j - 1))
intJ1 = intDB
If arrStr1(i - 1) = arrStr2(j - 1) Then
intD = 0
Else
intD = 1
End If
If intD = 0 Then intDB = j
intMini = arrDistance(i, j) + intD
If intMini > arrDistance(i + 1, j) + 1 Then intMini = arrDistance(i + 1, j) + 1
If intMini > arrDistance(i, j + 1) + 1 Then intMini = arrDistance(i, j + 1) + 1
If intMini > arrDistance(intI1, intJ1) + i - intI1 + j - intJ1 - 1 Then intMini = arrDistance(intI1, intJ1) + i - intI1 + j - intJ1 - 1
arrDistance(i + 1, j + 1) = intMini
Next
arrDA(arrStr1(i - 1)) = i
Next
DamerauLevenshtein = arrDistance(intLen1 + 1, intLen2 + 1)
End Function
如果我输入 "Cire" 算法正确 returns "CORE".
"Raman" returns "REMAN" "Cosnigned" returns "已发货
但是,"Now"应该return"New"但是return"OCM".
"New"也returns "OCM"(所以距离应该是0,但是是2.)
"FP"应该是"FP"但是return是"OCM",距离是2
"DPF"应该是"DPF"但是return是"OCM",距离是2
我刚刚了解了该算法,所以我确信我遗漏了一些重要的东西,但我就是看不到它。想法?
我明白了。经过大量搜索,我发现 post 说编辑距离通常为 2。(他们没有具体说明为什么 2 很常见)
我将 if 语句从 4 切换为 2,现在所有问题术语都得到了更正。