运行 vb 代码计算相似度时定义首字母缩略词
Define acronyms when running vb code calculating similarity
我正在使用 excel 中的以下 vb 代码来计算 A 列和 B 列之间的相似度。它运行良好。
我的下一步是定义首字母缩略词,这样计算出的相似度就不会受到影响。即:如果我在 A 列 "ABC LLC" 和 B 列 "ABC limited liability company" 中有,当前 vb 代码将 return 两列不是很相似。但是,我希望通过定义 "LLC" 和 "Limited Liability Company" 确实是同一件事,使它们 return 100% 相似。我可以做什么以及可以将其放在代码中的什么位置来完成此操作?谢谢!
免责声明 - 是的,我知道有加载项可以执行此操作。但是,我的数据集太大而无法使用它们。
Public Function Similarity(ByVal String1 As String, _
ByVal String2 As String, _
Optional ByRef RetMatch As String, _
Optional min_match = 1) As Single
'Returns percentile of similarity between 2 strings (ignores case)
'"RetMatch" returns the characters that match(in order)
'"min_match" specifies minimum number af char's in a row to match
Dim b1() As Byte, b2() As Byte
Dim lngLen1 As Long, lngLen2 As Long
Dim lngResult As Long
If UCase(String1) = UCase(String2) Then '..Exactly the same
Similarity = 1
Else '..one string is empty
lngLen1 = Len(String1)
lngLen2 = Len(String2)
If (lngLen1 = 0) Or (lngLen2 = 0) Then
Similarity = 0
Else '..otherwise find similarity
b1() = StrConv(UCase(String1), vbFromUnicode)
b2() = StrConv(UCase(String2), vbFromUnicode)
lngResult = Similarity_sub(0, lngLen1 - 1, _
0, lngLen2 - 1, _
b1, b2, _
String1, _
RetMatch, _
min_match)
Erase b1
Erase b2
If lngLen1 >= lngLen2 Then
Similarity = lngResult / lngLen1
Else
Similarity = lngResult / lngLen2
End If
End If
End If
End Function
Private Function Similarity_sub(ByVal start1 As Long, ByVal end1 As Long, _
ByVal start2 As Long, ByVal end2 As Long, _
ByRef b1() As Byte, ByRef b2() As Byte, _
ByVal FirstString As String, _
ByRef RetMatch As String, _
ByVal min_match As Long, _
Optional recur_level As Integer = 0) As Long
'* CALLED BY: Similarity * (RECURSIVE)
Dim lngCurr1 As Long, lngCurr2 As Long
Dim lngMatchAt1 As Long, lngMatchAt2 As Long
Dim i As Long
Dim lngLongestMatch As Long, lngLocalLongestMatch As Long
Dim strRetMatch1 As String, strRetMatch2 As String
If (start1 > end1) Or (start1 < 0) Or (end1 - start1 + 1 < min_match) _
Or (start2 > end2) Or (start2 < 0) Or (end2 - start2 + 1 < min_match) Then
Exit Function '(exit if start/end is out of string, or length is too short)
End If
For lngCurr1 = start1 To end1 '(for each char of first string)
For lngCurr2 = start2 To end2 '(for each char of second string)
i = 0
Do Until b1(lngCurr1 + i) <> b2(lngCurr2 + i) 'as long as chars DO match..
i = i + 1
If i > lngLongestMatch Then '..if longer than previous best, store starts & length
lngMatchAt1 = lngCurr1
lngMatchAt2 = lngCurr2
lngLongestMatch = i
End If
If (lngCurr1 + i) > end1 Or (lngCurr2 + i) > end2 Then Exit Do
Loop
Next lngCurr2
Next lngCurr1
If lngLongestMatch < min_match Then Exit Function 'no matches at all, so no point checking for sub-matches!
lngLocalLongestMatch = lngLongestMatch 'call again for BEFORE + AFTER
RetMatch = ""
'Find longest match BEFORE the current position
lngLongestMatch = lngLongestMatch _
+ Similarity_sub(start1, lngMatchAt1 - 1, _
start2, lngMatchAt2 - 1, _
b1, b2, _
FirstString, _
strRetMatch1, _
min_match, _
recur_level + 1)
If strRetMatch1 <> "" Then
RetMatch = RetMatch & strRetMatch1 & "*"
Else
RetMatch = RetMatch & IIf(recur_level = 0 _
And lngLocalLongestMatch > 0 _
And (lngMatchAt1 > 1 Or lngMatchAt2 > 1) _
, "*", "")
End If
'add local longest
RetMatch = RetMatch & Mid$(FirstString, lngMatchAt1 + 1, lngLocalLongestMatch)
'Find longest match AFTER the current position
lngLongestMatch = lngLongestMatch _
+ Similarity_sub(lngMatchAt1 + lngLocalLongestMatch, end1, _
lngMatchAt2 + lngLocalLongestMatch, end2, _
b1, b2, _
FirstString, _
strRetMatch2, _
min_match, _
recur_level + 1)
If strRetMatch2 <> "" Then
RetMatch = RetMatch & "*" & strRetMatch2
Else
RetMatch = RetMatch & IIf(recur_level = 0 _
And lngLocalLongestMatch > 0 _
And ((lngMatchAt1 + lngLocalLongestMatch < end1) _
Or (lngMatchAt2 + lngLocalLongestMatch < end2)) _
, "*", "")
End If
'Return result
Similarity_sub = lngLongestMatch
End Function
无需过多参与您的解决方案,这是您自己的责任,我可以建议一些方法来合并这些 缩写。然而。 请注意,此方法不能保证 100% 成功,但您已经处于模糊的世界中。
假设我们有一个 Dictionary
其中:
- 关键是长短语
- 数值为缩写
在比较两个字符串之前,我们最小化它们,方法是用缩写替换每个出现的长短语。然后我们可以将它们与您方法的其余部分进行比较Similarity
(或通过任何其他方法)。
' Fills an abbreviation dictionary
Sub InitializeDict(ByRef abbrev As Scripting.Dictionary)
abbrev("limited liability company") = "LLC"
abbrev("United Kingdom") = "U.K."
'... Add all abbreviations into dict
' Instead of harcoding, you can better load the key/value
' pairs from a dedicated worksheet...
End Sub
' Minimizes s by putting abbreviations
Sub Abbreviate(ByRef s As String)
Static abbrev As Scripting.Dictionary ' <-- static, inititlized only once
If abbrev Is Nothing Then
Set abbrev = CreateObject("Scripting.Dictionary")
abbrev.CompareMode = vbTextCompare
InitializeDict abbrev
End If
Dim phrase
For Each phrase In abbrev.Keys
s = Replace(s, phrase, abbrev(phrase), vbTextCompare)
Next
End Sub
' A small amendment to this function: abbreviate strings before comparing
Public Function Similarity(ByVal String1 As String, _
ByVal String2 As String, _
Optional ByRef RetMatch As String, _
Optional min_match = 1) As Single
Abbreviate String1
Abbreviate String2
' ... Rest of the routine
End Function
检查字符串是否 Like
可能更容易。例如
If "ABC limited liability company" Like "ABC L*L*C*" Then
为真,因为 *
匹配任何 0 个或多个字符。
Option Compare Text ' makes string comparisons case insensitive
Function areLike(str1 As String, str2 As String) As Single
If str1 = str2 Then areLike = 1: Exit Function
Dim pattern As String, temp As String
If LenB(str1) < LenB(str2) Then
pattern = str1
temp = str2
Else
pattern = str2
temp = str1
End If
pattern = StrConv(pattern, vbUnicode) ' "ABC LLC" to "A␀B␀C␀ ␀L␀L␀C␀"
pattern = Replace(pattern, vbNullChar, "*") ' "A*B*C* *L*L*C*"
pattern = Replace(pattern, " *", " ") ' "A*B*C* L*L*C*"
If temp Like pattern Then areLike = 1: Exit Function
' else areLike = some other similarity function
End Function
我正在使用 excel 中的以下 vb 代码来计算 A 列和 B 列之间的相似度。它运行良好。
我的下一步是定义首字母缩略词,这样计算出的相似度就不会受到影响。即:如果我在 A 列 "ABC LLC" 和 B 列 "ABC limited liability company" 中有,当前 vb 代码将 return 两列不是很相似。但是,我希望通过定义 "LLC" 和 "Limited Liability Company" 确实是同一件事,使它们 return 100% 相似。我可以做什么以及可以将其放在代码中的什么位置来完成此操作?谢谢!
免责声明 - 是的,我知道有加载项可以执行此操作。但是,我的数据集太大而无法使用它们。
Public Function Similarity(ByVal String1 As String, _
ByVal String2 As String, _
Optional ByRef RetMatch As String, _
Optional min_match = 1) As Single
'Returns percentile of similarity between 2 strings (ignores case)
'"RetMatch" returns the characters that match(in order)
'"min_match" specifies minimum number af char's in a row to match
Dim b1() As Byte, b2() As Byte
Dim lngLen1 As Long, lngLen2 As Long
Dim lngResult As Long
If UCase(String1) = UCase(String2) Then '..Exactly the same
Similarity = 1
Else '..one string is empty
lngLen1 = Len(String1)
lngLen2 = Len(String2)
If (lngLen1 = 0) Or (lngLen2 = 0) Then
Similarity = 0
Else '..otherwise find similarity
b1() = StrConv(UCase(String1), vbFromUnicode)
b2() = StrConv(UCase(String2), vbFromUnicode)
lngResult = Similarity_sub(0, lngLen1 - 1, _
0, lngLen2 - 1, _
b1, b2, _
String1, _
RetMatch, _
min_match)
Erase b1
Erase b2
If lngLen1 >= lngLen2 Then
Similarity = lngResult / lngLen1
Else
Similarity = lngResult / lngLen2
End If
End If
End If
End Function
Private Function Similarity_sub(ByVal start1 As Long, ByVal end1 As Long, _
ByVal start2 As Long, ByVal end2 As Long, _
ByRef b1() As Byte, ByRef b2() As Byte, _
ByVal FirstString As String, _
ByRef RetMatch As String, _
ByVal min_match As Long, _
Optional recur_level As Integer = 0) As Long
'* CALLED BY: Similarity * (RECURSIVE)
Dim lngCurr1 As Long, lngCurr2 As Long
Dim lngMatchAt1 As Long, lngMatchAt2 As Long
Dim i As Long
Dim lngLongestMatch As Long, lngLocalLongestMatch As Long
Dim strRetMatch1 As String, strRetMatch2 As String
If (start1 > end1) Or (start1 < 0) Or (end1 - start1 + 1 < min_match) _
Or (start2 > end2) Or (start2 < 0) Or (end2 - start2 + 1 < min_match) Then
Exit Function '(exit if start/end is out of string, or length is too short)
End If
For lngCurr1 = start1 To end1 '(for each char of first string)
For lngCurr2 = start2 To end2 '(for each char of second string)
i = 0
Do Until b1(lngCurr1 + i) <> b2(lngCurr2 + i) 'as long as chars DO match..
i = i + 1
If i > lngLongestMatch Then '..if longer than previous best, store starts & length
lngMatchAt1 = lngCurr1
lngMatchAt2 = lngCurr2
lngLongestMatch = i
End If
If (lngCurr1 + i) > end1 Or (lngCurr2 + i) > end2 Then Exit Do
Loop
Next lngCurr2
Next lngCurr1
If lngLongestMatch < min_match Then Exit Function 'no matches at all, so no point checking for sub-matches!
lngLocalLongestMatch = lngLongestMatch 'call again for BEFORE + AFTER
RetMatch = ""
'Find longest match BEFORE the current position
lngLongestMatch = lngLongestMatch _
+ Similarity_sub(start1, lngMatchAt1 - 1, _
start2, lngMatchAt2 - 1, _
b1, b2, _
FirstString, _
strRetMatch1, _
min_match, _
recur_level + 1)
If strRetMatch1 <> "" Then
RetMatch = RetMatch & strRetMatch1 & "*"
Else
RetMatch = RetMatch & IIf(recur_level = 0 _
And lngLocalLongestMatch > 0 _
And (lngMatchAt1 > 1 Or lngMatchAt2 > 1) _
, "*", "")
End If
'add local longest
RetMatch = RetMatch & Mid$(FirstString, lngMatchAt1 + 1, lngLocalLongestMatch)
'Find longest match AFTER the current position
lngLongestMatch = lngLongestMatch _
+ Similarity_sub(lngMatchAt1 + lngLocalLongestMatch, end1, _
lngMatchAt2 + lngLocalLongestMatch, end2, _
b1, b2, _
FirstString, _
strRetMatch2, _
min_match, _
recur_level + 1)
If strRetMatch2 <> "" Then
RetMatch = RetMatch & "*" & strRetMatch2
Else
RetMatch = RetMatch & IIf(recur_level = 0 _
And lngLocalLongestMatch > 0 _
And ((lngMatchAt1 + lngLocalLongestMatch < end1) _
Or (lngMatchAt2 + lngLocalLongestMatch < end2)) _
, "*", "")
End If
'Return result
Similarity_sub = lngLongestMatch
End Function
无需过多参与您的解决方案,这是您自己的责任,我可以建议一些方法来合并这些 缩写。然而。 请注意,此方法不能保证 100% 成功,但您已经处于模糊的世界中。
假设我们有一个 Dictionary
其中:
- 关键是长短语
- 数值为缩写
在比较两个字符串之前,我们最小化它们,方法是用缩写替换每个出现的长短语。然后我们可以将它们与您方法的其余部分进行比较Similarity
(或通过任何其他方法)。
' Fills an abbreviation dictionary
Sub InitializeDict(ByRef abbrev As Scripting.Dictionary)
abbrev("limited liability company") = "LLC"
abbrev("United Kingdom") = "U.K."
'... Add all abbreviations into dict
' Instead of harcoding, you can better load the key/value
' pairs from a dedicated worksheet...
End Sub
' Minimizes s by putting abbreviations
Sub Abbreviate(ByRef s As String)
Static abbrev As Scripting.Dictionary ' <-- static, inititlized only once
If abbrev Is Nothing Then
Set abbrev = CreateObject("Scripting.Dictionary")
abbrev.CompareMode = vbTextCompare
InitializeDict abbrev
End If
Dim phrase
For Each phrase In abbrev.Keys
s = Replace(s, phrase, abbrev(phrase), vbTextCompare)
Next
End Sub
' A small amendment to this function: abbreviate strings before comparing
Public Function Similarity(ByVal String1 As String, _
ByVal String2 As String, _
Optional ByRef RetMatch As String, _
Optional min_match = 1) As Single
Abbreviate String1
Abbreviate String2
' ... Rest of the routine
End Function
检查字符串是否 Like
可能更容易。例如
If "ABC limited liability company" Like "ABC L*L*C*" Then
为真,因为 *
匹配任何 0 个或多个字符。
Option Compare Text ' makes string comparisons case insensitive
Function areLike(str1 As String, str2 As String) As Single
If str1 = str2 Then areLike = 1: Exit Function
Dim pattern As String, temp As String
If LenB(str1) < LenB(str2) Then
pattern = str1
temp = str2
Else
pattern = str2
temp = str1
End If
pattern = StrConv(pattern, vbUnicode) ' "ABC LLC" to "A␀B␀C␀ ␀L␀L␀C␀"
pattern = Replace(pattern, vbNullChar, "*") ' "A*B*C* *L*L*C*"
pattern = Replace(pattern, " *", " ") ' "A*B*C* L*L*C*"
If temp Like pattern Then areLike = 1: Exit Function
' else areLike = some other similarity function
End Function