是否有像 Levenshtein Distance 这样专门针对一组有序字符的模糊搜索算法?
Is there an algorithm for fuzzy search like Levenshtein Distance specialised for a set of ordered character?
我找到了一个算法(在 https://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance 上),在阅读了更多关于 levenshtein 的内容后,我明白如果这些字符串严格由 ascii 组成,应该有更好的方法来告诉两个字符串的编辑距离-按字母顺序排列且唯一的字符。
意思是,对于每个 a 和 b,如 a < b,a 将先于 b,并且对于每个 a、b 和 c,如 a < b < c,如果一个字符串读作ac,另一个读作ab,那么肯定知道第一个不包含b。
这恰恰意味着有一种更好的方法来确定此类两个字符串之间的编辑距离。
如果有用的话,我用来组织角色的 class 是角色树集。
这是我想出的解决方案:
我将它与值一起使用:
测试字符串,字符串目标,1、0。
/** returns the cost of the difference between a tested CharSequence and a target CharSequence. CS = CharSequence
* @param tested input, the CS which will be compared to the target. all letters sorted by ASCII order and unique
* @param target is the CS to which the tested will be compared. all letters sorted by ASCII order and unique
* @param positiveDifferenceCost is the cost to add when a letter is in the tested CS but not in the target.
* @param negativeDifferenceCost is the cost to add when a letter is in the target CS but not in the tested.
* @return int the number of differences.
*/
public static int oneSidedOAUDistance(final CharSequence tested, final CharSequence target,
final int positiveDifferenceCost, final int negativeDifferenceCost) {
int diffCount = 0;
int index_tested = 0;
int index_target = 0;
if (positiveDifferenceCost == 0 && negativeDifferenceCost == 0)
return 0;
for (; index_tested < tested.length() && index_target < target.length(); ) {
if (tested.charAt(index_tested) == target.charAt(index_target)) {
++index_tested;
++index_target;
continue;
}
if (tested.charAt(index_tested) < target.charAt(index_target)) {
//some letters should not be there in tested string.
diffCount+= positiveDifferenceCost;
index_tested++;
} else {
//some letters miss in tested string.
diffCount+=negativeDifferenceCost;
index_target++;
}
}
return diffCount;
}
我找到了一个算法(在 https://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance 上),在阅读了更多关于 levenshtein 的内容后,我明白如果这些字符串严格由 ascii 组成,应该有更好的方法来告诉两个字符串的编辑距离-按字母顺序排列且唯一的字符。
意思是,对于每个 a 和 b,如 a < b,a 将先于 b,并且对于每个 a、b 和 c,如 a < b < c,如果一个字符串读作ac,另一个读作ab,那么肯定知道第一个不包含b。
这恰恰意味着有一种更好的方法来确定此类两个字符串之间的编辑距离。
如果有用的话,我用来组织角色的 class 是角色树集。
这是我想出的解决方案: 我将它与值一起使用: 测试字符串,字符串目标,1、0。
/** returns the cost of the difference between a tested CharSequence and a target CharSequence. CS = CharSequence
* @param tested input, the CS which will be compared to the target. all letters sorted by ASCII order and unique
* @param target is the CS to which the tested will be compared. all letters sorted by ASCII order and unique
* @param positiveDifferenceCost is the cost to add when a letter is in the tested CS but not in the target.
* @param negativeDifferenceCost is the cost to add when a letter is in the target CS but not in the tested.
* @return int the number of differences.
*/
public static int oneSidedOAUDistance(final CharSequence tested, final CharSequence target,
final int positiveDifferenceCost, final int negativeDifferenceCost) {
int diffCount = 0;
int index_tested = 0;
int index_target = 0;
if (positiveDifferenceCost == 0 && negativeDifferenceCost == 0)
return 0;
for (; index_tested < tested.length() && index_target < target.length(); ) {
if (tested.charAt(index_tested) == target.charAt(index_target)) {
++index_tested;
++index_target;
continue;
}
if (tested.charAt(index_tested) < target.charAt(index_target)) {
//some letters should not be there in tested string.
diffCount+= positiveDifferenceCost;
index_tested++;
} else {
//some letters miss in tested string.
diffCount+=negativeDifferenceCost;
index_target++;
}
}
return diffCount;
}