在 Java 中使用 Levenshtein 距离改善搜索结果

Improving search result using Levenshtein distance in Java

我有以下有效的 Java 代码,用于根据单词列表搜索单词,它工作得很好,符合预期:

public class Levenshtein {
    private int[][] wordMartix;

    public Set similarExists(String searchWord) {

        int maxDistance = searchWord.length();
        int curDistance;
        int sumCurMax;
        String checkWord;

        // preventing double words on returning list
        Set<String> fuzzyWordList = new HashSet<>();

        for (Object wordList : Searcher.wordList) {
            checkWord = String.valueOf(wordList);
            curDistance = calculateDistance(searchWord, checkWord);
            sumCurMax = maxDistance + curDistance;
            if (sumCurMax == checkWord.length()) {
                fuzzyWordList.add(checkWord);
            }
        }
        return fuzzyWordList;
    }

    public int calculateDistance(String inputWord, String checkWord) {
        wordMartix = new int[inputWord.length() + 1][checkWord.length() + 1];

        for (int i = 0; i <= inputWord.length(); i++) {
            wordMartix[i][0] = i;
        }

        for (int j = 0; j <= checkWord.length(); j++) {
            wordMartix[0][j] = j;
        }

        for (int i = 1; i < wordMartix.length; i++) {
            for (int j = 1; j < wordMartix[i].length; j++) {
                if (inputWord.charAt(i - 1) == checkWord.charAt(j - 1)) {
                    wordMartix[i][j] = wordMartix[i - 1][j - 1];
                } else {
                    int minimum = Integer.MAX_VALUE;
                    if ((wordMartix[i - 1][j]) + 1 < minimum) {
                        minimum = (wordMartix[i - 1][j]) + 1;
                    }

                    if ((wordMartix[i][j - 1]) + 1 < minimum) {
                        minimum = (wordMartix[i][j - 1]) + 1;
                    }

                    if ((wordMartix[i - 1][j - 1]) + 1 < minimum) {
                        minimum = (wordMartix[i - 1][j - 1]) + 1;
                    }

                    wordMartix[i][j] = minimum;
                }
            }
        }

        return wordMartix[inputWord.length()][checkWord.length()];
    }

}

现在,当我搜索像 job 这样的词时,它 return 是一个列表:

输出

joborienterede
jobannoncer
jobfunktioner
perjacobsen
jakobsen
jobprofiler
jacob
jobtitler
jobbet
jobdatabaserne
jobfunktion
jakob
jobs
studenterjobber
johannesburg
jobmuligheder
jobannoncerne
jobbaser
job
joberfaringer

如您所见,输出有很多相关词,但也有不相关的词,如 jakobjacob 等,这对于 Levenshtein 公式是正确的,但我会喜欢进一步构建并编写一个可以微调我的搜索的方法,这样我可以获得更多相关和相关的词。

我已经为此工作了几个小时,却失去了创造力。

我的问题:是否可以将现有方法微调到returnrelevant/related字或者我应该采取另一种方法还是???在所有情况下都是“是”或“否”,我很感激能否获得有关改进搜索结果的意见和灵感?


更新

很久以前问过这个问题后,我还没有真正找到解决方案,所以我又回来了,因为现在是我需要一个有用答案的时候了,可以用[=提供答案76=] 代码示例,但最重要的是详细的答案,其中描述了用于索引最佳和最相关搜索结果并忽略 none 相关词 的可用方法和方法。我知道这是一个开放而无边无际的领域,但我需要一些灵感才能从哪里开始。

Note: The oldest answer right now is based on one of the comment inputs and is not helpful (useless), it just sorting the distance, that does not mean getting better search results/quality.

于是我做了距离排序,结果是这样的:

job
jobs
jacob
jakob
jobbet
jakobsen
jobbaser
jobtitler
jobannoncer
jobfunktion
jobprofiler
perjacobsen
johannesburg
jobannoncerne
joberfaringer
jobfunktioner
jobmuligheder
jobdatabaserne
joborienterede
studenterjobber

所以单词 jobbaser 是相关的,jacob/jakob 是不相关的,但是 jobbaser 的距离比 jacob/jakob 大。所以这并没有真正帮助。


关于答案的一般反馈


谢谢 我想亲自感谢所有为这个问题做出贡献的人,我得到了很好的答案和有用的评论。

特别感谢@SergioMontoro、@uSeemSurprised 和@Gene 的回答,这些是不同但有效且有用的答案。

@D.Kovács 指出了一些有趣的解决方案。

我希望我能对所有这些答案给予赏金。 选择一个答案并给予赏金,这并不意味着其他答案无效,但这仅意味着我选择的特定答案对我有用。

我尝试了评论中关于按 Levenshtein 算法返回的距离对匹配进行排序的建议,它似乎确实产生了更好的结果。

(因为我无法从您的代码中找到搜索器 class,所以我冒昧地使用了不同来源的词表、Levenshtein 实现和语言。)

使用 Ubuntu 中提供的单词列表,以及来自 - https://github.com/ztane/python-Levenshtein 的 Levenshtein 算法实现, 我创建了一个小脚本,它要求一个词并将所有最接近的词和距离打印为元组。

代码 - https://gist.github.com/atdaemon/9f59ad886c35024bdd28

from Levenshtein import distance
import os

def read_dict() :
    with open('/usr/share/dict/words','r') as f : 
        for line in f :
            yield str(line).strip()

inp = str(raw_input('Enter a word : '))

wordlist = read_dict()
matches = []
for word in wordlist :
    dist = distance(inp,word)
    if dist < 3 :
        matches.append((dist,word))
print os.linesep.join(map(str,sorted(matches)))

示例输出 -

Enter a word : job
(0, 'job')
(1, 'Bob')
(1, 'Job')
(1, 'Rob')
(1, 'bob')
(1, 'cob')
(1, 'fob')
(1, 'gob')
(1, 'hob')
(1, 'jab')
(1, 'jib')
(1, 'jobs')
(1, 'jog')
(1, 'jot')
(1, 'joy')
(1, 'lob')
(1, 'mob')
(1, 'rob')
(1, 'sob')
...

Enter a word : checker
(0, 'checker')
(1, 'checked')
(1, 'checkers')
(2, 'Becker')
(2, 'Decker')
(2, 'cheaper')
(2, 'cheater')
(2, 'check')
(2, "check's")
(2, "checker's")
(2, 'checkered')
(2, 'checks')
(2, 'checkup')
(2, 'cheeked')
(2, 'cheekier')
(2, 'cheer')
(2, 'chewer')
(2, 'chewier')
(2, 'chicer')
(2, 'chicken')
(2, 'chocked')
(2, 'choker')
(2, 'chucked')
(2, 'cracker')
(2, 'hacker')
(2, 'heckler')
(2, 'shocker')
(2, 'thicker')
(2, 'wrecker')

在不理解@DrYap 所建议的单词含义的情况下,比较两个单词的下一个逻辑单元(如果您不寻找拼写错误)是音节。很容易修改 Levenshtein 来比较音节而不是字符。困难的部分是将单词分解成音节。有一个 Java 实现 TeXHyphenator-J which can be used to split the words. Based on this hyphenation library, here is a modified version of Levenshtein function written by Michael Gilleland & Chas Emerick. More about syllable detection here and here。当然,您需要避免两个单音节单词的音节比较,这可能是使用标准 Levenshtein 处理这种情况的原因。

import net.davidashen.text.Hyphenator;

public class WordDistance {

    public static void main(String args[]) throws Exception {
        Hyphenator h = new Hyphenator();
        h.loadTable(WordDistance.class.getResourceAsStream("hyphen.tex"));
        getSyllableLevenshteinDistance(h, args[0], args[1]);
    }

    /**
     * <p>
     * Calculate Syllable Levenshtein distance between two words </p>
     * The Syllable Levenshtein distance is defined as the minimal number of
     * case-insensitive syllables you have to replace, insert or delete to transform word1 into word2.
     * @return int
     * @throws IllegalArgumentException if either str1 or str2 is <b>null</b>
     */
    public static int getSyllableLevenshteinDistance(Hyphenator h, String s, String t) {
        if (s == null || t == null)
            throw new NullPointerException("Strings must not be null");

        final String hyphen = Character.toString((char) 173);
        final String[] ss = h.hyphenate(s).split(hyphen);
        final String[] st = h.hyphenate(t).split(hyphen);

        final int n = ss.length;
        final int m = st.length;

        if (n == 0)
            return m;
        else if (m == 0)
            return n;

        int p[] = new int[n + 1]; // 'previous' cost array, horizontally
        int d[] = new int[n + 1]; // cost array, horizontally

        for (int i = 0; i <= n; i++)
            p[i] = i;

        for (int j = 1; j <= m; j++) {
            d[0] = j;
            for (int i = 1; i <= n; i++) {
                int cost = ss[i - 1].equalsIgnoreCase(st[j - 1]) ? 0 : 1;
                // minimum of cell to the left+1, to the top+1, diagonally left and up +cost
                d[i] = Math.min(Math.min(d[i - 1] + 1, p[i] + 1), p[i - 1] + cost);
            }
            // copy current distance counts to 'previous row' distance counts
            int[] _d = p;
            p = d;
            d = _d;
        }

        // our last action in the above loop was to switch d and p, so p now actually has the most recent cost counts
        return p[n];
    }

}

您可以通过调整连续字符匹配时的评分来修改编辑距离。

只要有连续的字符匹配,就可以降低分数,从而使搜索更相关。

例如:假设我们要将分数降低的因子是 10 那么如果在单词中找到子字符串 "job" 我们可以在遇到 "j" 时将分数降低 10当我们找到字符串 "jo" 时进一步将其减少 (10 + 20),当我们找到 "job".

时最终将分数减少 (10 + 20 + 30)

我在下面写了一个c++代码:

#include <bits/stdc++.h>

#define INF -10000000
#define FACTOR 10

using namespace std;

double memo[100][100][100];

double Levenshtein(string inputWord, string checkWord, int i, int j, int count){
    if(i == inputWord.length() && j == checkWord.length()) return 0;    
    if(i == inputWord.length()) return checkWord.length() - j;
    if(j == checkWord.length()) return inputWord.length() - i;
    if(memo[i][j][count] != INF) return memo[i][j][count];

    double ans1 = 0, ans2 = 0, ans3 = 0, ans = 0;
    if(inputWord[i] == checkWord[j]){
        ans1 = Levenshtein(inputWord, checkWord, i+1, j+1, count+1) - (FACTOR*(count+1));
        ans2 = Levenshtein(inputWord, checkWord, i+1, j, 0) + 1;
        ans3 = Levenshtein(inputWord, checkWord, i, j+1, 0) + 1;
        ans = min(ans1, min(ans2, ans3));
    }else{
        ans1 = Levenshtein(inputWord, checkWord, i+1, j, 0) + 1;
        ans2 = Levenshtein(inputWord, checkWord, i, j+1, 0) + 1;
        ans = min(ans1, ans2);
    }
    return memo[i][j][count] = ans;
}

int main(void) {
    // your code goes here
    string word = "job";
    string wordList[40];
    vector< pair <double, string> > ans;
    for(int i = 0;i < 40;i++){
        cin >> wordList[i];
        for(int j = 0;j < 100;j++) for(int k = 0;k < 100;k++){
            for(int m = 0;m < 100;m++) memo[j][k][m] = INF;
        }
        ans.push_back( make_pair(Levenshtein(word, wordList[i], 
            0, 0, 0), wordList[i]) );
    }
    sort(ans.begin(), ans.end());
    for(int i = 0;i < ans.size();i++){
        cout << ans[i].second << " " << ans[i].first << endl;
    }
    return 0;
}

Link 演示:http://ideone.com/4UtCX3

这里的FACTOR取10,你可以换个词试试,选择合适的值。

另请注意,上述 Levenshtein 距离的复杂性也有所增加,现在是 O(n^3) 而不是 O(n^2),因为现在我们还跟踪计算连续字符数的计数器我们遇到过。

您可以通过在找到一些连续的子字符串然后不匹配后逐渐增加分数来进一步调整分数,而不是我们将固定分数 1 添加到总分数的当前方式。

此外,在上述解决方案中,您还可以删除分数 >=0 的字符串,因为它们根本不相关,您还可以选择其他一些阈值来进行更准确的搜索。

这确实是一个开放式问题,但我建议使用一种替代方法,例如使用 Smith-Waterman algorithm as described in .

另一种(更轻量级)解决方案是使用来自 NLP 的其他 distance/similarity 指标(例如,Cosine similarity or Damerau–Levenshtein distance)。

既然你问了,我就展示一下 UMBC semantic network 是如何处理这种事情的。不确定这是否是您真正想要的:

import static java.lang.String.format;
import static java.util.Comparator.comparingDouble;
import static java.util.stream.Collectors.toMap;
import static java.util.function.Function.identity;

import java.util.Map.Entry;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Arrays;
import java.util.regex.Pattern;

public class SemanticSimilarity {
  private static final String GET_URL_FORMAT
      = "http://swoogle.umbc.edu/SimService/GetSimilarity?"
          + "operation=api&phrase1=%s&phrase2=%s";
  private static final Pattern VALID_WORD_PATTERN = Pattern.compile("\w+");
  private static final String[] DICT = {
    "cat",
    "building",
    "girl",
    "ranch",
    "drawing",
    "wool",
    "gear",
    "question",
    "information",
    "tank" 
  };

  public static String httpGetLine(String urlToRead) throws IOException {
    URL url = new URL(urlToRead);
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("GET");
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(conn.getInputStream()))) {
      return reader.readLine();
    }
  }

  public static double getSimilarity(String a, String b) {
    if (!VALID_WORD_PATTERN.matcher(a).matches()
        || !VALID_WORD_PATTERN.matcher(b).matches()) {
      throw new RuntimeException("Bad word");
    }
    try {
      return Double.parseDouble(httpGetLine(format(GET_URL_FORMAT, a, b)));
    } catch (IOException | NumberFormatException ex) {
      return -1.0;
    }
  }

  public static void test(String target) throws IOException {
    System.out.println("Target: " + target);
    Arrays.stream(DICT)
        .collect(toMap(identity(), word -> getSimilarity(target, word)))
        .entrySet().stream()
        .sorted((a, b) -> Double.compare(b.getValue(), a.getValue()))
        .forEach(System.out::println);
    System.out.println();
  }

  public static void main(String[] args) throws Exception {
    test("sheep");
    test("vehicle");
    test("house");
    test("data");
    test("girlfriend");
  }
}

结果有点令人着迷:

Target: sheep
ranch=0.38563728
cat=0.37816614
wool=0.36558008
question=0.047607
girl=0.0388761
information=0.027191084
drawing=0.0039623436
tank=0.0
building=0.0
gear=0.0

Target: vehicle
tank=0.65860236
gear=0.2673374
building=0.20197356
cat=0.06057514
information=0.041832563
ranch=0.017701812
question=0.017145569
girl=0.010708235
wool=0.0
drawing=0.0

Target: house
building=1.0
ranch=0.104496084
tank=0.103863
wool=0.059761923
girl=0.056549154
drawing=0.04310725
cat=0.0418914
gear=0.026439993
information=0.020329408
question=0.0012588014

Target: data
information=0.9924584
question=0.03476312
gear=0.029112043
wool=0.019744944
tank=0.014537057
drawing=0.013742204
ranch=0.0
cat=0.0
girl=0.0
building=0.0

Target: girlfriend
girl=0.70060706
ranch=0.11062875
cat=0.09766617
gear=0.04835723
information=0.02449007
wool=0.0
question=0.0
drawing=0.0
tank=0.0
building=0.0