How do I find the percentage of similarity between two multiline Strings?

Question

I have two multiline strings. I am using the code below to determine the similarity between them. It uses the Levenshtein distance algorithm.

  public static double similarity(String s1, String s2) {
    String longer = s1, shorter = s2;
    if (s1.length() < s2.length()) { 
      longer = s2; shorter = s1;
    }
    int longerLength = longer.length();
    if (longerLength == 0) { return 1.0; /* both strings are zero length */ }

    return (longerLength - editDistance(longer, shorter)) / (double) longerLength;

  }

  public static int editDistance(String s1, String s2) {
    s1 = s1.toLowerCase();
    s2 = s2.toLowerCase();

    int[] costs = new int[s2.length() + 1];
    for (int i = 0; i <= s1.length(); i++) {
      int lastValue = i;
      for (int j = 0; j <= s2.length(); j++) {
        if (i == 0)
          costs[j] = j;
        else {
          if (j > 0) {
            int newValue = costs[j - 1];
            if (s1.charAt(i - 1) != s2.charAt(j - 1))
              newValue = Math.min(Math.min(newValue, lastValue),
                  costs[j]) + 1;
            costs[j - 1] = lastValue;
            lastValue = newValue;
          }
        }
      }
      if (i > 0)
        costs[s2.length()] = lastValue;
    }
    return costs[s2.length()];
  }

However, the code above does not work as expected.

For example, suppose we have the following two strings, S1 and S2:

S1 -> How do we optimize the performance? . What should we do to compare both strings to find the percentage of similarity between both?

S2 -> How do we optimize tje performance? What should we do to compare both strings to find the percentage of similarity between both?

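I then pass the strings above to the similarity method, but it does not find the exact percentage of difference. How can I optimize the algorithm?

Below is my main method.

Update: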

public static boolean authQuestion(String question) throws SQLException{


        boolean isQuestionAvailable = false;
        Connection dbCon = null;
        try {
            dbCon = MyResource.getConnection();
            String query = "SELECT * FROM WORDBANK where WORD ~*  ?;";
            PreparedStatement checkStmt = dbCon.prepareStatement(query);
            checkStmt.setString(1, question);
            ResultSet rs = checkStmt.executeQuery();
            while (rs.next()) {
                double re=similarity( rs.getString("question"), question);
                if(re  > 0.6){
                    isQuestionAvailable = true;
                }else {
                    isQuestionAvailable = false;
                }
            }
        } catch (URISyntaxException e1) {
            e1.printStackTrace();
        } catch (SQLException sqle) {
            sqle.printStackTrace();
        } catch (Exception e) {
            if (dbCon != null)
                dbCon.close();
        } finally {
            if (dbCon != null)
                dbCon.close();
        }

        return isQuestionAvailable;
    }

Answer 1

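I can suggest one approach...

You are using edit distance, which gives you the number of characters in S1 that need to be changed, added, or removed to turn it into S2.

So, for example: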

S1 = "abc"
S2 = "cde"

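The edit distance is 3, and the strings are 100% different (assuming you are doing some kind of character-by-character comparison).

So you can get an approximate percentage of difference like this: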

S1 = "abc"
S2 = "cde"
edit = edit_distance(S1, S2)
percentage = min(edit/S1.length(), edit/S2.length())

The min is a workaround for cases where the strings differ greatly, for example:

S1 = "abc"
S2 = "defghijklmno"

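Here the edit distance is greater than the length of S1, so dividing by the length of S1 would give a percentage above 100%. Dividing by the larger length therefore works better.

Hope this helps.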
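The idea above can be sketched in Java roughly as follows. This is only an illustration, not the asker's code: the class name `DifferencePercentage` and the method `differencePercent` are made up for this example, and the distance is computed with a plain case-sensitive Levenshtein implementation. Taking the min of the two ratios is the same as dividing by the longer length, which is what the sketch does.

```java
final class DifferencePercentage {

    // Plain Levenshtein distance using two rolling rows (case-sensitive).
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // building the first j chars of b from "" takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // reducing the first i chars of a to "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int substitution = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                int deletion = prev[j] + 1;
                int insertion = curr[j - 1] + 1;
                curr[j] = Math.min(substitution, Math.min(deletion, insertion));
            }
            int[] tmp = prev; // swap rows so prev always holds the last finished row
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Percentage of difference in [0, 100]: edit distance divided by the longer length.
    static double differencePercent(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        if (longest == 0) {
            return 0.0; // two empty strings are identical
        }
        return 100.0 * editDistance(a, b) / longest;
    }

    public static void main(String[] args) {
        System.out.println(differencePercent("abc", "cde"));          // prints 100.0
        System.out.println(differencePercent("abc", "defghijklmno")); // prints 100.0
    }
}
```

Because the edit distance can never exceed the length of the longer string, the result stays within 0 to 100 even for the "abc" versus "defghijklmno" case.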

Answer 2

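Your similarity method returns a number between 0 and 1 (inclusive), where 1 means the strings are identical (edit distance of zero).

However, in your authQuestion method, you act as though it returns a number between 0 and 100, as evidenced by this line: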

if(re > 60){

You need to change it to

if(re > .6){

or to

if(re * 100 > 60){
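One way to keep the two scales from getting mixed up is to name the threshold and do the comparison in a single place. The sketch below is only an illustration; `ThresholdScale`, `SIMILARITY_THRESHOLD`, `isSimilarEnough` and `toPercent` are hypothetical names, not part of the original code.

```java
final class ThresholdScale {

    // Threshold on the same 0..1 scale that similarity(...) returns (assumed value from the question).
    static final double SIMILARITY_THRESHOLD = 0.6;

    // Compares a 0..1 similarity score against the 0..1 threshold.
    static boolean isSimilarEnough(double similarity) {
        return similarity > SIMILARITY_THRESHOLD;
    }

    // Converts a 0..1 similarity score to a 0..100 percentage, for display only.
    static double toPercent(double similarity) {
        return similarity * 100.0;
    }
}
```

With this in place, a score such as 0.97 passes the check, whereas comparing the raw score against 60 could never succeed because the score is at most 1.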

Answer 3

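Since you are using the whole of S1 in the where clause of your SQL query, it will either find a perfect match or return nothing at all.

As @ErwinBolwidt pointed out, if it returns nothing, then your isQuestionAvailable will always remain false. If it returns a perfect match, then you are bound to get 100% similarity.

What you can do is use a substring of S1 to search for questions that match that part.

You can make the following change: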

authQuestion method

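checkStmt.setString(1, question.substring(0, Math.min(20, question.length()))); // e.g. the first 20 characters

Among the fetched results, you can then compare each one against your question for similarity.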
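A rough sketch of that comparison step, leaving the database out, might look like the following. Everything here is an assumption for illustration: `CandidateMatcher`, `searchKey` and `anySimilar` are hypothetical names, the candidates are passed in as a plain list rather than read from a `ResultSet`, and the similarity function is supplied by the caller (for example the `similarity` method from the question).

```java
import java.util.List;
import java.util.function.ToDoubleBiFunction;

final class CandidateMatcher {

    // Returns at most the first maxLength characters of the question,
    // so that short questions do not cause an out-of-bounds substring.
    static String searchKey(String question, int maxLength) {
        return question.substring(0, Math.min(maxLength, question.length()));
    }

    // Returns true as soon as one candidate scores above the threshold.
    // The similarity function is expected to return a value between 0 and 1.
    static boolean anySimilar(List<String> candidates,
                              String question,
                              double threshold,
                              ToDoubleBiFunction<String, String> similarity) {
        for (String candidate : candidates) {
            if (similarity.applyAsDouble(candidate, question) > threshold) {
                return true; // stop here so a later, weaker candidate cannot undo the match
            }
        }
        return false;
    }
}
```

Returning on the first hit is a deliberate choice in this sketch: in a loop that assigns the flag on every row, the final value depends only on the last row fetched.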

java

algorithm

levenshtein-distance