不删减单词的最长公共子串

Question

我是编程新手，我正在尝试解决 Java 中最长的常见 sequence/substring 问题之一。所以我正在研究的算法问题是在不切割单词的情况下找到最长的公共子串。

例如：给定 string1 = He had 3 pennies and 5 quarters 和 string2 = q3nniesp 应该 return pennies。

其他示例：string1 = They named the new place face cafe 和 string2 = e face，输出将是 e face cafe。

我正在尝试弄清楚这个算法，但我无法决定是否需要将它们转换为 char 数组或将其评估为字符串。两个字符串都可以有空格的方式让我感到困惑。

我关注了一些现有的 Whosebug 问题并尝试修改此代码 https://www.geeksforgeeks.org/:

static String findLongestSubsequence(String str1, String str2) {

        char[] A = str1.toCharArray();
        char[] B = str2.toCharArray();
        if (A == null || B == null) return null;

        final int n = A.length;
        final int m = B.length;

        if (n == 0 || m == 0) return null;

        int[][] dp = new int[n+1][m+1];

        // Suppose A = a1a2..an-1an and B = b1b2..bn-1bn
        for (int i = 1; i <= n; i++ ) {
            for (int j = 1; j <= m; j++) {

                // If ends match the LCS(a1a2..an-1an, b1b2..bn-1bn) = LCS(a1a2..an-1, b1b2..bn-1) + 1
                if (A[i-1] == B[j-1]) dp[i][j] = dp[i-1][j-1] + 1;

                    // If the ends do not match the LCS of a1a2..an-1an and b1b2..bn-1bn is
                    // max( LCS(a1a2..an-1, b1b2..bn-1bn), LCS(a1a2..an-1an, b1b2..bn-1) )
                else dp[i][j] = Math.max(dp[i-1][j], dp[i][j-1]);

            }
        }

        int lcsLen = dp[n][m];
        char[] lcs = new char[lcsLen];
        int index = 0;

        // Backtrack to find a LCS. We search for the cells
        // where we included an element which are those with
        // dp[i][j] != dp[i-1][j] and dp[i][j] != dp[i][j-1])
        int i = n, j = m;
        while (i >= 1 && j >= 1) {

            int v = dp[i][j];

            // The order of these may output different LCSs
            while(i > 1 && dp[i-1][j] == v) i--;
            while(j > 1 && dp[i][j-1] == v) j--;

            // Make sure there is a match before adding
            if (v > 0) lcs[lcsLen - index++ - 1] = A[i-1]; // or B[j-1];

            i--; j--;

        }

        return new String(lcs, 0, lcsLen);
    }

但我总是得到错误的输出。例如第一个输出给出 output = 3nnies，我真的被困在这一点上，任何人都可以提供帮助或一点点算法吗？谢谢大家。

Answer 1

我尝试了你原来的算法，不幸的是，它的方向不对。

我假设以下准则适用：

匹配的子字符串包含给定子字符串中可能乱序的字符。
给定子字符串中的一个字符可能会在匹配的子字符串中出现多次。

因此，我冒昧地在使用 java 流时使用了蛮力算法：

// complexity of O(N*M), where N is the length of the original string and M is the length of the substring
static String longestCommonSequence(String string, String sub) {
    List<Character> primaryMatch = new ArrayList<>();
    List<Character> secondaryMatch = new ArrayList<>();
    // N iterations loop on original string
    for(char c : string.toCharArray()) {
      // M iterations loop on the substring
      if(sub.indexOf(c) != -1) {
        primaryMatch.add(c);
      }
      else {
        if(!primaryMatch.isEmpty()) {
          // replace secondaryMatch content with the current longet match
          if(secondaryMatch.size() <= primaryMatch.size()) {
            secondaryMatch.clear();
            secondaryMatch.addAll(primaryMatch);
          }
          primaryMatch.clear();
        }
      }
    }
    if(primaryMatch.size() < secondaryMatch.size()) {
      return secondaryMatch.stream().map(String::valueOf).collect(Collectors.joining());
    }
    return primaryMatch.stream().map(String::valueOf).collect(Collectors.joining());
}

您提供的输入的输出是：

string1 = He had 3 pennies and 5 quarters and string2 = q3nniesp ---> pennies
string1 = They named the new place face cafe and string2 = e face ---> ace face cafe

注意第二个输出中的差异 - 根据您描述的输出行为，上述算法的结果是正确的，因为 ace face cafe 比 e face cafe 长，因为多次使用允许给定子字符串中的字符数。

注意算法中的一个细微问题： if(secondaryMatch.size() <= primaryMatch.size())

如果有多个相同长度的匹配子字符串，当前的实现将return最后一个匹配（基于原始字符串中的字符顺序）。如果您希望 return 第一个匹配项，请将 <= 替换为 <。

如果不允许我描述的假设 - 请对此答案发表评论，我会根据您的规格进行更新。

不删减单词的最长公共子串

Longest Common Substring without cutting a word

java

algorithm

longest-substring