如何使用 Rust、Python、javascript 找到所有常见的子字符串？

Question

问题背景

Git 是一个很棒的版本控制系统，我想通过编写自己的版本控制系统来学习 git。我要做的第一步是实现一个字符串差异工具。我读过这个 blog and this paper。为了获得两个字符串的差异，我需要找到公共部分。因此，我遇到了问题：如何找到两个字符串的所有公共子字符串？

这是我的问题的第一部分：算法问题。

这是我使用的算法：

算法题

【问题】找出string1和string2的所有公共子串

【求解】

将 string1 的所有子字符串与 string2 进行比较，然后将匹配项收集到答案中。
将 string2 的所有子字符串与 string1 进行比较，然后将匹配项收集到答案中。

这个算法是 O(N^2) 时间复杂度。

语言问题

为了证明我的想法，我可以很容易地用 Python 实现它：

def commonSubstringFinder(string1, string2):
    answer=[]
    len1, len2 = len(string1), len(string2)
    match = ""
    x,y = 0,0
    for i in range(len2):
        for j in range(len1):
            if ( i+j < len2) and (string1[j] == string2[i+j]):
                if len(match)==0:
                    x,y=i+j,j
                match += string1[j]
            else:
                if len(match)>0:
                    answer.append(((x,y), match))
                    match=""
    
    for i in range(1,len1):
        for j in range(len2):
            if (i+j<len1 and string1[i+j] == string2[j] ):
                if len(match)==0:
                    x,y=j,i+j
                match += string2[j]
            else:
                if len(match)>0:
                    answer.append(((x,y), match))
                    match=""
                
    return answer

print(commonSubstringFinder("apkleses", "appleses"))
print(commonSubstringFinder("cappleses", "caplekses"))

# [((0, 0), 'ap'), ((3, 3), 'leses'), ((2, 1), 'p'), ((6, 4), 'es'), ((4, 6), 'es')]
# [((0, 0), 'cap'), ((6, 6), 'ses'), ((7, 5), 'es'), ((2, 3), 'ple'), ((6, 8), 's'), ((4, 7), 'e')]

但是，我真的很难通过使用 Rust 编程语言来迁移这个算法。问题是 Rust 编程语言总是使用 iter（比如 string1.chars()）来获取字符而不是索引。

谁能帮帮我？

结论

我在解决这个问题的时候发现了一个有趣的事情：生物学家是如何找到两个DNA或RNA的共同部分的？

Answer 1

2021-10-27 更新

我发现 Google 很棒的回购：

https://github.com/google/diff-match-patch

但是没有rust语言。

旧答案

感谢您的回复。我在这里找到了一个简单的解决方案。

欢迎升级它:

#![allow(unused)]

#[derive(Debug)]
struct Common(u32, String);

pub fn common(old_content: &str, new_content: &str) {
    let mut commons: Vec<Common> = vec![];

    let mut sub = String::new();
    // cdx is the start index of common substring in old content.
    let mut cdx = 0u32;
    let mut idx = 0u32; // inner loop counter
    let mut odx = 0u32; // outer loop counter
    let mut new_iter = new_content.chars().peekable();
    while (!new_iter.peek().is_none()) {
        for (old_ch, new_ch) in old_content.chars().zip(new_iter.clone()) {
            if old_ch == new_ch {
                if sub.is_empty() {
                    cdx = idx;
                }
                sub.push(old_ch);
            } else {
                if sub.len() > 0 {
                    commons.push(Common(cdx, sub.clone()));
                    sub.clear();
                }
            }
            idx += 1;
        }
        new_iter.next();
        odx += 1;
        idx = 0;
    }

    odx = 1;
    idx = 0;
    let mut old_iter = old_content.chars().skip(1).peekable();
    while (!old_iter.peek().is_none()) {
        for (new_ch, old_ch) in new_content.chars().zip(old_iter.clone()) {
            if old_ch == new_ch {
                if sub.is_empty() {
                    cdx = odx + idx;
                }
                sub.push(old_ch);
            } else {
                if sub.len() > 0 {
                    commons.push(Common(cdx, sub.clone()));
                    sub.clear();
                }
            }
            idx += 1;
        }
        old_iter.next();
        odx += 1;
        idx = 0;
    }
    println!("{:?}", commons);
}

#[cfg(test)]
mod tests {
    use self::super::*;
    #[test]
    fn test_common() {
        // test with print output: cargo test -- --nocapture
        common("apkleses", "appleses");
        common("cappleses", "caplekses");
    }
}

该解决方案可能不会生锈，但它确实有效。

2021-11-3 更新

我找到一个非迭代器解决方案：将元素收集到一个 Vec 中。

如何使用 Rust、Python、javascript 找到所有常见的子字符串？

How can I find all common sub strings using Rust , Python, javascript?

python

string

iterator

substring

rust

问题背景

算法题

语言问题

结论

2021-10-27 更新

旧答案

2021-11-3 更新