Python find similar sequences in string

Question

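I want code that returns the sum of the lengths of all similar sequences in two strings. I wrote the code below, but it only returns the length of one of them.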

from difflib import SequenceMatcher
a='Apple Banana'
b='Banana Apple'
def similar(a,b):
    c = SequenceMatcher(None,a.lower(),b.lower()).get_matching_blocks()
    return sum( [c[i].size if c[i].size>1 else 0 for i in range(0,len(c)) ] )
print similar(a,b)

The output is: 6

I expected: 11

Answer 1

If we edit your code like this, it shows where the 6 comes from:

from difflib import SequenceMatcher
a='Apple Banana'
b='Banana Apple'
def similar(a,b):
    c = SequenceMatcher(None,a.lower(),b.lower()).get_matching_blocks()
    for block in c:
        print "a[%d] and b[%d] match for %d elements" % block
similar(a,b)  # the function prints its own report and returns nothing

a[6] and b[0] match for 6 elements

a[12] and b[12] match for 0 elements

Answer 2

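get_matching_blocks() starts from the longest contiguous matching subsequence and then only looks for further matches to the left and to the right of it in both strings. Here the longest match is 'banana', with length 6. 'apple' lies on opposite sides of 'banana' in the two strings, so it is never matched, and your sum comes out as 6.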
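
A small sketch of how the matcher gets to that result, written for Python 3. It calls find_longest_match() directly, first on the whole strings and then on the regions to the left and to the right of the block it found. The variable names longest, left and right are only for illustration.

```python
from difflib import SequenceMatcher

a = 'apple banana'
b = 'banana apple'
sm = SequenceMatcher(None, a, b)

# Longest contiguous match over the whole of both strings: 'banana'.
longest = sm.find_longest_match(0, len(a), 0, len(b))
print(longest)  # Match(a=6, b=0, size=6)

# Region left of the block: a[0:6] is 'apple ', but b[0:0] is empty.
left = sm.find_longest_match(0, longest.a, 0, longest.b)

# Region right of the block: a[12:12] is empty, b[6:12] is ' apple'.
right = sm.find_longest_match(longest.a + longest.size, len(a),
                              longest.b + longest.size, len(b))

print(left.size, right.size)  # 0 0
```

Both follow-up searches come back empty because 'apple' sits before 'banana' in a but after it in b.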

Try this:

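from difflib import SequenceMatcher

def similar(a,b):
    c = 'something' # Initialize this to anything to make the while loop condition pass for the first time
    total = 0  # named total so the built-in sum() is not shadowed

    while(len(c) != 1):
        c = SequenceMatcher(lambda x: x == ' ',a.lower(),b.lower()).get_matching_blocks()

        # Pick the largest matching block and add its size to the total
        sizes = [block.size for block in c]
        i = sizes.index(max(sizes))
        total += max(sizes)

        # Cut the matched part out of both strings before matching again
        a = a[0:c[i].a] + a[c[i].a + c[i].size:]
        b = b[0:c[i].b] + b[c[i].b + c[i].size:]

    return total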

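This "subtracts" the matched part from both strings and matches them again until len(c) is 1, which happens when no matches are left (only the zero-length terminator block remains).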

However, this script does not ignore spaces. To do that, I used the suggestion from this other SO answer: just preprocess the strings before passing them to the function:

a = 'Apple Banana'.replace(' ', '')
b = 'Banana Apple'.replace(' ', '')

You could also include this part in the function.
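
As a rough sketch of the same subtract-and-rematch idea with the space stripping folded in (Python 3; the name similar_total and the min_size parameter are my own additions, not part of difflib), the loop can call find_longest_match() directly instead of picking the largest entry from get_matching_blocks():

```python
from difflib import SequenceMatcher

def similar_total(a, b, min_size=1):
    """Sum the lengths of matches found by repeatedly removing the longest one.

    min_size is a hypothetical knob: matches shorter than it stop the loop,
    e.g. min_size=2 ignores single-character matches.
    """
    # Preprocess once: drop spaces and compare case-insensitively.
    a = a.replace(' ', '').lower()
    b = b.replace(' ', '').lower()

    total = 0
    while True:
        matcher = SequenceMatcher(None, a, b, autojunk=False)
        m = matcher.find_longest_match(0, len(a), 0, len(b))
        if m.size == 0 or m.size < min_size:
            break
        total += m.size
        # Cut the matched part out of both strings before matching again.
        a = a[:m.a] + a[m.a + m.size:]
        b = b[:m.b] + b[m.b + m.size:]
    return total

print(similar_total('Apple Banana', 'Banana Apple'))  # 11
```

Keep in mind that cutting a match out joins the text on either side of it, so a later match can span that seam.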

Answer 3

I made a small change to your code and it works great, thanks @Antimony

from difflib import SequenceMatcher

def similar(a,b):
    # Strip spaces up front so they never count as matches
    a=a.replace(' ', '')
    b=b.replace(' ', '')

    c = 'something' # Initialize this to anything to make the while loop condition pass for the first time
    total = 0  # named total so the built-in sum() is not shadowed
    while(len(c) != 1):
        c = SequenceMatcher(lambda x: x == ' ',a.lower(),b.lower()).get_matching_blocks()
        sizes = [block.size for block in c]
        i = sizes.index(max(sizes))
        total += max(sizes)
        # Cut the largest match out of both strings before matching again
        a = a[0:c[i].a] + a[c[i].a + c[i].size:]
        b = b[0:c[i].b] + b[c[i].b + c[i].size:]
    return total

Tags: python, text, difflib, python-2.7