How do I find what is similar in multiple strings in Python?

Question

I want to find out what is similar across a few strings. For example, say I have these strings:

HELLO3456
helf04g
hell0r
h31l0

I want to get what these strings have in common. In this case, I would like it to tell me something like:

h is always at the start

This example is simple enough that I can work it out in my head, but with something like this:

61TvA2dNwxNxmWziZxKzR5aO9tFD00Nj
pHHlgpFt8Ka3Stb5UlTxcaEwciOeF2QM
fW9K4luEx65RscfUiPDakiqp15jiK5f6
17xz7MYEBoXLPoi8RdqbgkPwTV2T2H0y
Jvt0B5uZIDPJ5pbCqMo12CqD7pdnMSEd
n7voYT0TVVzZGVSLaQNRnnkkWgVqxA3b

it is not so easy. I have already seen and tried:

Find the similarity metric between two strings

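among others, but none of them is what I am looking for. They give a value for how similar the strings are; I need to know what is similar about them.

I would like to know whether this is possible and, if so, how I can do it. Thanks in advance.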

Answer 1

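I think this is the solution you want. I added an "a" to the start of each string, because otherwise the strings you mentioned have nothing in common. The comparison is case-insensitive and checks each position across all strings.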

lst = [
    "A61TvA2dNwxNxmWziZxKzR5aO9tFD00Nj",
    "apHHlgpFt8Ka3Stb5UlTxcaEwciOeF2QM",
    "afW9K4luEx65RscfUiPDakiqp15jiK5f6",
    "a17xz7MYEBoXLPoi8RdqbgkPwTV2T2H0y",
    "aJvt0B5uZIDPJ5pbCqMo12CqD7pdnMSEd",
    "an7voYT0TVVzZGVSLaQNRnnkkWgVqxA3b",
]
# Lowercase everything so that "A" and "a" count as the same character.
lst = [s.lower() for s in lst]

# zip(*lst) yields one tuple of characters per position and stops at the
# shortest string, so strings of different lengths do not raise IndexError.
for i, chars in enumerate(zip(*lst)):
    # A position is shared when every string has the same character there.
    if len(set(chars)) == 1:
        print(chars[0] + " is always at position " + str(i))
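
The same position-by-position comparison can also be wrapped in a small function, which makes it easy to try on the first example from the question. This is only a sketch: the name `common_positions` and the `ignore_case` parameter are made up for illustration, and it only compares positions that exist in every string.

```python
def common_positions(strings, ignore_case=True):
    """Return {index: char} for positions where every string has the same character."""
    if ignore_case:
        strings = [s.lower() for s in strings]
    # zip stops at the shortest string, so only positions present in
    # all strings are compared.
    return {
        i: chars[0]
        for i, chars in enumerate(zip(*strings))
        if len(set(chars)) == 1
    }


print(common_positions(["HELLO3456", "helf04g", "hell0r", "h31l0"]))  # {0: 'h'}
```

For the four short strings in the question this reports only position 0, which matches the expected "h is always at the start".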

Answer 2

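Minimal solution

You are on the right track with the difflib library. I picked just the first two strings from your second example to build a minimal solution.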

from difflib import SequenceMatcher


a = "61TvA2dNwxNxmWziZxKzR5aO9tFD00Nj"
b = "pHHlgpFt8Ka3Stb5UlTxcaEwciOeF2QM"

Sequencer = SequenceMatcher(None, a, b)

print(Sequencer.ratio())
matches = Sequencer.get_matching_blocks()
print(matches)

for match in matches:
    idx_a = match.a
    idx_b = match.b
    
    if not (idx_a == len(a) or idx_b == len(b)):
        print(30*'-' + 'Found Match' + 30*'-')
        print('found at idx {} of str "a" and at idx {} of str "b" the value {}'.format(idx_a, idx_b, a[idx_a]))

Output:

0.0625
[Match(a=2, b=18, size=1), Match(a=5, b=29, size=1), Match(a=32, b=32, size=0)]
------------------------------Found Match------------------------------
found at idx 2 of str "a" and at idx 18 of str "b" the value T
------------------------------Found Match------------------------------
found at idx 5 of str "a" and at idx 29 of str "b" the value 2

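Explanation

I only used ratio() to see whether there is any similarity at all. get_matching_blocks() returns a list of all matching blocks between the two sequences. My minimal solution does not care whether a match is at the same position in both strings, but that should be an easy fix by comparing the indices. Even when ratio() returns 0.0, the matcher does not produce an empty list: the list always ends with a zero-size match for the end of the sequences. I compare the match indices against the lengths of the sequences to skip that entry. Another option is to use only matches with size > 0, like this: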

if match.size > 0:
   ...

My example also does not handle matches of size > 1. I guess you will figure out how to handle that ;)
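
As a possible starting point for both open ends (matches longer than one character, and matches at the same position), here is a rough sketch. The function name `matching_substrings` and the `same_position_only` flag are invented for this example; only `SequenceMatcher` and `get_matching_blocks()` come from difflib. It slices the whole block out of `a` instead of printing only `a[idx_a]`, and it filters on `size` to drop the terminating entry.

```python
from difflib import SequenceMatcher


def matching_substrings(a, b, same_position_only=False):
    """Return (idx_a, idx_b, text) for every non-empty matching block of a and b."""
    matcher = SequenceMatcher(None, a, b)
    results = []
    for match in matcher.get_matching_blocks():
        if match.size == 0:
            continue  # skip the zero-size entry that always ends the list
        if same_position_only and match.a != match.b:
            continue  # keep only blocks that start at the same index in both strings
        results.append((match.a, match.b, a[match.a:match.a + match.size]))
    return results


print(matching_substrings("abcxyz", "abcqyz"))  # [(0, 0, 'abc'), (4, 4, 'yz')]
```

This still compares only two strings at a time, so for more than two strings the results would have to be combined pairwise.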
