Compare names on exact overlap at indexed positions

Question

I have a list of names, and I want to pick out the ones that refer to the same person, such as "John Smith" and "J Smith".

difflib and .intersection are no help here, and neither is Levenshtein distance. If the input is:

John Smith
J Smith

the program should return "ok". If it is:

John Smith
Jane Smith

this is not "ok".

The comparison should start again at the next word after each space, so if it is:

M K Dolsen
Michael Klaus Dolsen

it is "ok". But if it is:

M L Dolsen
Michael Klaus Dolsen

this is not "ok".

Fuzzy regex is no help here either. How can this be solved in Python?

import difflib

def get_overlap(s1, s2):
    s = difflib.SequenceMatcher(None, s1, s2)
    pos_a, pos_b, size = s.find_longest_match(0, len(s1), 0, len(s2)) 
    return s1[pos_a:pos_a+size]

s1 = "John Smith"
s2 = "Jane Smith"

print(get_overlap(s1, s2))
# prints " Smith" (the longest common block, including the leading space)

import jellyfish
print(jellyfish.damerau_levenshtein_distance('John Smith', 'J Smith'))
# prints 3, while I just want to have "match ok"

Answer 1

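You could try the following Pythonic implementation. Fortunately, no fancy Levenshtein distance is needed. In short, split each name into words, then for each pair of potentially matching parts check whether: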

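The first characters of the two name parts match. If not, there is no match (e.g. John and Lucas, J and Lucas, or J and L).
Both name parts are longer than 1 character and differ. If so, there is no match (e.g. Jane and John).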

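If neither of these "no match" conditions is triggered, the two parts are treated as overlapping, e.g. J and Jane, or Lucas and Lucas. I have also commented the code extensively to show how everything works.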

def compare(name_one: str, name_two: str) -> bool:
    # Split on spaces
    names_one = name_one.split()
    names_two = name_two.split()
    # If the names have a different number of sub-names
    if len(names_one) != len(names_two):
        return False
    
    # Iterate over the potentially matching parts of the two names
    for name_part_one, name_part_two in zip(names_one, names_two):
        # There is NO match if the first characters of the two parts differ
        if name_part_one[0] != name_part_two[0]:
            return False
        # OR if both names are longer than 1 character *and* different
        if len(name_part_one) > 1 and len(name_part_two) > 1 and name_part_one != name_part_two:
            return False
    # Otherwise, they match
    return True

name_one = "M K Dolsen"
name_two = "Michael Klaus Dolsen"

print(compare(name_one, name_two))

In addition, here is a quick test suite to show that it works as expected:

pairs = [
    ("M K Dolsen", "Michael Klaus Dolsen"),
    ("M L Dolsen", "Michael Klaus Dolsen"),
    ("Michael K Dolsen", "Michael Klaus Dolsen"),
    ("Michael L Dolsen", "Michael Klaus Dolsen"),
    ("Michael Klaus Dolsen", "Michael Klaus Dolsen"),
    ("Michael Lucas Dolsen", "Michael Klaus Dolsen"),
    ("M K D", "M K D"),
    ("M L D", "M K D"),
    ("John Smith", "J Smith"),
    ("John Smith", "Jane Smith"),
    ("J S", "J S"),
    ("K S", "J S"),
]

for name_one, name_two in pairs:
    print(name_one, "&", name_two, "=", compare(name_one, name_two))

Output:

M K Dolsen & Michael Klaus Dolsen = True
M L Dolsen & Michael Klaus Dolsen = False
Michael K Dolsen & Michael Klaus Dolsen = True
Michael L Dolsen & Michael Klaus Dolsen = False
Michael Klaus Dolsen & Michael Klaus Dolsen = True
Michael Lucas Dolsen & Michael Klaus Dolsen = False
M K D & M K D = True
M L D & M K D = False
John Smith & J Smith = True
John Smith & Jane Smith = False
J S & J S = True
K S & J S = False

On inspection, this appears to output what you asked for.
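If you would rather have the checks fail loudly than read the printed lines by eye, the same pairs can be turned into assertions. This is only a minimal sketch: compare is repeated in condensed form so the block runs on its own, the expected values are copied from the output listed above, and the name expected is just illustrative.

```python
def compare(name_one: str, name_two: str) -> bool:
    # Condensed copy of the function above, repeated so this block is self-contained
    parts_one, parts_two = name_one.split(), name_two.split()
    if len(parts_one) != len(parts_two):
        return False
    for part_one, part_two in zip(parts_one, parts_two):
        # Different initials can never match
        if part_one[0] != part_two[0]:
            return False
        # Two spelled-out parts must be identical
        if len(part_one) > 1 and len(part_two) > 1 and part_one != part_two:
            return False
    return True

# (first name, second name, expected result), taken from the printed output
expected = [
    ("M K Dolsen", "Michael Klaus Dolsen", True),
    ("M L Dolsen", "Michael Klaus Dolsen", False),
    ("Michael K Dolsen", "Michael Klaus Dolsen", True),
    ("Michael Lucas Dolsen", "Michael Klaus Dolsen", False),
    ("John Smith", "J Smith", True),
    ("John Smith", "Jane Smith", False),
    ("K S", "J S", False),
]

for name_one, name_two, want in expected:
    assert compare(name_one, name_two) == want, (name_one, name_two)
print("all checks passed")
```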

To answer your comment on the question, which was:

how do you write def not for 2 lists given but for lines in a file, where argument on position 1 line1 should be compared with argument1 on position1 line2?

with open("file.txt", "r") as f:
    # splitlines() avoids the empty trailing entry that split("\n") leaves
    # when the file ends with a newline
    names = f.read().splitlines()

# names is now a list of all names, but we want each consecutive pair of lines compared together.
for name_one, name_two in zip(names[::2], names[1::2]):
    print(name_one, "&", name_two, "=", compare(name_one, name_two))

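Now, this may look a bit complicated, so let me explain. names[::2] is shorthand for names[0:len(names):2], where 0 is the start, len(names) is the stop, and 2 is the step. So this takes a slice of the names list that starts at index 0, runs to the end of the list, and picks every second element. names[1::2] does the same thing, but offset by 1.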

So, if names is ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'], then names[::2] is ['a', 'c', 'e', 'g'] and names[1::2] is ['b', 'd', 'f', 'h']. We then use Python's built-in zip function to "zip" these two lists together.

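Printing list(zip(names[::2], names[1::2])) gives [('a', 'b'), ('c', 'd'), ('e', 'f'), ('g', 'h')], which we can iterate over with a for loop as I showed. (In Python 3, zip returns a lazy iterator, so it has to be wrapped in list() to see the pairs.)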
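The slicing and zipping described above can be checked with a small standalone sketch. The variable names evens, odds and pairs are only illustrative.

```python
names = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']

evens = names[::2]   # start at index 0, take every second element
odds = names[1::2]   # start at index 1, take every second element

# zip() is lazy in Python 3, so wrap it in list() to see the pairs
pairs = list(zip(evens, odds))

print(evens)
print(odds)
print(pairs)
```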

Given a file.txt containing:

M K Dolsen
Michael Klaus Dolsen
M L Dolsen
Michael Klaus Dolsen
Michael K Dolsen
Michael Klaus Dolsen
Michael L Dolsen
Michael Klaus Dolsen
Michael Klaus Dolsen
Michael Klaus Dolsen
Michael Lucas Dolsen
Michael Klaus Dolsen

the program outputs:

M K Dolsen & Michael Klaus Dolsen = True
M L Dolsen & Michael Klaus Dolsen = False
Michael K Dolsen & Michael Klaus Dolsen = True
Michael L Dolsen & Michael Klaus Dolsen = False
Michael Klaus Dolsen & Michael Klaus Dolsen = True
Michael Lucas Dolsen & Michael Klaus Dolsen = False
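The slice-based pairing assumes that every line holds a name. If the file might contain blank lines, one possible variation is to filter them out and pass the same iterator to zip twice, so that each pair consumes two lines. This is a sketch under that assumption, not part of the original approach: pair_lines is a hypothetical helper, and io.StringIO stands in for the opened file so the example runs on its own.

```python
import io
from typing import Iterable, Iterator, Tuple

def pair_lines(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield consecutive (first, second) pairs of lines, ignoring blank lines."""
    stripped = (line.strip() for line in lines)
    non_blank = (line for line in stripped if line)
    # Passing the same iterator twice makes zip pull two items per pair
    return zip(non_blank, non_blank)

# Stand-in for open("file.txt"); note the blank line in the middle
sample = io.StringIO(
    "M K Dolsen\nMichael Klaus Dolsen\n\nM L Dolsen\nMichael Klaus Dolsen\n"
)
result = list(pair_lines(sample))
print(result)
```

Each pair can then be passed to compare as before. Note that a trailing line without a partner is silently dropped, just as it is with zip(names[::2], names[1::2]).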
