查找 "random" 字符串列表中至少 2 个元素的最长前缀

Finding longest prefix present on at least 2 elements on a list of "random" strings

给定一个字符串列表,例如:

myList = ["foo", "foobar", "football", "footbag", "bar"]

找到列表中至少 2 个字符串中出现的最长前缀:

Longest prefix is "footba" present in "football" and "footbag"

列表将通过输入来填充,并不是所有的输入都有一个共同的前缀。

要被视为一个选项,前缀出现在列表中的两个字符串上就足够了。如果有多个选项,它必须return最长的一个。

在我的研究中,我已经能够找到如何获得所有字符串的最长公共前缀,例如:

列表:["foo_a","foo_b","foo_c","fnord"]

输出:Longest common prefix is "f"

但是,我列表中的字符串甚至可能不是以相同的字母开头。

您可以构建 prefix trie 的森林,然后搜索具有两个(非空)子节点的最深节点的 "height"(距离根有多远)。该节点代表最长公共前缀。

如果你不关心性能,你可以简单地迭代列表中的所有单词,并将它们中的每一个(它的前缀)与其余的进行比较,同时不断更新最大值:

def common_prefix_size(s1, s2):
    res, i = 0, 0
    while i < min(len(s1), len(s2)):
        if s1[i] == s2[i]:
            res += 1
            i += 1
        else:
            break
    return res



def longest_prefix(lst):
    res = ''
    maxsize = 0
    for i in range(len(lst) - 1):
        for j in range(i + 1, len(lst)):
            t = common_prefix_size(lst[i], lst[j])
            maxsize = max(maxsize, t)
            if maxsize == t:
                res = lst[i][:maxsize]
    return res

myList = ["foo", "foobar", "football", "footbag", "bar"]

print(longest_prefix(myList)) # footba

这是一个混乱的实现,对于大型列表来说效率不高,但它可以完成工作。我建议查看提到的前缀尝试,但如果您有一点时间,它们会工作得更好。

这从全尺寸单词向后工作,直到两个单词共享相同的前缀。它会截断单词的末尾并计算同一个单词出现的次数,如果至少出现 2 次,则 returns。

from collections import defaultdict

def list_to_text(x):
    x = list(map(str, x))
    first = '", "'.join(x[:-1]) #Join together (a, b, c)
    if first:
        return '" and "'.join((first, x[-1])) #Add the last element (a, b, c and d)
    return x[0] #Return a single value if list length is 1

def find_longest_prefix(x):
    x_max = len(max(x, key=len))
    for i in range(1, x_max)[::-1]:

        #Chop off the end of every word
        trim = [j[:i] for j in x]

        #Iterate through every unique value
        result = defaultdict(list)
        for j in set(trim):
            result[trim.count(j)].append(j)

        #Finish iterating if there are more than 2 words that share a prefix
        highest_count = max(result)
        if highest_count >= 2:
            prefix = result[highest_count]
            words = [j for k in prefix for j in x if j.startswith(k)]
            return prefix, words

myList = ["foo", "foobar", "football", "footbag", "bar"]
prefix, words = find_longest_prefix(myList)

#Put together the string
print('The longest common prefix{} "{}" present in "{}".'.format(' is' if len(prefix) ==1 else 'es are',
                                                                 list_to_text(prefix), list_to_text(words)))

它会根据结果的数量对字符串进行一些格式化。您的列表仍将导致:

The longest common prefix is "footba" present in "football" and "footbag".

但是添加另一个具有相同长度和结果数量的前缀将导致如下结果:

The longest common prefixes are "footba" and "testin" present in "football", "footbag", "testing" and "testin".