Is there a Pythonic way of filtering substrings of strings in a list?

I have a list containing the following strings.

candidates = ["Hello", "World", "HelloWorld", "Foo", "bar", "ar"]

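I would like the list to be filtered down to ["HelloWorld", "Foo", "bar"], because the others are substrings of another entry. I can do it like this, but I don't think it is fast or elegant.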

def filter_not_substring(candidates):
    survive = []
    for a in candidates:
        for b in candidates:
            if a == b:
                continue
            if a in b:
                break
        else:
            # Runs only if the inner loop never hit break: a is in no other string.
            survive.append(a)
    return survive

Is there a faster way to do this?

How about:

candidates = ["Hello", "World", "HelloWorld", "Foo", "bar", "ar"]
result = [c for c in candidates if not any(c in o and len(o) > len(c) for o in candidates)]
print(result)

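Contrary to what was suggested in the comments, putting the length comparison first does not help. Timing the variants: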

from timeit import timeit


def filter_not_substring(candidates):
    survive = []
    for a in candidates:
        for b in candidates:
            if a == b:
                continue
            if a in b:
                break
        else:
            survive.append(a)
    return survive


def filter_not_substring2a(candidates):
    return [c for c in candidates if not any(len(o) > len(c) and c in o for o in candidates)]


def filter_not_substring2b(candidates):
    return [c for c in candidates if not any(c in o and len(o) > len(c) for o in candidates)]


xs = ["Hello", "World", "HelloWorld", "Foo", "bar", "ar", "bar"]
print(filter_not_substring(xs), filter_not_substring2a(xs), filter_not_substring2b(xs))
print(timeit(lambda: filter_not_substring(xs)))
print(timeit(lambda: filter_not_substring2a(xs)))
print(timeit(lambda: filter_not_substring2b(xs)))

Result:

['HelloWorld', 'Foo', 'bar', 'bar'] ['HelloWorld', 'Foo', 'bar', 'bar'] ['HelloWorld', 'Foo', 'bar', 'bar']
1.5163685
4.6516653
3.8334089999999996

So the OP's solution is much faster, and filter_not_substring2b is still about 20% faster than 2a. In other words, putting the len comparison first doesn't save any time.
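
A single timeit run can be noisy, so a gap of around 20% is worth confirming over several rounds. Here is a minimal sketch using timeit.repeat and taking the fastest round. The helper best_time and its number and rounds parameters are names I made up for illustration, not part of the benchmark above, and the numbers will differ per machine.

```python
from timeit import repeat


def filter_not_substring2a(candidates):
    # Length check first, substring check second.
    return [c for c in candidates if not any(len(o) > len(c) and c in o for o in candidates)]


def filter_not_substring2b(candidates):
    # Substring check first, length check second.
    return [c for c in candidates if not any(c in o and len(o) > len(c) for o in candidates)]


def best_time(func, data, number=10_000, rounds=5):
    # Hypothetical helper: take `rounds` independent measurements of
    # `number` calls each and keep the fastest, which is the one least
    # disturbed by other activity on the machine.
    return min(repeat(lambda: func(data), number=number, repeat=rounds))


xs = ["Hello", "World", "HelloWorld", "Foo", "bar", "ar", "bar"]
for f in (filter_not_substring2a, filter_not_substring2b):
    print(f.__name__, best_time(f, xs))
```

Taking the minimum rather than the mean is a common choice here, since slower rounds usually reflect interference rather than the code itself.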

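For any production scenario, the OP's function is probably optimal. One way to speed it up might be to move the whole problem into C, but I doubt that would show much of a gain, since the logic is already very simple and I'd expect Python to do a fairly good job of it too.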
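
If the list gets large, another option short of going to C is to cut down the number of comparisons. This is a different technique from anything above and I have not benchmarked it, so treat it as a sketch: a string can only be a proper substring of a strictly longer one, and it is enough to test against the strings that have already survived. The name filter_not_substring_sorted is my own.

```python
def filter_not_substring_sorted(candidates):
    # Hypothetical alternative, not timed: visit unique strings from longest
    # to shortest and compare each one only against the survivors so far.
    maximal = []
    for s in sorted(set(candidates), key=len, reverse=True):
        # Every survivor seen so far is at least as long as s and differs
        # from s, so "s in m" can only be true for a strictly longer m.
        if not any(s in m for m in maximal):
            maximal.append(s)
    keep = set(maximal)
    # Restore the original order and keep duplicates, like the OP's function.
    return [c for c in candidates if c in keep]


xs = ["Hello", "World", "HelloWorld", "Foo", "bar", "ar", "bar"]
print(filter_not_substring_sorted(xs))
```

For a list as small as the one in the question, the sorting and set building may well cost more than they save; the idea only pays off when many strings are eliminated by a few long ones.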

User @ming pointed out that the OP's solution can be improved a little:

def filter_not_substring_b(candidates):
    survive = []
    for a in candidates:
        for b in candidates:
            if a in b and a != b:
                break
        else:
            survive.append(a)
    return survive

This version of the function is a bit faster, about 10-15% for me.

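Finally, note that the version below performs only about as well as 2b: although it is very similar to @ming's optimised solution, it is almost 3x slower than theirs. It's not clear to me why that would be. If anyone has a reasonably certain idea about this, please share it in the comments: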

def filter_not_substring_c(candidates):
    return [a for a in candidates if all(a not in b or a == b for b in candidates)]
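
Since the variants differ only in how the condition is spelled, it may be worth confirming that they really do return the same result before comparing their speed. A proper substring is always strictly shorter than the string containing it, so "a in b and a != b" and "c in o and len(o) > len(c)" should select the same strings. A small randomised check, with the seed, alphabet and list sizes chosen arbitrarily by me:

```python
import random


def filter_not_substring(candidates):
    survive = []
    for a in candidates:
        for b in candidates:
            if a == b:
                continue
            if a in b:
                break
        else:
            survive.append(a)
    return survive


def filter_not_substring_b(candidates):
    survive = []
    for a in candidates:
        for b in candidates:
            if a in b and a != b:
                break
        else:
            survive.append(a)
    return survive


def filter_not_substring2b(candidates):
    return [c for c in candidates if not any(c in o and len(o) > len(c) for o in candidates)]


def filter_not_substring_c(candidates):
    return [a for a in candidates if all(a not in b or a == b for b in candidates)]


random.seed(0)  # arbitrary seed, only to make the run repeatable
variants = [filter_not_substring_b, filter_not_substring2b, filter_not_substring_c]
for _ in range(200):
    # Short strings over a two-letter alphabet give plenty of substring hits.
    xs = ["".join(random.choices("ab", k=random.randint(1, 4)))
          for _ in range(random.randint(0, 8))]
    expected = filter_not_substring(xs)
    assert all(f(xs) == expected for f in variants), xs
print("all variants agree")
```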