Is there a Pythonic way of filtering substrings of strings in a list?
I have a list containing the following strings:
candidates = ["Hello", "World", "HelloWorld", "Foo", "bar", "ar"]
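and I want the list filtered down to ["HelloWorld", "Foo", "bar"], because the others are substrings of another entry. I can do it like this, but I don't think it's fast or elegant.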
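def filter_not_substring(candidates):
    survive = []
    for a in candidates:
        for b in candidates:
            if a == b:
                continue  # skip a itself (and any exact duplicates)
            if a in b:
                break  # a is a substring of another string, so drop it
        else:
            # the inner loop finished without a break: a is nobody's substring
            survive.append(a)
    return survive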
Is there a faster way?
How about:
candidates = ["Hello", "World", "HelloWorld", "Foo", "bar", "ar"]
result = [c for c in candidates if not any(c in o and len(o) > len(c) for o in candidates)]
print(result)
Contrary to what was suggested in the comments:
from timeit import timeit


def filter_not_substring(candidates):
    survive = []
    for a in candidates:
        for b in candidates:
            if a == b:
                continue
            if a in b:
                break
        else:
            survive.append(a)
    return survive


def filter_not_substring2a(candidates):
    return [c for c in candidates if not any(len(o) > len(c) and c in o for o in candidates)]


def filter_not_substring2b(candidates):
    return [c for c in candidates if not any(c in o and len(o) > len(c) for o in candidates)]


xs = ["Hello", "World", "HelloWorld", "Foo", "bar", "ar", "bar"]
print(filter_not_substring(xs), filter_not_substring2a(xs), filter_not_substring2b(xs))
print(timeit(lambda: filter_not_substring(xs)))
print(timeit(lambda: filter_not_substring2a(xs)))
print(timeit(lambda: filter_not_substring2b(xs)))
Result:
['HelloWorld', 'Foo', 'bar', 'bar'] ['HelloWorld', 'Foo', 'bar', 'bar'] ['HelloWorld', 'Foo', 'bar', 'bar']
1.5163685
4.6516653
3.8334089999999996
So the OP's solution is considerably faster, but filter_not_substring2b is still about 20% faster than 2a. Putting the len comparison first therefore doesn't save time.
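One possible way to read this result, offered as a sketch rather than a confirmed explanation: `and` always evaluates its left operand and evaluates the right one only when the left is true. The snippet below counts, over all ordered pairs in `xs` and ignoring the early exit of `any()`, how often each ordering gets as far as its second test. The names `pairs`, `reached_in_2a` and `reached_in_2b` are made up for this illustration.

```python
xs = ["Hello", "World", "HelloWorld", "Foo", "bar", "ar", "bar"]

# Every ordered pair (c, o) that the comprehensions could examine.
pairs = [(c, o) for c in xs for o in xs]

# 2a order: the length test runs on every pair; `c in o` runs only
# for the pairs where o is strictly longer than c.
reached_in_2a = sum(1 for c, o in pairs if len(o) > len(c))

# 2b order: `c in o` runs on every pair; the length test runs only
# for the pairs where c really is a substring of o.
reached_in_2b = sum(1 for c, o in pairs if c in o)

print(len(pairs), reached_in_2a, reached_in_2b)
```

In both orderings the second test is reached for only a minority of pairs, which suggests the cost is dominated by whichever test comes first: two `len` calls plus a comparison in 2a, a single `in` test in 2b.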
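For any production scenario, OP's function is probably optimal. One way to speed it up might be to move the whole problem into C, but I doubt that would show much of a gain, since the logic is already very simple and I'd expect Python to do a reasonably good job of it as well.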
User @ming pointed out that OP's solution can be improved a bit:
def filter_not_substring_b(candidates):
    survive = []
    for a in candidates:
        for b in candidates:
            if a in b and a != b:
                break
        else:
            survive.append(a)
    return survive
This version of the function is somewhat faster, about 10-15% for me.
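A difference of 10-15% is small enough to be blurred by machine noise, so if you want to check it on your own machine, something like the following may help. It is only a sketch: `best_time`, `number=20_000` and `rounds=5` are arbitrary choices of mine, and taking the minimum over several `timeit.repeat` runs is the usual way to reduce interference from other processes.

```python
from timeit import repeat


def filter_not_substring(candidates):
    survive = []
    for a in candidates:
        for b in candidates:
            if a == b:
                continue
            if a in b:
                break
        else:
            survive.append(a)
    return survive


def filter_not_substring_b(candidates):
    survive = []
    for a in candidates:
        for b in candidates:
            if a in b and a != b:
                break
        else:
            survive.append(a)
    return survive


xs = ["Hello", "World", "HelloWorld", "Foo", "bar", "ar", "bar"]


def best_time(func, number=20_000, rounds=5):
    # Run `number` calls per round and keep the fastest round.
    return min(repeat(lambda: func(xs), number=number, repeat=rounds))


t_op = best_time(filter_not_substring)
t_b = best_time(filter_not_substring_b)
print(f"OP: {t_op:.4f}s  @ming: {t_b:.4f}s  ratio: {t_b / t_op:.2f}")
```

The absolute numbers will differ from the ones above, since they depend on the machine, the Python version and the number of calls.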
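Finally, note that the following variant is only slightly faster than 2b: although it looks very similar to @ming's optimized solution, it is almost 3 times slower than theirs. It's unclear to me why that is. If anyone has a fairly certain idea, please share it in the comments: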
def filter_not_substring_c(candidates):
    return [a for a in candidates if all(a not in b or a == b for b in candidates)]
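All of the versions above compare every string with every other string. For much longer lists, a different approach that might be worth trying (not benchmarked here, and not something from the question or the comments) is to process the distinct strings longest first and test each one only against the strings already kept. This works because a string can only be a proper substring of a strictly longer one, and anything contained in a dropped string is also contained in whichever kept string caused the drop. `filter_not_substring_sorted` is a name I made up for this sketch.

```python
def filter_not_substring_sorted(candidates):
    kept = []
    # set() removes exact duplicates; sorting longest-first guarantees that
    # every possible container of `a` has been seen before `a` itself.
    for a in sorted(set(candidates), key=len, reverse=True):
        if not any(a in k for k in kept):
            kept.append(a)
    kept_set = set(kept)
    # Rebuild the answer in the original order, duplicates included,
    # to match what the nested-loop versions return.
    return [c for c in candidates if c in kept_set]


xs = ["Hello", "World", "HelloWorld", "Foo", "bar", "ar", "bar"]
print(filter_not_substring_sorted(xs))
```

On a list as short as `xs` the sorting and set building will probably cost more than they save, so treat this as an option for large inputs where many entries get eliminated.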