在给定条件下查找列表中所有对的最佳方法是什么?
What is the best way to find all pairs in a list with a given condition?
假设我有一个列表 ['ab','cd','ac','de']
,我想找到所有可能的对,其中两个元素都没有任何重复字母。
答案是 ['ab','cd'],['ab','de'],['ac','de']
。
但我不知道 way/algorithm 最好的解决方法。目前我正在做的是使用嵌套的 for
循环来检查每个元素,然后将其删除以缩短列表,但是当我的列表长 1000 行并且元素更复杂时,它开始变得明显时间。
arr4 = ['ab','cd','ac','de']
for i in arr4:
for x in arr4:
if not bool(set(i)&set(x)):
print('['+i+','+x+']')
print('----')
arr4.remove(i)
我认为您可以跟踪每个数组元素的字母集。然后,存储反转对(例如,(x,i))不再访问条目如下:
if __name__ == '__main__':
arr4 = ['ab','cd','ac','de']
string_dict = {}
for a_str in arr4:
if a_str not in string_dict:
string_dict[a_str] = set()
for a_letter in a_str:
string_dict[a_str].add(a_letter)
print(string_dict)
visited = set()
for i in arr4:
for x in arr4:
if not (string_dict[i] & string_dict[x]):
cur_pair = (i, x)
if cur_pair in visited:
continue
print(cur_pair)
reverse_pair = (x, i)
visited.add(cur_pair)
visited.add(reverse_pair)
结果:
{'ab': {'a', 'b'}, 'cd': {'c', 'd'}, 'ac': {'a', 'c'}, 'de': {'d', 'e'}}
('ab', 'cd')
('ab', 'de')
('ac', 'de')
就基本算法而言,您的解决方案还不错,但您不应该像现在这样修改要迭代的列表。此外,还有一些优化可能:
from random import sample
from timeit import timeit
from itertools import combinations
size = 3 # length of the strings
chars = [chr(n) for n in range(97, 123)]
strings = {''.join(sample(chars, size)) for _ in range(1000)}
# your solution as a one-liner
def pair_strs1(ps):
return [(p, q) for p in ps for q in ps if not (set(p) & set(q))]
# this doesn't work, unless there's no initial pairs with double letters like 'aa'
def pair_strs2(ps):
return [(p, q) for p in ps for q in ps if len(set(p+q)) == 4]
# your solution, but compute the sets only once, then look them up before combining
def pair_strs3(ps):
ps = {p: set(p) for p in ps}
return [(p, q) for p in ps for q in ps if not (ps[p] & ps[q])]
# same, but avoiding mirrored pairs, only compute above the diagonal of the product
def pair_strs4(ps):
ps = {p: set(p) for p in ps}
keys = list(ps.keys())
return [(p, q) for i, p in enumerate(ps) for q in keys[i+1:] if not (ps[p] & ps[q])]
# the same again, but now avoiding the lookup in the dictionaries
def pair_strs5(ps):
ps = {p: set(p) for p in ps}
items = list(ps.items())
return [(p, q) for i, (p, p_set) in enumerate(ps.items()) for q, q_set in items[i+1:] if not (p_set & q_set)]
# different approach, creating the product (upper half above diagonal) first
def pair_strs6(ps):
ps = {p: set(p) for p in ps}
ps = combinations(ps.items(), 2)
return [(p, q) for (p, p_set), (q, q_set) in ps if not (p_set & q_set)]
# the same as 5, but with integer bitmasks instead of sets, as per @kellybundy's suggestion
def pair_strs7(ps):
ps = {p: sum(1 << (ord(c) - 97) for c in p) for p in ps}
items = list(ps.items())
return [(p, q) for i, (p, p_mask) in enumerate(ps.items()) for q, q_mask in items[i+1:] if not (p_mask & q_mask)]
# run some tests
n = 10
print(timeit(lambda: pair_strs1(strings), number=n))
print(timeit(lambda: pair_strs3(strings), number=n))
print(timeit(lambda: pair_strs4(strings), number=n))
print(timeit(lambda: pair_strs5(strings), number=n))
print(timeit(lambda: pair_strs6(strings), number=n))
print(timeit(lambda: pair_strs7(strings), number=n))
p1 = pair_strs1(strings)
p3 = pair_strs3(strings)
p4 = pair_strs4(strings)
p4 += [tuple(reversed(p)) for p in p4]
p5 = pair_strs5(strings)
p5 += [tuple(reversed(p)) for p in p5]
p6 = pair_strs6(strings)
p6 += [tuple(reversed(p)) for p in p6]
p7 = pair_strs7(strings)
p7 += [tuple(reversed(p)) for p in p7]
print(set(p1) == set(p3) == set(p4) == set(p5) == set(p6) == set(p7) and
len(p1) == len(p3) == len(p4) == len(p5) == len(p6) == len(p7))
我的结果:
3.4115294000002905
1.7107413000012457
0.8492871999987983
0.7485178000024462
0.7776397999987239
0.37149049999788986
True
因此,最后的更改并没有真正帮助 - 预先计算乘积并不能节省足够的钱。 pair_pairs5
似乎更可取。
编辑:将字符串的生成更改为包括一个大小(并将默认值设置为 3 而不是 2),相应地调整命名并避免字符串中的重复字符(根据 OP 的描述)。
这也使用户@KellyBundy 的解决方案成为最快的解决方案,因为没有重复项允许非常干净的 integer-based 解决方案,#7。在这些条件下,这是明显的赢家(大约 9 倍 speed-up)。
假设我有一个列表 ['ab','cd','ac','de']
,我想找到所有可能的对,其中两个元素都没有任何重复字母。
答案是 ['ab','cd'],['ab','de'],['ac','de']
。
但我不知道 way/algorithm 最好的解决方法。目前我正在做的是使用嵌套的 for
循环来检查每个元素,然后将其删除以缩短列表,但是当我的列表长 1000 行并且元素更复杂时,它开始变得明显时间。
arr4 = ['ab','cd','ac','de']
for i in arr4:
for x in arr4:
if not bool(set(i)&set(x)):
print('['+i+','+x+']')
print('----')
arr4.remove(i)
我认为您可以跟踪每个数组元素的字母集。然后,存储反转对(例如,(x,i))不再访问条目如下:
if __name__ == '__main__':
arr4 = ['ab','cd','ac','de']
string_dict = {}
for a_str in arr4:
if a_str not in string_dict:
string_dict[a_str] = set()
for a_letter in a_str:
string_dict[a_str].add(a_letter)
print(string_dict)
visited = set()
for i in arr4:
for x in arr4:
if not (string_dict[i] & string_dict[x]):
cur_pair = (i, x)
if cur_pair in visited:
continue
print(cur_pair)
reverse_pair = (x, i)
visited.add(cur_pair)
visited.add(reverse_pair)
结果:
{'ab': {'a', 'b'}, 'cd': {'c', 'd'}, 'ac': {'a', 'c'}, 'de': {'d', 'e'}}
('ab', 'cd')
('ab', 'de')
('ac', 'de')
就基本算法而言,您的解决方案还不错,但您不应该像现在这样修改要迭代的列表。此外,还有一些优化可能:
from random import sample
from timeit import timeit
from itertools import combinations
size = 3 # length of the strings
chars = [chr(n) for n in range(97, 123)]
strings = {''.join(sample(chars, size)) for _ in range(1000)}
# your solution as a one-liner
def pair_strs1(ps):
return [(p, q) for p in ps for q in ps if not (set(p) & set(q))]
# this doesn't work, unless there's no initial pairs with double letters like 'aa'
def pair_strs2(ps):
return [(p, q) for p in ps for q in ps if len(set(p+q)) == 4]
# your solution, but compute the sets only once, then look them up before combining
def pair_strs3(ps):
ps = {p: set(p) for p in ps}
return [(p, q) for p in ps for q in ps if not (ps[p] & ps[q])]
# same, but avoiding mirrored pairs, only compute above the diagonal of the product
def pair_strs4(ps):
ps = {p: set(p) for p in ps}
keys = list(ps.keys())
return [(p, q) for i, p in enumerate(ps) for q in keys[i+1:] if not (ps[p] & ps[q])]
# the same again, but now avoiding the lookup in the dictionaries
def pair_strs5(ps):
ps = {p: set(p) for p in ps}
items = list(ps.items())
return [(p, q) for i, (p, p_set) in enumerate(ps.items()) for q, q_set in items[i+1:] if not (p_set & q_set)]
# different approach, creating the product (upper half above diagonal) first
def pair_strs6(ps):
ps = {p: set(p) for p in ps}
ps = combinations(ps.items(), 2)
return [(p, q) for (p, p_set), (q, q_set) in ps if not (p_set & q_set)]
# the same as 5, but with integer bitmasks instead of sets, as per @kellybundy's suggestion
def pair_strs7(ps):
ps = {p: sum(1 << (ord(c) - 97) for c in p) for p in ps}
items = list(ps.items())
return [(p, q) for i, (p, p_mask) in enumerate(ps.items()) for q, q_mask in items[i+1:] if not (p_mask & q_mask)]
# run some tests
n = 10
print(timeit(lambda: pair_strs1(strings), number=n))
print(timeit(lambda: pair_strs3(strings), number=n))
print(timeit(lambda: pair_strs4(strings), number=n))
print(timeit(lambda: pair_strs5(strings), number=n))
print(timeit(lambda: pair_strs6(strings), number=n))
print(timeit(lambda: pair_strs7(strings), number=n))
p1 = pair_strs1(strings)
p3 = pair_strs3(strings)
p4 = pair_strs4(strings)
p4 += [tuple(reversed(p)) for p in p4]
p5 = pair_strs5(strings)
p5 += [tuple(reversed(p)) for p in p5]
p6 = pair_strs6(strings)
p6 += [tuple(reversed(p)) for p in p6]
p7 = pair_strs7(strings)
p7 += [tuple(reversed(p)) for p in p7]
print(set(p1) == set(p3) == set(p4) == set(p5) == set(p6) == set(p7) and
len(p1) == len(p3) == len(p4) == len(p5) == len(p6) == len(p7))
我的结果:
3.4115294000002905
1.7107413000012457
0.8492871999987983
0.7485178000024462
0.7776397999987239
0.37149049999788986
True
因此,最后的更改并没有真正帮助 - 预先计算乘积并不能节省足够的钱。 pair_pairs5
似乎更可取。
编辑:将字符串的生成更改为包括一个大小(并将默认值设置为 3 而不是 2),相应地调整命名并避免字符串中的重复字符(根据 OP 的描述)。
这也使用户@KellyBundy 的解决方案成为最快的解决方案,因为没有重复项允许非常干净的 integer-based 解决方案,#7。在这些条件下,这是明显的赢家(大约 9 倍 speed-up)。