通过组合列表中的元组子集来获得满足位置标准的所有元组序列
Obtaining all sequences of tuples satisfying position criteria by combining subsets of tuples in a list
我有一个元组列表,每个元组都包含以下信息:
(start_position,end_position,list of strings)
示例列表如下:
aList = [(6, 9, ['ataH']),
(4, 9, ['svataH']),
(0, 9, ['vEvasvataH']),
(2, 5, ['vasu', 'vasU']),
(1, 3, ['Eva', 'eva']),
(0, 1, ['vA', 'vE'])]
我需要找到所有元组序列,使得每个序列必须覆盖从 start_position
到 end_position
的所有位置,在本例中是从 0
到 9
.在一个序列中,比如 a
,相邻的元组需要满足 a[i+1][0] - a[i][1] <= 1
.
的约束
总的来说,输出应该如下:
[[(0, 1, ['vA', 'vE']), (2,5,['vasu', 'vasU']), (6, 9, ['ataH']) ],
[(0, 1, ['vA', 'vE']), (1, 3, ['Eva', 'eva']), (4, 9, ['svataH'])],
[(0, 9, ['vEvasvataH'], [7])]]
我用下面的代码实现了同样的效果。
maxVal = max(aList,key=lambda item:item[1])[1]
diffSets = list()
for i,item in enumerate(aList):
if maxVal == item[1]:
_temp = [item]
currStart = item[0]
for j,stuff in enumerate(aList):
if i != j:
if currStart == stuff[1] or currStart == stuff[1]+1:
_temp.append(stuff)
currStart = stuff[0]
diffSets.append(_temp) #the output needs to be reversed to get the sequence in the order as shown above
是否有更有效、更快速的方法来实现相同的目标,比如使用 itertools
?
以下是我的处理方式(并不是说它一定会更快)。首先,您可以根据开始和结束对数据进行排序。这意味着当我们稍后查看组合时,我们不必回溯到结果中的其他值(我们将知道 entry[i]
的开始必须小于或等于 entry[i+1]
的开始)
import itertools
import operator
data = [
(6, 9, ['ataH']),
(4, 9, ['svataH']),
(0, 9, ['vEvasvataH']),
(2, 5, ['vasu', 'vasU']),
(1, 3, ['Eva', 'eva']),
(0, 1, ['vA', 'vE'])
]
data = sorted(data, key=operator.itemgetter(0, 1))
start = min(data, key=operator.itemgetter(0))[0]
end = max(data, key=operator.itemgetter(1))[1]
现在我们对数据进行了排序,并且知道了我们的开始值和结束值。
要找到任意大小子集的所有数据组合,我们可以使用这个技巧:
def all_combinations(data):
for i in range(len(data)):
yield from itertools.combinations(data, r=i+1)
这里我们使用yield
来避免创建昂贵的容器。现在我们编写另一个使用这些组合的方法,并检查每个组合的有效性:
def valid_combinations(data):
for comb in all_combinations(data):
if comb[0][0] != start or comb[-1][1] != end:
continue
for i in range(len(comb) - 1):
if not 0 <= comb[i+1][0] - comb[i][1] <= 1:
break
else:
yield comb
这里我使用了一个巧妙的 for 循环技巧。循环中的 else
块只有在 for 循环自然完成并且没有中断时才会执行,如果我们不中断,那么我们就知道每个项目都是有效的。
总计:
import itertools
import operator
def all_combinations(data):
for i in range(len(data)):
yield from itertools.combinations(data, r=i+1)
def valid_combinations(data):
data = sorted(data, key=operator.itemgetter(0, 1))
start = min(data, key=operator.itemgetter(0))[0]
end = max(data, key=operator.itemgetter(1))[1]
for comb in all_combinations(data):
if comb[0][0] != start or comb[-1][1] != end:
continue
for i in range(len(comb) - 1):
if not 0 <= comb[i+1][0] - comb[i][1] <= 1:
break
else:
yield comb
获取结果:
from pprint import pprint
pprint(list(valid_combinations(
[
(6, 9, ['ataH']),
(4, 9, ['svataH']),
(0, 9, ['vEvasvataH']),
(2, 5, ['vasu', 'vasU']),
(1, 3, ['Eva', 'eva']),
(0, 1, ['vA', 'vE'])
]
)))
[((0, 9, ['vEvasvataH']),),
((0, 1, ['vA', 'vE']), (1, 3, ['Eva', 'eva']), (4, 9, ['svataH'])),
((0, 1, ['vA', 'vE']), (2, 5, ['vasu', 'vasU']), (6, 9, ['ataH']))]
让我们假设您的 none 个节点允许完全重叠。您可以设置节点之间的连接图,并将其传递给类似 networkx 的东西,它实现了您感兴趣的可接受的最佳搜索算法。
您有一个 DAG,其连接由一个节点的末尾和下一个节点的开始之间的重叠表示。以下是填充图表的方法:
import networkx as nx
g = nx.DiGraph()
for i, n in enumerate(aList):
g.add_node(i, start=n[0], end=n[1], data=n[2])
节点是aList
中数据的索引,数据存储为属性。您可以在此处添加边。这是 O(n<sup>2</sup>)
,但如果您对数据进行适当的预排序,可以减少到 O(n log n)
:
for i, m in enumerate(aList):
for j, n in enumerate(aList):
if i != j and m[0] < n[0] and n[0] - m[1] <= 1 and m[1] < n[1]:
g.add_edge(i, j)
现在您可以找到跨越整个图形的路径。首先确定起点:
from operator import itemgetter
minValue = min(aList, key=itemgetter(0))[0]
maxValue = max(aList, key=itemgetter(1))[1]
start = [i for i, n in enumerate(aList) if n[0] == minValue]
end = [i for i, n in enumerate(aList) if n[1] == maxValue]
可以使用 nx.all_simple_paths
找到路径,其中包括:
paths = []
for node in start:
if node in end:
paths.append([node])
else:
paths.extend(nx.all_simple_paths(g, node, end))
现在你有一个这样的列表:
[[2], [5, 3, 0], [5, 3, 1], [5, 4, 1], [5, 4, 3, 0], [5, 4, 3, 1]]
您可以使用结果中的索引对原始列表进行采样,或使用存储在图表中的元数据,具体取决于您的喜好。这是前一种方法:
result = [[aList[item] for item in path] for path in paths]
最终结果如下所示:
[[(0, 9, ['vEvasvataH'])],
[(0, 1, ['vA', 'vE']), (2, 5, ['vasu', 'vasU']), (6, 9, ['ataH'])],
[(0, 1, ['vA', 'vE']), (2, 5, ['vasu', 'vasU']), (4, 9, ['svataH'])], # Do you want this (5 > 4)?
[(0, 1, ['vA', 'vE']), (1, 3, ['Eva', 'eva']), (4, 9, ['svataH'])],
[(0, 1, ['vA', 'vE']), (1, 3, ['Eva', 'eva']), (2, 5, ['vasu', 'vasU']), (6, 9, ['ataH'])], # Do you want this (3 > 2)?
[(0, 1, ['vA', 'vE']), (1, 3, ['Eva', 'eva']), (2, 5, ['vasu', 'vasU']), (4, 9, ['svataH'])]] # Do you want this (3 > 2), (5 > 4)?
由于路径是有序的,因此很容易立即看出它们是如何跨越间隔的。
这是一个替代解决方案,它使用以 start_position
为关键字的字典来加快搜索速度。
import operator as op
import itertools as it
from collections import defaultdict
# These two accessor functions make it more legible to work with the tuples
start_position = op.itemgetter(0)
end_position = op.itemgetter(1)
def find_all_paths(list_of_tuples):
last_position = end_position(max(list_of_tuples, key=end_position))
tuples_dict = defaultdict(list)
for t in list_of_tuples:
tuples_dict[start_position(t)].append(t)
def all_paths_starting_from(next_value):
candidates = it.chain(tuples_dict[next_value], tuples_dict[next_value+1])
paths = []
for candidate in candidates:
if end_position(candidate) == last_position:
paths.append([candidate])
else:
for branch in all_paths_starting_from(end_position(candidate)):
paths.append([candidate] + branch)
return paths
return all_paths_starting_from(-1)
这个想法是,在一个大图中,最大的时间消耗者将搜索给定节点的下一个可能节点。 tuples_dict
将这个时间减少到本质上 O(1)。此外,递归的最大深度将为 len(tuples_dict)
.
如果输入集很大并且图形有很多分支和合并,all_paths_starting_from
函数可以被记忆以提高性能(但代码变得不太清晰)。记忆解决方案将是:
def find_all_paths(list_of_tuples):
last_position = end_position(max(list_of_tuples, key=end_position))
tuples_dict = defaultdict(list)
for t in list_of_tuples:
tuples_dict[start_position(t)].append(t)
known_paths = {} # Dict from next_value to the possible paths
def all_paths_starting_from(next_value):
if next_value in known_paths:
return known_paths[next_value]
candidates = it.chain(tuples_dict[next_value], tuples_dict[next_value+1])
paths = []
for candidate in candidates:
if end_position(candidate) == last_position:
paths.append([candidate])
else:
for branch in all_paths_starting_from(end_position(candidate)):
paths.append([candidate] + branch)
known_paths[next_value] = paths
return paths
return all_paths_starting_from(-1)
这里是一个带有更多分支和合并(和死胡同)的图表示例:
test_input = [(0, 2, ['1']),
(0, 3, ['2']),
(0, 5, ['3']),
(2, 7, ['4']),
(3, 8, ['5']),
(4, 9, ['6']),
(8, 13, ['7']),
(9, 14, ['8']),
(10, 11, ['9']),
(14, 15, ['10']),
]
from pprint import pprint
pprint(find_all_paths(test_input))
结果是:
[(0, 2, ['1']), (2, 7, ['4']), (8, 13, ['7']), (14, 15, ['10'])],
[(0, 2, ['1']), (3, 8, ['5']), (8, 13, ['7']), (14, 15, ['10'])],
[(0, 2, ['1']), (3, 8, ['5']), (9, 14, ['8']), (14, 15, ['10'])],
[(0, 3, ['2']), (3, 8, ['5']), (8, 13, ['7']), (14, 15, ['10'])],
[(0, 3, ['2']), (3, 8, ['5']), (9, 14, ['8']), (14, 15, ['10'])],
[(0, 3, ['2']), (4, 9, ['6']), (9, 14, ['8']), (14, 15, ['10'])]]
我有一个元组列表,每个元组都包含以下信息:
(start_position,end_position,list of strings)
示例列表如下:
aList = [(6, 9, ['ataH']),
(4, 9, ['svataH']),
(0, 9, ['vEvasvataH']),
(2, 5, ['vasu', 'vasU']),
(1, 3, ['Eva', 'eva']),
(0, 1, ['vA', 'vE'])]
我需要找到所有元组序列,使得每个序列必须覆盖从 start_position
到 end_position
的所有位置,在本例中是从 0
到 9
.在一个序列中,比如 a
,相邻的元组需要满足 a[i+1][0] - a[i][1] <= 1
.
总的来说,输出应该如下:
[[(0, 1, ['vA', 'vE']), (2,5,['vasu', 'vasU']), (6, 9, ['ataH']) ],
[(0, 1, ['vA', 'vE']), (1, 3, ['Eva', 'eva']), (4, 9, ['svataH'])],
[(0, 9, ['vEvasvataH'], [7])]]
我用下面的代码实现了同样的效果。
maxVal = max(aList,key=lambda item:item[1])[1]
diffSets = list()
for i,item in enumerate(aList):
if maxVal == item[1]:
_temp = [item]
currStart = item[0]
for j,stuff in enumerate(aList):
if i != j:
if currStart == stuff[1] or currStart == stuff[1]+1:
_temp.append(stuff)
currStart = stuff[0]
diffSets.append(_temp) #the output needs to be reversed to get the sequence in the order as shown above
是否有更有效、更快速的方法来实现相同的目标,比如使用 itertools
?
以下是我的处理方式(并不是说它一定会更快)。首先,您可以根据开始和结束对数据进行排序。这意味着当我们稍后查看组合时,我们不必回溯到结果中的其他值(我们将知道 entry[i]
的开始必须小于或等于 entry[i+1]
的开始)
import itertools
import operator
data = [
(6, 9, ['ataH']),
(4, 9, ['svataH']),
(0, 9, ['vEvasvataH']),
(2, 5, ['vasu', 'vasU']),
(1, 3, ['Eva', 'eva']),
(0, 1, ['vA', 'vE'])
]
data = sorted(data, key=operator.itemgetter(0, 1))
start = min(data, key=operator.itemgetter(0))[0]
end = max(data, key=operator.itemgetter(1))[1]
现在我们对数据进行了排序,并且知道了我们的开始值和结束值。
要找到任意大小子集的所有数据组合,我们可以使用这个技巧:
def all_combinations(data):
for i in range(len(data)):
yield from itertools.combinations(data, r=i+1)
这里我们使用yield
来避免创建昂贵的容器。现在我们编写另一个使用这些组合的方法,并检查每个组合的有效性:
def valid_combinations(data):
for comb in all_combinations(data):
if comb[0][0] != start or comb[-1][1] != end:
continue
for i in range(len(comb) - 1):
if not 0 <= comb[i+1][0] - comb[i][1] <= 1:
break
else:
yield comb
这里我使用了一个巧妙的 for 循环技巧。循环中的 else
块只有在 for 循环自然完成并且没有中断时才会执行,如果我们不中断,那么我们就知道每个项目都是有效的。
总计:
import itertools
import operator
def all_combinations(data):
for i in range(len(data)):
yield from itertools.combinations(data, r=i+1)
def valid_combinations(data):
data = sorted(data, key=operator.itemgetter(0, 1))
start = min(data, key=operator.itemgetter(0))[0]
end = max(data, key=operator.itemgetter(1))[1]
for comb in all_combinations(data):
if comb[0][0] != start or comb[-1][1] != end:
continue
for i in range(len(comb) - 1):
if not 0 <= comb[i+1][0] - comb[i][1] <= 1:
break
else:
yield comb
获取结果:
from pprint import pprint
pprint(list(valid_combinations(
[
(6, 9, ['ataH']),
(4, 9, ['svataH']),
(0, 9, ['vEvasvataH']),
(2, 5, ['vasu', 'vasU']),
(1, 3, ['Eva', 'eva']),
(0, 1, ['vA', 'vE'])
]
)))
[((0, 9, ['vEvasvataH']),),
((0, 1, ['vA', 'vE']), (1, 3, ['Eva', 'eva']), (4, 9, ['svataH'])),
((0, 1, ['vA', 'vE']), (2, 5, ['vasu', 'vasU']), (6, 9, ['ataH']))]
让我们假设您的 none 个节点允许完全重叠。您可以设置节点之间的连接图,并将其传递给类似 networkx 的东西,它实现了您感兴趣的可接受的最佳搜索算法。
您有一个 DAG,其连接由一个节点的末尾和下一个节点的开始之间的重叠表示。以下是填充图表的方法:
import networkx as nx
g = nx.DiGraph()
for i, n in enumerate(aList):
g.add_node(i, start=n[0], end=n[1], data=n[2])
节点是aList
中数据的索引,数据存储为属性。您可以在此处添加边。这是 O(n<sup>2</sup>)
,但如果您对数据进行适当的预排序,可以减少到 O(n log n)
:
for i, m in enumerate(aList):
for j, n in enumerate(aList):
if i != j and m[0] < n[0] and n[0] - m[1] <= 1 and m[1] < n[1]:
g.add_edge(i, j)
现在您可以找到跨越整个图形的路径。首先确定起点:
from operator import itemgetter
minValue = min(aList, key=itemgetter(0))[0]
maxValue = max(aList, key=itemgetter(1))[1]
start = [i for i, n in enumerate(aList) if n[0] == minValue]
end = [i for i, n in enumerate(aList) if n[1] == maxValue]
可以使用 nx.all_simple_paths
找到路径,其中包括:
paths = []
for node in start:
if node in end:
paths.append([node])
else:
paths.extend(nx.all_simple_paths(g, node, end))
现在你有一个这样的列表:
[[2], [5, 3, 0], [5, 3, 1], [5, 4, 1], [5, 4, 3, 0], [5, 4, 3, 1]]
您可以使用结果中的索引对原始列表进行采样,或使用存储在图表中的元数据,具体取决于您的喜好。这是前一种方法:
result = [[aList[item] for item in path] for path in paths]
最终结果如下所示:
[[(0, 9, ['vEvasvataH'])],
[(0, 1, ['vA', 'vE']), (2, 5, ['vasu', 'vasU']), (6, 9, ['ataH'])],
[(0, 1, ['vA', 'vE']), (2, 5, ['vasu', 'vasU']), (4, 9, ['svataH'])], # Do you want this (5 > 4)?
[(0, 1, ['vA', 'vE']), (1, 3, ['Eva', 'eva']), (4, 9, ['svataH'])],
[(0, 1, ['vA', 'vE']), (1, 3, ['Eva', 'eva']), (2, 5, ['vasu', 'vasU']), (6, 9, ['ataH'])], # Do you want this (3 > 2)?
[(0, 1, ['vA', 'vE']), (1, 3, ['Eva', 'eva']), (2, 5, ['vasu', 'vasU']), (4, 9, ['svataH'])]] # Do you want this (3 > 2), (5 > 4)?
由于路径是有序的,因此很容易立即看出它们是如何跨越间隔的。
这是一个替代解决方案,它使用以 start_position
为关键字的字典来加快搜索速度。
import operator as op
import itertools as it
from collections import defaultdict
# These two accessor functions make it more legible to work with the tuples
start_position = op.itemgetter(0)
end_position = op.itemgetter(1)
def find_all_paths(list_of_tuples):
last_position = end_position(max(list_of_tuples, key=end_position))
tuples_dict = defaultdict(list)
for t in list_of_tuples:
tuples_dict[start_position(t)].append(t)
def all_paths_starting_from(next_value):
candidates = it.chain(tuples_dict[next_value], tuples_dict[next_value+1])
paths = []
for candidate in candidates:
if end_position(candidate) == last_position:
paths.append([candidate])
else:
for branch in all_paths_starting_from(end_position(candidate)):
paths.append([candidate] + branch)
return paths
return all_paths_starting_from(-1)
这个想法是,在一个大图中,最大的时间消耗者将搜索给定节点的下一个可能节点。 tuples_dict
将这个时间减少到本质上 O(1)。此外,递归的最大深度将为 len(tuples_dict)
.
如果输入集很大并且图形有很多分支和合并,all_paths_starting_from
函数可以被记忆以提高性能(但代码变得不太清晰)。记忆解决方案将是:
def find_all_paths(list_of_tuples):
last_position = end_position(max(list_of_tuples, key=end_position))
tuples_dict = defaultdict(list)
for t in list_of_tuples:
tuples_dict[start_position(t)].append(t)
known_paths = {} # Dict from next_value to the possible paths
def all_paths_starting_from(next_value):
if next_value in known_paths:
return known_paths[next_value]
candidates = it.chain(tuples_dict[next_value], tuples_dict[next_value+1])
paths = []
for candidate in candidates:
if end_position(candidate) == last_position:
paths.append([candidate])
else:
for branch in all_paths_starting_from(end_position(candidate)):
paths.append([candidate] + branch)
known_paths[next_value] = paths
return paths
return all_paths_starting_from(-1)
这里是一个带有更多分支和合并(和死胡同)的图表示例:
test_input = [(0, 2, ['1']),
(0, 3, ['2']),
(0, 5, ['3']),
(2, 7, ['4']),
(3, 8, ['5']),
(4, 9, ['6']),
(8, 13, ['7']),
(9, 14, ['8']),
(10, 11, ['9']),
(14, 15, ['10']),
]
from pprint import pprint
pprint(find_all_paths(test_input))
结果是:
[(0, 2, ['1']), (2, 7, ['4']), (8, 13, ['7']), (14, 15, ['10'])],
[(0, 2, ['1']), (3, 8, ['5']), (8, 13, ['7']), (14, 15, ['10'])],
[(0, 2, ['1']), (3, 8, ['5']), (9, 14, ['8']), (14, 15, ['10'])],
[(0, 3, ['2']), (3, 8, ['5']), (8, 13, ['7']), (14, 15, ['10'])],
[(0, 3, ['2']), (3, 8, ['5']), (9, 14, ['8']), (14, 15, ['10'])],
[(0, 3, ['2']), (4, 9, ['6']), (9, 14, ['8']), (14, 15, ['10'])]]