通过组合列表中的元组子集来获得满足位置标准的所有元组序列

Obtaining all sequences of tuples satisfying position criteria by combining subsets of tuples in a list

我有一个元组列表,每个元组都包含以下信息:

(start_position,end_position,list of strings)

示例列表如下:

aList = [(6, 9, ['ataH']), 
(4, 9, ['svataH']), 
(0, 9, ['vEvasvataH']), 
(2, 5, ['vasu', 'vasU']), 
(1, 3, ['Eva', 'eva']), 
(0, 1, ['vA', 'vE'])]

我需要找到所有元组序列,使得每个序列必须覆盖从 start_positionend_position 的所有位置,在本例中是从 09.在一个序列中,比如 a,相邻的元组需要满足 a[i+1][0] - a[i][1] <= 1.

的约束

总的来说,输出应该如下:

[[(0, 1, ['vA', 'vE']), (2,5,['vasu', 'vasU']), (6, 9, ['ataH'])  ],
 [(0, 1, ['vA', 'vE']), (1, 3, ['Eva', 'eva']), (4, 9, ['svataH'])],
 [(0, 9, ['vEvasvataH'], [7])]]

我用下面的代码实现了同样的效果。

maxVal = max(aList,key=lambda item:item[1])[1]
diffSets = list()

for i,item in enumerate(aList):
    if maxVal == item[1]:  
        _temp = [item]        
        currStart = item[0]
        for j,stuff in enumerate(aList):
            if i != j:
                if currStart == stuff[1] or currStart == stuff[1]+1:                    
                    _temp.append(stuff)
                    currStart = stuff[0]        
        diffSets.append(_temp) #the output needs to be reversed to get the sequence in the order as shown above

是否有更有效、更快速的方法来实现相同的目标,比如使用 itertools

以下是我的处理方式(并不是说它一定会更快)。首先,您可以根据开始和结束对数据进行排序。这意味着当我们稍后查看组合时,我们不必回溯到结果中的其他值(我们将知道 entry[i] 的开始必须小于或等于 entry[i+1] 的开始)

import itertools
import operator

data = [
    (6, 9, ['ataH']),
    (4, 9, ['svataH']),
    (0, 9, ['vEvasvataH']),
    (2, 5, ['vasu', 'vasU']),
    (1, 3, ['Eva', 'eva']),
    (0, 1, ['vA', 'vE'])
]

data = sorted(data, key=operator.itemgetter(0, 1))
start = min(data, key=operator.itemgetter(0))[0]
end = max(data, key=operator.itemgetter(1))[1]

现在我们对数据进行了排序,并且知道了我们的开始值和结束值。

要找到任意大小子集的所有数据组合,我们可以使用这个技巧:

def all_combinations(data):
    for i in range(len(data)):
        yield from itertools.combinations(data, r=i+1)

这里我们使用yield来避免创建昂贵的容器。现在我们编写另一个使用这些组合的方法,并检查每个组合的有效性:

def valid_combinations(data):
    for comb in all_combinations(data):
        if comb[0][0] != start or comb[-1][1] != end:
            continue
    
        for i in range(len(comb) - 1):
            if not 0 <= comb[i+1][0] - comb[i][1] <= 1:
                break
        else:
            yield comb

这里我使用了一个巧妙的 for 循环技巧。循环中的 else 块只有在 for 循环自然完成并且没有中断时才会执行,如果我们不中断,那么我们就知道每个项目都是有效的。

总计:

import itertools
import operator


def all_combinations(data):
    for i in range(len(data)):
        yield from itertools.combinations(data, r=i+1)


def valid_combinations(data):
    data = sorted(data, key=operator.itemgetter(0, 1))
    start = min(data, key=operator.itemgetter(0))[0]
    end = max(data, key=operator.itemgetter(1))[1]

    for comb in all_combinations(data):
        if comb[0][0] != start or comb[-1][1] != end:
            continue

        for i in range(len(comb) - 1):
            if not 0 <= comb[i+1][0] - comb[i][1] <= 1:
                break
        else:
            yield comb

获取结果:

from pprint import pprint
pprint(list(valid_combinations(
    [
        (6, 9, ['ataH']),
        (4, 9, ['svataH']),
        (0, 9, ['vEvasvataH']),
        (2, 5, ['vasu', 'vasU']),
        (1, 3, ['Eva', 'eva']),
        (0, 1, ['vA', 'vE'])
    ]
)))
[((0, 9, ['vEvasvataH']),),
 ((0, 1, ['vA', 'vE']), (1, 3, ['Eva', 'eva']), (4, 9, ['svataH'])),
 ((0, 1, ['vA', 'vE']), (2, 5, ['vasu', 'vasU']), (6, 9, ['ataH']))]

让我们假设您的 none 个节点允许完全重叠。您可以设置节点之间的连接图,并将其传递给类似 networkx 的东西,它实现了您感兴趣的可接受的最佳搜索算法。

您有一个 DAG,其连接由一个节点的末尾和下一个节点的开始之间的重叠表示。以下是填充图表的方法:

import networkx as nx

g = nx.DiGraph()
for i, n in enumerate(aList):
    g.add_node(i, start=n[0], end=n[1], data=n[2])

节点是aList中数据的索引,数据存储为属性。您可以在此处添加边。这是 O(n<sup>2</sup>),但如果您对数据进行适当的预排序,可以减少到 O(n log n)

for i, m in enumerate(aList):
    for j, n in enumerate(aList):
        if i != j and m[0] < n[0] and n[0] - m[1] <= 1 and m[1] < n[1]:
            g.add_edge(i, j)

现在您可以找到跨越整个图形的路径。首先确定起点:

from operator import itemgetter

minValue = min(aList, key=itemgetter(0))[0]
maxValue = max(aList, key=itemgetter(1))[1]

start = [i for i, n in enumerate(aList) if n[0] == minValue]
end = [i for i, n in enumerate(aList) if n[1] == maxValue]

可以使用 nx.all_simple_paths 找到路径,其中包括:

paths = []
for node in start:
    if node in end:
        paths.append([node])
    else:
        paths.extend(nx.all_simple_paths(g, node, end))

现在你有一个这样的列表:

[[2], [5, 3, 0], [5, 3, 1], [5, 4, 1], [5, 4, 3, 0], [5, 4, 3, 1]]

您可以使用结果中的索引对原始列表进行采样,或使用存储在图表中的元数据,具体取决于您的喜好。这是前一种方法:

result = [[aList[item] for item in path] for path in paths]

最终结果如下所示:

[[(0, 9, ['vEvasvataH'])],
 [(0, 1, ['vA', 'vE']), (2, 5, ['vasu', 'vasU']), (6, 9, ['ataH'])],
 [(0, 1, ['vA', 'vE']), (2, 5, ['vasu', 'vasU']), (4, 9, ['svataH'])],  # Do you want this (5 > 4)?
 [(0, 1, ['vA', 'vE']), (1, 3, ['Eva', 'eva']), (4, 9, ['svataH'])],
 [(0, 1, ['vA', 'vE']), (1, 3, ['Eva', 'eva']), (2, 5, ['vasu', 'vasU']), (6, 9, ['ataH'])], # Do you want this (3 > 2)?
 [(0, 1, ['vA', 'vE']), (1, 3, ['Eva', 'eva']), (2, 5, ['vasu', 'vasU']), (4, 9, ['svataH'])]] # Do you want this (3 > 2), (5 > 4)?

由于路径是有序的,因此很容易立即看出它们是如何跨越间隔的。

这是一个替代解决方案,它使用以 start_position 为关键字的字典来加快搜索速度。

import operator as op
import itertools as it
from collections import defaultdict

# These two accessor functions make it more legible to work with the tuples
start_position = op.itemgetter(0)
end_position = op.itemgetter(1)

def find_all_paths(list_of_tuples):
    last_position = end_position(max(list_of_tuples, key=end_position))

    tuples_dict = defaultdict(list)
    for t in list_of_tuples:
        tuples_dict[start_position(t)].append(t)

    def all_paths_starting_from(next_value):
        candidates = it.chain(tuples_dict[next_value], tuples_dict[next_value+1])

        paths = []
        for candidate in candidates:
            if end_position(candidate) == last_position:
                paths.append([candidate])
            else:
                for branch in all_paths_starting_from(end_position(candidate)):
                    paths.append([candidate] + branch)
        return paths
    return all_paths_starting_from(-1)

这个想法是,在一个大图中,最大的时间消耗者将搜索给定节点的下一个可能节点。 tuples_dict 将这个时间减少到本质上 O(1)。此外,递归的最大深度将为 len(tuples_dict).

如果输入集很大并且图形有很多分支和合并,all_paths_starting_from 函数可以被记忆以提高性能(但代码变得不太清晰)。记忆解决方案将是:

def find_all_paths(list_of_tuples):
    last_position = end_position(max(list_of_tuples, key=end_position))

    tuples_dict = defaultdict(list)
    for t in list_of_tuples:
        tuples_dict[start_position(t)].append(t)

    known_paths = {}  # Dict from next_value to the possible paths
    def all_paths_starting_from(next_value):
        if next_value in known_paths:
            return known_paths[next_value]
        candidates = it.chain(tuples_dict[next_value], tuples_dict[next_value+1])

        paths = []
        for candidate in candidates:
            if end_position(candidate) == last_position:
                paths.append([candidate])
            else:
                for branch in all_paths_starting_from(end_position(candidate)):
                    paths.append([candidate] + branch)
        known_paths[next_value] = paths
        return paths
    return all_paths_starting_from(-1)

这里是一个带有更多分支和合并(和死胡同)的图表示例:

test_input = [(0, 2, ['1']),
              (0, 3, ['2']),
              (0, 5, ['3']),
              (2, 7, ['4']),
              (3, 8, ['5']),
              (4, 9, ['6']),
              (8, 13, ['7']),
              (9, 14, ['8']),
              (10, 11, ['9']),
              (14, 15, ['10']),
              ]

from pprint import pprint
pprint(find_all_paths(test_input))

结果是:

[(0, 2, ['1']), (2, 7, ['4']), (8, 13, ['7']), (14, 15, ['10'])],
 [(0, 2, ['1']), (3, 8, ['5']), (8, 13, ['7']), (14, 15, ['10'])],
 [(0, 2, ['1']), (3, 8, ['5']), (9, 14, ['8']), (14, 15, ['10'])],
 [(0, 3, ['2']), (3, 8, ['5']), (8, 13, ['7']), (14, 15, ['10'])],
 [(0, 3, ['2']), (3, 8, ['5']), (9, 14, ['8']), (14, 15, ['10'])],
 [(0, 3, ['2']), (4, 9, ['6']), (9, 14, ['8']), (14, 15, ['10'])]]