Finding the longest most frequent subset in a list of lists

I need an efficient Python algorithm to produce the following results:

Example 1:

l = [[1, 2, 3, 4], [2, 1, 4], [3, 1], [4, 1, 2]]

r = [[1, 2, 4], [3]]

Example 2:

l = [[1, 2, 3, 4], [2, 1], [3, 1], [4, 1]]

r = [[1, 2], [3], [4]]

Example 3:

l = [[1], [2, 3, 4], [3, 2, 5, 6], [4, 2], [5, 3], [6, 3]]

r = [[2, 3], [1], [4], [5], [6]]

In these examples, l is the input list of lists and r is the expected result.

I need an efficient way to group matching subsets using 3 criteria: the length of the subset, the frequency of its sub-elements, and the fact that a sub-element can belong to only one result subset.

In the first example, the sub-element [1] would be the most frequent subset, but the fact that the longer subset [1, 2, 4] is found in 3 of the 4 elements is more important: length beats frequency.

In the second example, a grouping such as [1, 3] has the same length and frequency as [1, 2], but 1 is "taken" by the first subset found.
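To make the trade-off concrete, here is a tiny illustrative helper (hypothetical, not part of my code) that counts in how many input lists a candidate subset appears, using example 1:

def frequency(subset, lists):
    # in how many of the lists does every element of `subset` appear?
    return sum(set(subset) <= set(lst) for lst in lists)

l = [[1, 2, 3, 4], [2, 1, 4], [3, 1], [4, 1, 2]]
print(frequency([1], l))        # 4: most frequent, but only length 1
print(frequency([1, 2, 4], l))  # 3: less frequent, but longer, so it wins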

Later edit:

What I have done so far is completely inefficient, because the number of permutations grows exponentially with the size of my initial dictionary, so I am looking for a new idea, a new approach.

This is my code so far:


from itertools import chain, permutations

def group_matches(my_dict, matched=None):
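    # brute force: for every permutation of the keys, build a boolean
    # membership matrix and keep the largest diagonal squares of mutual
    # membership as matches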

    def update_my_dict(my_dict, matched):
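        # drop the already-matched keys and strip them from the remaining values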
        ret_val = {}
        for k, v in my_dict.items():
            if k not in matched:
                for unique_ind in matched:
                    if unique_ind in v:
                        v.remove(unique_ind)
                ret_val[k] = v
        return ret_val

    def get_matches(unique_ind_permutation, my_dict):
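        # build a boolean membership matrix for this key order, then find the
        # largest all-True squares with the classic maximal-square DP; only
        # squares on the main diagonal (rows == columns) count as matches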
        def create_matrix(unique_ind_permutation, my_dict):
            matrix = []
            for k in unique_ind_permutation:
                r = [True if f in my_dict[k] else False
                     for f in unique_ind_permutation]
                matrix += [r]
            return matrix
        matrix = create_matrix(unique_ind_permutation, my_dict)
        dp = [[0] * len(matrix) for _ in range(len(matrix))]
        max_squares = [(0, None, None)]
        for ri, r in enumerate(matrix):
            for ci, c in enumerate(r):
                dp[ri][ci] = 0 if not c \
                    else (1 if ri == 0 or ci == 0
                    else min(dp[ri - 1][ci], dp[ri][ci - 1], dp[ri - 1][ci - 1]) + 1)
                max_squares = [(dp[ri][ci], ri, ci)] if dp[ri][ci] > max_squares[0][0] \
                    else (max_squares + [(dp[ri][ci], ri, ci)] if dp[ri][ci] == max_squares[0][0]
                    else max_squares)
        matches = []
        if max_squares[0][0] != 0:
            for max_square in max_squares:
                rows = [r for r in range(max_square[1]+1-max_square[0],max_square[1]+1)]
                columns = [c for c in range(max_square[2]+1-max_square[0],max_square[2]+1)]
                if rows == columns:
                    matches += [tuple(rows)]
            matches = eliminate_common_matches(matches)
        matches_to_unique_ind = []
        l = 0
        if len(matches) > 0:
            l = len(matches[0])
            for m in matches:
                m_unique_ind = sorted([unique_ind_permutation[x] for x in m])
                matches_to_unique_ind += [m_unique_ind]
        return matches_to_unique_ind, l

    def eliminate_common_matches(matches):
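        # when two matches overlap, keep the one with the smaller minimum index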
        for m in matches:
            aux = matches.copy()
            aux.remove(m)
            for a in aux:
                common = (set(m) & set(a))
                if len(common) > 0:
                    min_m = min(m)
                    min_a = min(a)
                    if min_m <= min_a:
                        matches.remove(a)
                    else:
                        matches.remove(m)
        return matches

    def find_matched(unique_indexes, matches):
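        # collect which keys are covered by a match and which are not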
        matched = []
        unmatched = []
        for unique_ind in unique_indexes:
            for m in matches:
                if unique_ind in m:
                    matched += [unique_ind]
                else:
                    unmatched += [unique_ind]
        return matched, unmatched

    if matched is not None:
        my_dict = update_my_dict(my_dict, matched)
    unique_indexes = list(my_dict.keys())
    p_unique_indexes = list(permutations(unique_indexes))
    matches = []
    last_l = 0
    for p in p_unique_indexes:
        m, l = get_matches(p, my_dict)
        if last_l < l:
            matches.clear()
            last_l = l
        if last_l == l and l > 0:
            matches += m
    matches = set(tuple(l) for l in matches)
    matches_order = []
    for m in matches:
        mx = sorted([unique_indexes.index(unique_ind_x) for unique_ind_x in m])
        matches_order += [mx]
    matches_order = eliminate_common_matches(matches_order)
    matches = []
    for mo in matches_order:
        mx = [unique_indexes[x] for x in mo]
        matches += [mx]
    matched, unmatched = find_matched(unique_indexes, matches)
    return matches, matched, unmatched

my_dict = {1:[1, 2, 3, 4],
           2:[2, 1, 4],
           3:[3, 1],
           4:[4, 1, 2]}     
unique_indexes = list(my_dict.keys())
matches = []
matched = None
while True:
    instance_matches, matched, unmatched = group_matches(my_dict, matched)
    if len(instance_matches) > 0:
        matches += instance_matches
    if len(unmatched) == 0 or len(instance_matches) == 0:
        break
unmatched = list(set(unique_indexes) - set(list(chain(*matches))))
matches_unique = []
for i, x in enumerate(matches):
    if x not in matches[:i]:
        matches_unique += [x]
matches = matches_unique + unmatched      
print(matches)

Another, more complex example:


my_dict = {
            'a':['a', 'b', 'c', 'h'],
            'b':['b', 'a', 'c', 'i'],
            'c':['c', 'a', 'b', 'd', 'e'],
            'd':['d', 'c', 'e', 'f'],
            'e':['e', 'c', 'd', 'f'],
            'f':['f', 'd', 'e', 'g', 'h'],
            'g':['g', 'f', 'h'],
            'h':['h', 'a', 'f', 'g'],
            'i':['i',  'b']
          }
# expected outcome:
res = [['c', 'd', 'e'], ['f', 'g', 'h'], ['a', 'b'], ['i']]

The subset ['d', 'e', 'f'] is not part of the expected result, because 'd' and 'e' have already been taken by the first subset.

You can treat your lists/sets problem as a graph problem, by considering the sets as nodes and the values they contain as the edges between them. This works because:

  • if element 1 contains the value 3, then the 3rd list will, in its turn, always contain the value 1 itself,

    meaning that all the edges are bidirectional

  • the first element of each list is always the index of the element

    meaning that every node is connected to itself
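Both properties are easy to verify directly (a quick sketch, using the first input and its 1-based indexing):

l = [[1, 2, 3, 4], [2, 1, 4], [3, 1], [4, 1, 2]]

# the first element of each list is its own (1-based) index: a self-loop
assert all(lst[0] == i for i, lst in enumerate(l, start=1))

# membership is symmetric, i.e. every edge is bidirectional
members = {lst[0]: set(lst) for lst in l}
assert all(a in members[b] for a, grp in members.items() for b in grp)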

For example, your first input is [[1, 2, 3, 4], [2, 1, 4], [3, 1], [4, 1, 2]], so the corresponding graph is:

┌───┐         ┌───┐
│   │         │   │
└─set 1─────set 2─┘
    │  \      │
    │   \     │
    │    \    │
    │     \   │
    │      \  │
    │       \ │
┌─set 3     set 4─┐
│   │         │   │
└───┘         └───┘

From this perspective, "find the largest common subset of these sets" translates to "find the largest subgraph in which all nodes are connected to each other", also known as the maximum clique problem.

Indeed, the maximum clique of this graph is {1, 2, 4}.
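As a sanity check, this is easy to confirm with the third-party networkx library (a sketch, assuming networkx is available; find_cliques enumerates the maximal cliques):

import networkx as nx

g = nx.Graph()
for node, *neighbours in [[1, 2, 3, 4], [2, 1, 4], [3, 1], [4, 1, 2]]:
    g.add_node(node)
    g.add_edges_from((node, n) for n in neighbours if n != node)

# the largest of the maximal cliques is the maximum clique
print(max(nx.find_cliques(g), key=len))  # [1, 2, 4] (in some order)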

Based on this approach, here is a working solution:

from typing import List, Set, Tuple


def get_maximum_clique(edges: Set[Tuple[int, int]], vertices_number: Set[int]) -> Set[int]:
    biggest_maximal_clique_so_far = set()  # trivial (empty) solution
    # to find the maximum clique, we start by iterating over each vertex (which by itself is already a clique)
    for seed_vertex in vertices_number:
        # and we try to grow the clique by adding a new vertex, iff it is connected to every vertex of the clique
        # until we exhausted every possible vertex to add
        candidate_vertices = vertices_number.difference({seed_vertex})
        clique = {seed_vertex}
        while candidate_vertices:  # is not empty
            candidate_vertex = candidate_vertices.pop()
            # if the candidate vertex is connected to all the vertices already in the clique
            if all((vertex, candidate_vertex) in edges for vertex in clique):
                # grow the clique
                clique.update({candidate_vertex})
        # we grew the maximal clique from this seed, let's check its size
        if not biggest_maximal_clique_so_far or len(clique) > len(biggest_maximal_clique_so_far):
            biggest_maximal_clique_so_far = clique

    maximum_clique = biggest_maximal_clique_so_far
    return maximum_clique


def get_largest_most_frequent_subsets(list_of_lists: List[List[int]]) -> List[List[int]]:
    # we prepare a list of all vertices of the graph
    all_vertices_number = set(connected_vertex for vertex_edges in list_of_lists for connected_vertex in vertex_edges)

    connected_vertices_of_each_vertex = list_of_lists  # simple renaming

    decreasing_maximum_cliques = []
    while any(connected_vertices_of_each_vertex):  # contains a non-empty subset i.e. a vertex with remaining edges
        # we create our graph as a set of edges of the form (v1, v2)
        edges = {
            (vertex1, vertex2)
            for vertex1, *edges in (connected_vertices for connected_vertices in connected_vertices_of_each_vertex
                                    if len(connected_vertices) >= 2)
            for vertex2 in edges
            # we could add `if vertex1 < vertex2` to halve the number of directional edges because all edges are
            # bidirectional, but I prefer not to worry with the order later
        }

        # we get the clique for this graph
        maximum_clique_remaining = get_maximum_clique(edges, all_vertices_number)
        decreasing_maximum_cliques.append(maximum_clique_remaining)
        # we remove the edges from this clique to all vertices left,
        # and all edges from other vertices to those of the clique
        connected_vertices_of_each_vertex = [
            [
                connected_vertex for connected_vertex in connected_vertices
                if connected_vertex not in maximum_clique_remaining
            ]
            if (connected_vertices and connected_vertices[0] not in maximum_clique_remaining) else []
            for connected_vertices in connected_vertices_of_each_vertex
        ]
        all_vertices_number.difference_update(maximum_clique_remaining)

    return [list(sorted(clique)) for clique in decreasing_maximum_cliques]  # serves only to match the expected output


def test__get_largest_most_frequent_subsets() -> None:
    assert get_largest_most_frequent_subsets([[1, 2, 3, 4], [2, 1, 4], [3, 1], [4, 1, 2]]) == [[1, 2, 4], [3]]
    assert get_largest_most_frequent_subsets([[1, 2, 3, 4], [2, 1], [3, 1], [4, 1]]) == [[1, 2], [3], [4]]
    assert get_largest_most_frequent_subsets([[1], [2, 3, 4], [3, 2, 5, 6], [4, 2], [5, 3], [6, 3]]) == [[2, 3], [1], [4], [5], [6]]

    my_dict = {
        'a': ['a', 'b', 'c', 'h'],
        'b': ['b', 'a', 'c', 'i'],
        'c': ['c', 'a', 'b', 'd', 'e'],
        'd': ['d', 'c', 'e', 'f'],
        'e': ['e', 'c', 'd', 'f'],
        'f': ['f', 'd', 'e', 'g', 'h'],
        'g': ['g', 'f', 'h'],
        'h': ['h', 'a', 'f', 'g'],
        'i': ['i', 'b']
    }
    my_list_of_lists = list(values for key, values in sorted(my_dict.items()))
    result = get_largest_most_frequent_subsets(my_list_of_lists)
    # there are multiple valid combinations of decreasing maximum cliques:
    assert (result == [['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h'], ['i']]) \
           or (result == [['a', 'b', 'c'], ['f', 'g', 'h'], ['d', 'e'], ['i']]) \
           or (result == [['c', 'd', 'e'], ['f', 'g', 'h'], ['a', 'b'], ['i']]) \
           or (result == [['c', 'd', 'e'], ['f', 'g', 'h'], ['b', 'i'], ['a']]) \
           or (result == [['d', 'e', 'f'], ['a', 'b', 'c'], ['g', 'h'], ['i']]) \
           or (result == [['f', 'g', 'h'], ['a', 'b', 'c'], ['d', 'e'], ['i']]) \
           or (result == [['f', 'g', 'h'], ['c', 'd', 'e'], ['b', 'i'], ['a']]), repr(result)


if __name__ == "__main__":
    test__get_largest_most_frequent_subsets()

My code passes your 4 tests, and it is fast:

if __name__ == "__main__":
    import timeit
    print(timeit.timeit("test__get_largest_most_frequent_subsets()", number=1000, globals=locals()))
    # 0.1920557

If it is not fast enough for you (you did not ask for a specific duration), there may be ways to optimize this code; I did not try to profile it.

I implemented a greedy algorithm, but the Wikipedia page lists some clever algorithms with lower computational complexity, which I did not try to implement. High-performance implementations of the maximum clique problem already exist; for example, this answer mentions one for the weighted maximum clique in non-parallel C++/Boost (your problem is a special case of it, where all the weights are equal).

But I do not think a performant version of what you are looking for (what I would call "decreasing maximum cliques") already exists.
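For what it's worth, the same "decreasing maximum cliques" loop can also be sketched on top of networkx (again assuming networkx is acceptable as a dependency; ties between equal-size cliques are broken arbitrarily, so you get one of the valid outcomes):

import networkx as nx

def decreasing_maximum_cliques(list_of_lists):
    g = nx.Graph()
    for node, *neighbours in list_of_lists:
        g.add_node(node)
        g.add_edges_from((node, n) for n in neighbours if n != node)
    cliques = []
    while g.number_of_nodes() > 0:
        # take a largest maximal clique, then delete its vertices entirely
        clique = max(nx.find_cliques(g), key=len)
        cliques.append(sorted(clique))
        g.remove_nodes_from(clique)
    return cliques

print(decreasing_maximum_cliques([[1, 2, 3, 4], [2, 1, 4], [3, 1], [4, 1, 2]]))
# [[1, 2, 4], [3]]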