查找同一集合的两个分区之间所有不同交集的 Pythonic 和有效方法

Pythonic and efficient way to find all the different intersections between two partitions of the same set

我需要找到同一组的两个分区之间的所有不同交集。例如,如果我们有以下两个相同集合的分区

x = [[1, 2], [3, 4, 5], [6, 7, 8, 9, 10]]
y = [[1, 3, 6, 7], [2, 4, 5, 8, 9, 10]]

要求的结果是

[[1], [2], [3], [4, 5], [6, 7], [8, 9, 10]].

详细地说,我们计算 x 和 y 的每个子集之间的笛卡尔积,并且对于这些乘积中的每一个,如果它们属于其相关子集的交集,我们将相应地对新子集中的元素进行分类。

什么是最佳的/更 pythonic 的方式来做到这一点?提前致谢!


当前答案的性能比较:

import numpy as np

def partitioning(alist, indices):
    return [alist[i:j] for i, j in zip([0]+indices, indices+[None])]

total = 1000
sample1 = np.sort(np.random.choice(total, int(total/10), replace=False))
sample2 = np.sort(np.random.choice(total, int(total/2), replace=False))

a = partitioning(np.arange(total), list(sample1))
b = partitioning(np.arange(total), list(sample2))

def partition_decomposition_product_1(x, y):
    out = []
    for sublist1 in x:
        d = {}
        for val in sublist1:
            for i, sublist2 in enumerate(y):
                if val in sublist2:
                    d.setdefault(i, []).append(val)
        out.extend(d.values())
    return out

def partition_decomposition_product_2(x, y):
    all_s = []
    for sx in x:
        for sy in y:
            ss = list(filter(lambda x:x in sx, sy))
            if ss:
                all_s.append(ss)
    return all_s

def partition_decomposition_product_3(x, y):
    return [np.intersect1d(i,j) for i in x for j in y]

并使用 %timeit

测量执行时间
%timeit partition_decomposition_product_1(a, b)
%timeit partition_decomposition_product_2(a, b)
%timeit partition_decomposition_product_3(a, b)

我们发现

2.16 s ± 246 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
620 ms ± 84.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.03 s ± 111 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

因此第二种解决方案是最快的。

我不确定我是否理解正确,但这个脚本产生了你在问题中得到的结果:

x = [[1, 2], [3, 4, 5], [6, 7, 8, 9, 10]]
y = [[1, 3, 6, 7], [2, 4, 5, 8, 9, 10]]

out = []
for sublist1 in x:
    d = {}
    for val in sublist1:
        for i, sublist2 in enumerate(y):
            if val in sublist2:
                d.setdefault(i, []).append(val)
    out.extend(d.values())

print(out)

打印:

[[1], [2], [3], [4, 5], [6, 7], [8, 9, 10]]

这两个列表是同一个集合的分区与算法选择无关。这归结为遍历两个列表列表并获取每个组合之间的交集(您可以在函数开头添加该断言以确保它们是同一集合的分区,使用 this answer to flatten the lists efficiently). With this in mind, this function accomplishes the task, using this answer 计算列表交集:

def func2(x, y):
    # check that they partition the same set 
    checkx = sorted([item for sublist in x for item in sublist])
    checky = sorted([item for sublist in y for item in sublist])
    assert checkx == checky

    # get all intersections
    all_s = []
    for sx in x:
        for sy in y:
            ss = list(filter(lambda x:x in sx, sy))
            if ss:
                all_s.append(ss)
    return all_s

然后使用 this time comparison method,我们可以看到这个新函数比您原来的实现快约 100 倍。

我可能会遗漏一些细节,但它似乎有点太简单了:

[np.intersect1d(a,b) for a in x for b in y]

输出:

[array([1]),
 array([2]),
 array([3]),
 array([4, 5]),
 array([6, 7]),
 array([ 8,  9, 10])]

以上包括重复项,例如 x=[[1,2,3],[1,4,5]]y=[[1,6,7]] 会给出 [[1],[1]]


如果你想找到独特的路口:

[list(i) for i in {tuple(np.intersect1d(a,b)) for a in x for b in y}]

输出:

[[8, 9, 10], [6, 7], [1], [4, 5], [2], [3]]