Does pre-sorting improve itertools.combinations performance?

I currently work with large datasets, and an important step in my workflow is to generate all pairwise combinations of thousands of strings.

In Python 3.4, I tried three ways to perform this operation.

First, fake some data as a list of strings, with or without a GO term:

# -*- coding: utf-8 -*-
import random as rd

# With GO term
# data = ["GO:" + str(i) for i in range(0, 10000)]
# Xor, without GO term
data = [str(i) for i in range(0, 10000)]

rd.shuffle(data)
print("shuffle done")

The GO term is just a constant string prepended to every data string. Results with and without the GO term are given below.

Now, run the benchmark:

from itertools import combinations, product

def test1(): # use condition
    g = ((i,j) if (i < j) else (j,i) for i, j in combinations(data, 2))
    print(len([(i, j) for i, j in g]))


def test2(): # use filter
    g = ((i, j) for i,j in product(data, data) if (i < j))
    print(len([(i, j) for i, j in g]))


def test3(): # sort before combine
    g = ((i,j) for i, j in combinations(sorted(data), 2))
    print(len([(i, j) for i, j in g]))


import timeit

print(timeit.timeit(stmt=test1, number=3))
print(timeit.timeit(stmt=test2, number=3))
print(timeit.timeit(stmt=test3, number=3))

Output with the GO term:

49995000
49995000
49995000
23.490827083587646
49995000
49995000
49995000
31.04393219947815
49995000
49995000
49995000
16.878661155700684

Output without the GO term:

49995000
49995000
49995000
22.99237084388733
49995000
49995000
49995000
29.025460958480835
49995000
49995000
49995000
16.596422910690308

The 49995000 printed by each call shows that all three methods produce the same number of pairs. Now the questions:

  1. Why is the first method faster than the second? Filtering is the main way I process data; until now I assumed the filter form was heavily optimized.
  2. Why does the third method seem better? Sorting is O(N*log(N)), which seems expensive on a large list.
  3. Why does the GO term impact the first and second methods more than the third?

My first guess is that sorting + combinations produces far fewer comparisons, while the other two methods run a comparison for every pair of strings; but since sorting sounds like a heavy operation, I'm not really sure about that.

Seeing how combinations creates only the pairs where i<j holds (by position) while product creates all pairs, including those where i<j does not hold - so that the if (i < j) else (j,i) in test1 is essentially redundant - this really isn't a big surprise. Omitting this check in test1 reduces execution time considerably, as shown below.

With the i<j check in test1:

shuffle done
49995000
49995000
49995000
31.66194307899991
49995000
49995000
49995000
37.66488860800018
49995000
49995000
49995000
22.706632076000005

Without the i<j check in test1:

shuffle done
49995000
49995000
49995000
25.07709688900013
49995000
49995000
49995000
39.405620851000094
49995000
49995000
49995000
23.54182383899979

I'm fairly sure that the significant factor behind the performance difference you observed is checking if (i < j) 49995000 times versus sorting a list of 10000 elements once, not the hypothesized sorted vs. unsorted iterable.

combinations should do the same amount of work in both cases: it yields the same number of elements, does not compare the elements themselves, and returns them in lexicographic order of their input positions.
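A quick sanity check (a minimal illustration, not from the original post) confirms that combinations emits pairs in input order and never reorders the elements:

```python
from itertools import combinations

# combinations follows the input order; on unsorted input,
# a pair (i, j) with i > j lexicographically is perfectly possible.
print(list(combinations(['b', 'a', 'c'], 2)))
# -> [('b', 'a'), ('b', 'c'), ('a', 'c')]
```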

To properly test whether sorting makes a difference:

  1. Run the same conditional check on the same set of data, sorted and unsorted:

    sorted_data = sorted(data)
    
    def test1():
        g = ((i,j) if (i < j) else (j,i) for i, j in combinations(sorted_data, 2))
        return len([(i, j) for i, j in g])
    
    def test2():
        g = ((i,j) if (i < j) else (j,i) for i, j in combinations(data, 2))
        return len([(i, j) for i, j in g])
    
    %timeit test1()
    1 loops, best of 3: 23.5 s per loop
    
    %timeit test2()
    1 loops, best of 3: 24.6 s per loop
    
  2. Run the tests without the conditional:

    def test3():
        g = ((i,j) for i, j in combinations(sorted_data, 2))
        return len([(i, j) for i, j in g])
    
    def test4():
        g = ((i,j) for i, j in combinations(data, 2))
        return len([(i, j) for i, j in g])
    
    %timeit test3()
    1 loops, best of 3: 20.7 s per loop
    
    %timeit test4()
    1 loops, best of 3: 21.3 s per loop
    

Why is the first method so quick, compared to the second? Use of filter is the main way I filter data. Until now, I assumed that the filter form was heavily optimized.

Using combinations produces fewer elements to run the conditional check on: C(10000, 2) = 49995000 combinations versus 10000**2 = 100000000 products.
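The counts are easy to verify (a trivial check using the n*(n-1)/2 formula rather than Python 3.8's math.comb, since the question targets 3.4):

```python
n = 10000
print(n * (n - 1) // 2)  # pairs yielded by combinations(data, 2): 49995000
print(n ** 2)            # pairs yielded by product(data, data): 100000000
```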

Why does the GO term impact the first and the second method more than the third one?

The first and second methods pay for the extra character comparison 49995000 and 100000000 times respectively. The third pays for it only during the comparisons needed to sort the 10000 items once.
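To see how lopsided those counts are, one can count the element comparisons that sorted() actually performs, using a small str subclass (an illustrative sketch; the Counted class is not part of the original benchmark):

```python
import random
from itertools import combinations

class Counted(str):
    """str subclass that counts how often it is compared with <."""
    lt_calls = 0
    def __lt__(self, other):
        Counted.lt_calls += 1
        return str.__lt__(self, other)

data = [Counted(i) for i in range(10000)]
random.shuffle(data)

Counted.lt_calls = 0
sorted(data)
print(Counted.lt_calls)  # on the order of n*log2(n), far below 49995000

Counted.lt_calls = 0
for pair in combinations(data, 2):
    pass
print(Counted.lt_calls)  # 0 - combinations never compares elements
```

On a shuffled list of 10000 items, sorting needs roughly n*log2(n) ≈ 130000 comparisons, two orders of magnitude fewer than the 49995000 conditional checks.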


After some tweaking, it seems sorting may make a difference, but not as large a one as the conditional does. Not sure what causes it.

from itertools import combinations
import random as rd

data = ["{0:04d}".format(i) for i in range(0, 10000)] # Normalize str length
rd.shuffle(data)

sorted_data = sorted(data)
reversed_sorted_data = sorted_data[::-1]

def test1():
    g = list((i,j) if (i < j) else (j,i) for i, j in combinations(data, 2))
    print('unsorted with conditional: ', len(g))

%timeit test1()
# unsorted with conditional:  49995000
# unsorted with conditional:  49995000
# unsorted with conditional:  49995000
# unsorted with conditional:  49995000
# 1 loops, best of 3: 20.7 s per loop

def test2():
    g = list((i,j) if (i < j) else (j,i) for i, j in combinations(sorted_data, 2))
    print('sorted with conditional: ', len(g))

%timeit test2()
# sorted with conditional:  49995000
# sorted with conditional:  49995000
# sorted with conditional:  49995000
# sorted with conditional:  49995000
# 1 loops, best of 3: 19.6 s per loop

def test3():
    g = list((i,j) for i, j in combinations(data, 2))
    print('unsorted without conditional: ', len(g))
    
%timeit test3()
# unsorted without conditional:  49995000
# unsorted without conditional:  49995000
# unsorted without conditional:  49995000
# unsorted without conditional:  49995000
# 1 loops, best of 3: 15.7 s per loop

def test4():
    g = list((i,j) for i, j in combinations(sorted_data, 2))
    print('sorted without conditional: ', len(g))

%timeit test4()
# sorted without conditional:  49995000
# sorted without conditional:  49995000
# sorted without conditional:  49995000
# sorted without conditional:  49995000
# 1 loops, best of 3: 15.3 s per loop

def test5():
    g = list((i,j) for i, j in combinations(reversed_sorted_data, 2))
    print('reverse sorted without conditional: ', len(g))
    
%timeit test5()
# reverse sorted without conditional:  49995000
# reverse sorted without conditional:  49995000
# reverse sorted without conditional:  49995000
# reverse sorted without conditional:  49995000
# 1 loops, best of 3: 15 s per loop