预排序提高 itertools.combinations 性能?
pre-sorting improves itertools.combinations performances?
我目前在做大数据,工作中一个重要的步骤就是获取上千个字符串的所有排序组合。
在python 3.4中,我尝试了三种方法来执行这个操作。
首先,将数据伪造为字符串列表,有或没有 GO 术语:
# -*- coding: utf-8 -*-
import random as rd
# With GO term
# data = ["GO:" + str(i) for i in range(0, 10000)]
# Xor, without GO term
data = [str(i) for i in range(0, 10000)]
rd.shuffle(data)
print("shuffle done")
GO项只是添加到所有数据字符串开头的常量字符串。有和没有 GO 术语的结果如下。
现在,执行基准测试:
from itertools import combinations, product
def test1(): # use condition
g = ((i,j) if (i < j) else (j,i) for i, j in combinations(data, 2))
print(len([(i, j) for i, j in g]))
def test2(): # use filter
g = ((i, j) for i,j in product(data, data) if (i < j))
print(len([(i, j) for i, j in g]))
def test3(): # sort before combine
g = ((i,j) for i, j in combinations(sorted(data), 2))
print(len([(i, j) for i, j in g]))
import timeit
print(timeit.timeit(stmt=test1, number=3))
print(timeit.timeit(stmt=test2, number=3))
print(timeit.timeit(stmt=test3, number=3))
GO 项的输出:
49995000
49995000
49995000
23.490827083587646
49995000
49995000
49995000
31.04393219947815
49995000
49995000
49995000
16.878661155700684
没有 GO 项的输出:
49995000
49995000
49995000
22.99237084388733
49995000
49995000
49995000
29.025460958480835
49995000
49995000
49995000
16.596422910690308
每次调用打印的 49995000 向我们证明所有方法都产生相同数量的数据。
现在的问题:
- 为什么第一种方法比第二种方法快?使用过滤器是我用来过滤数据的主要方式。到目前为止,我认为过滤器形式已经过大量优化。
- 为什么第三种方法似乎更好?排序是 O(N*log(N)),它似乎在大型列表上很昂贵?
- 为什么 GO 术语对第一种和第二种方法的影响大于对第三种方法的影响?
我的第一个猜测是排序+组合产生的数据比较要少得多,而其他两种方法对每对字符串进行比较,但是,由于排序听起来像是一项繁重的操作,我不太确定关于那个。
- 看到 combinations creates only the combinations where
i<j
holds true while product 如何创建所有组合,包括 i<j
不正确的组合 - test1
中的 if (i < j) else (j,i)
是多余的,这真的不是什么大惊喜。在 test1
中省略此检查会大大减少执行时间,如下所示。
用i<j
签入test1
:
shuffle done
49995000
49995000
49995000
31.66194307899991
49995000
49995000
49995000
37.66488860800018
49995000
49995000
49995000
22.706632076000005
没有 i<j
签入 test1
:
shuffle done
49995000
49995000
49995000
25.07709688900013
49995000
49995000
49995000
39.405620851000094
49995000
49995000
49995000
23.54182383899979
我很确定造成您观察到的性能差异的重要因素是检查 if (i < j)
49995000 次与对 10000 个元素的列表进行排序,而不是假设的已排序与未排序的可迭代对象。
combinations
在这两种情况下应该做相同数量的工作,因为它们产生相同数量的元素并且不对元素进行排序并且 returns 它们按字典顺序排列。
要正确测试排序是否有所不同:
对同一组数据执行相同的条件检查,但已排序和未排序:
sorted_data = sorted(data)
def test1():
g = ((i,j) if (i < j) else (j,i) for i, j in combinations(sorted_data, 2))
return len([(i, j) for i, j in g])
def test2():
g = ((i,j) if (i < j) else (j,i) for i, j in combinations(data, 2))
return len([(i, j) for i, j in g])
%timeit test1()
1 loops, best of 3: 23.5 s per loop
%timeit test2()
1 loops, best of 3: 24.6 s per loop
在没有条件的情况下执行测试:
def test3():
g = ((i,j) for i, j in combinations(sorted_data, 2))
return len([(i, j) for i, j in g])
def test4():
g = ((i,j) for i, j in combinations(data, 2))
return len([(i, j) for i, j in g])
%timeit test3()
1 loops, best of 3: 20.7 s per loop
%timeit test4()
1 loops, best of 3: 21.3 s per loop
Why is the first method so quick, compared to the second ? Use of filter is the main way i use for filtering data. Until now, i assume that the filter form was heavily optimized.
使用组合会产生较少的条件检查元素。 10000C2 = 49995000
组合与 10000**2 = 100000000
产品。
Why the GO term impact more the first and the second method than the third one ?
第一种方法和第二种方法受额外比较字符影响49995000次和100000000次。第三个仅受对 10000 个项目进行排序所需的比较的影响。
经过一番调整后,似乎排序可能会有所不同,但不如使用条件排序那么大。不知道是什么原因造成的。
from itertools import combinations
import random as rd
data = ["{0:04d}".format(i) for i in range(0, 10000)] # Normalize str length
rd.shuffle(data)
sorted_data = sorted(data)
reversed_sorted_data = sorted_data[::-1]
def test1():
g = list((i,j) if (i < j) else (j,i) for i, j in combinations(data, 2))
print('unsorted with conditional: ', len(g))
%timeit test1()
# unsorted with conditional: 49995000
# unsorted with conditional: 49995000
# unsorted with conditional: 49995000
# unsorted with conditional: 49995000
# 1 loops, best of 3: 20.7 s per loop
def test2():
g = list((i,j) if (i < j) else (j,i) for i, j in combinations(sorted_data, 2))
print('sorted with conditional: ', len(g))
%timeit test2()
# sorted with conditional: 49995000
# sorted with conditional: 49995000
# sorted with conditional: 49995000
# sorted with conditional: 49995000
# 1 loops, best of 3: 19.6 s per loop
def test3():
g = list((i,j) for i, j in combinations(data, 2))
print('unsorted without conditional: ', len(g))
%timeit test3()
# unsorted without conditional: 49995000
# unsorted without conditional: 49995000
# unsorted without conditional: 49995000
# unsorted without conditional: 49995000
# 1 loops, best of 3: 15.7 s per loop
def test4():
g = list((i,j) for i, j in combinations(sorted_data, 2))
print('sorted without conditional: ', len(g))
%timeit test4()
# sorted without conditional: 49995000
# sorted without conditional: 49995000
# sorted without conditional: 49995000
# sorted without conditional: 49995000
# 1 loops, best of 3: 15.3 s per loop
def test5():
g = list((i,j) for i, j in combinations(reversed_sorted_data, 2))
print('reverse sorted without conditional: ', len(g))
%timeit test5()
# reverse sorted without conditional: 49995000
# reverse sorted without conditional: 49995000
# reverse sorted without conditional: 49995000
# reverse sorted without conditional: 49995000
# 1 loops, best of 3: 15 s per loop
我目前在做大数据,工作中一个重要的步骤就是获取上千个字符串的所有排序组合。
在python 3.4中,我尝试了三种方法来执行这个操作。
首先,将数据伪造为字符串列表,有或没有 GO 术语:
# -*- coding: utf-8 -*-
import random as rd
# With GO term
# data = ["GO:" + str(i) for i in range(0, 10000)]
# Xor, without GO term
data = [str(i) for i in range(0, 10000)]
rd.shuffle(data)
print("shuffle done")
GO项只是添加到所有数据字符串开头的常量字符串。有和没有 GO 术语的结果如下。
现在,执行基准测试:
from itertools import combinations, product
def test1(): # use condition
g = ((i,j) if (i < j) else (j,i) for i, j in combinations(data, 2))
print(len([(i, j) for i, j in g]))
def test2(): # use filter
g = ((i, j) for i,j in product(data, data) if (i < j))
print(len([(i, j) for i, j in g]))
def test3(): # sort before combine
g = ((i,j) for i, j in combinations(sorted(data), 2))
print(len([(i, j) for i, j in g]))
import timeit
print(timeit.timeit(stmt=test1, number=3))
print(timeit.timeit(stmt=test2, number=3))
print(timeit.timeit(stmt=test3, number=3))
GO 项的输出:
49995000
49995000
49995000
23.490827083587646
49995000
49995000
49995000
31.04393219947815
49995000
49995000
49995000
16.878661155700684
没有 GO 项的输出:
49995000
49995000
49995000
22.99237084388733
49995000
49995000
49995000
29.025460958480835
49995000
49995000
49995000
16.596422910690308
每次调用打印的 49995000 向我们证明所有方法都产生相同数量的数据。 现在的问题:
- 为什么第一种方法比第二种方法快?使用过滤器是我用来过滤数据的主要方式。到目前为止,我认为过滤器形式已经过大量优化。
- 为什么第三种方法似乎更好?排序是 O(N*log(N)),它似乎在大型列表上很昂贵?
- 为什么 GO 术语对第一种和第二种方法的影响大于对第三种方法的影响?
我的第一个猜测是排序+组合产生的数据比较要少得多,而其他两种方法对每对字符串进行比较,但是,由于排序听起来像是一项繁重的操作,我不太确定关于那个。
- 看到 combinations creates only the combinations where
i<j
holds true while product 如何创建所有组合,包括i<j
不正确的组合 -test1
中的if (i < j) else (j,i)
是多余的,这真的不是什么大惊喜。在test1
中省略此检查会大大减少执行时间,如下所示。
用i<j
签入test1
:
shuffle done
49995000
49995000
49995000
31.66194307899991
49995000
49995000
49995000
37.66488860800018
49995000
49995000
49995000
22.706632076000005
没有 i<j
签入 test1
:
shuffle done
49995000
49995000
49995000
25.07709688900013
49995000
49995000
49995000
39.405620851000094
49995000
49995000
49995000
23.54182383899979
我很确定造成您观察到的性能差异的重要因素是检查 if (i < j)
49995000 次与对 10000 个元素的列表进行排序,而不是假设的已排序与未排序的可迭代对象。
combinations
在这两种情况下应该做相同数量的工作,因为它们产生相同数量的元素并且不对元素进行排序并且 returns 它们按字典顺序排列。
要正确测试排序是否有所不同:
对同一组数据执行相同的条件检查,但已排序和未排序:
sorted_data = sorted(data) def test1(): g = ((i,j) if (i < j) else (j,i) for i, j in combinations(sorted_data, 2)) return len([(i, j) for i, j in g]) def test2(): g = ((i,j) if (i < j) else (j,i) for i, j in combinations(data, 2)) return len([(i, j) for i, j in g]) %timeit test1() 1 loops, best of 3: 23.5 s per loop %timeit test2() 1 loops, best of 3: 24.6 s per loop
在没有条件的情况下执行测试:
def test3(): g = ((i,j) for i, j in combinations(sorted_data, 2)) return len([(i, j) for i, j in g]) def test4(): g = ((i,j) for i, j in combinations(data, 2)) return len([(i, j) for i, j in g]) %timeit test3() 1 loops, best of 3: 20.7 s per loop %timeit test4() 1 loops, best of 3: 21.3 s per loop
Why is the first method so quick, compared to the second ? Use of filter is the main way i use for filtering data. Until now, i assume that the filter form was heavily optimized.
使用组合会产生较少的条件检查元素。 10000C2 = 49995000
组合与 10000**2 = 100000000
产品。
Why the GO term impact more the first and the second method than the third one ?
第一种方法和第二种方法受额外比较字符影响49995000次和100000000次。第三个仅受对 10000 个项目进行排序所需的比较的影响。
经过一番调整后,似乎排序可能会有所不同,但不如使用条件排序那么大。不知道是什么原因造成的。
from itertools import combinations
import random as rd
data = ["{0:04d}".format(i) for i in range(0, 10000)] # Normalize str length
rd.shuffle(data)
sorted_data = sorted(data)
reversed_sorted_data = sorted_data[::-1]
def test1():
g = list((i,j) if (i < j) else (j,i) for i, j in combinations(data, 2))
print('unsorted with conditional: ', len(g))
%timeit test1()
# unsorted with conditional: 49995000
# unsorted with conditional: 49995000
# unsorted with conditional: 49995000
# unsorted with conditional: 49995000
# 1 loops, best of 3: 20.7 s per loop
def test2():
g = list((i,j) if (i < j) else (j,i) for i, j in combinations(sorted_data, 2))
print('sorted with conditional: ', len(g))
%timeit test2()
# sorted with conditional: 49995000
# sorted with conditional: 49995000
# sorted with conditional: 49995000
# sorted with conditional: 49995000
# 1 loops, best of 3: 19.6 s per loop
def test3():
g = list((i,j) for i, j in combinations(data, 2))
print('unsorted without conditional: ', len(g))
%timeit test3()
# unsorted without conditional: 49995000
# unsorted without conditional: 49995000
# unsorted without conditional: 49995000
# unsorted without conditional: 49995000
# 1 loops, best of 3: 15.7 s per loop
def test4():
g = list((i,j) for i, j in combinations(sorted_data, 2))
print('sorted without conditional: ', len(g))
%timeit test4()
# sorted without conditional: 49995000
# sorted without conditional: 49995000
# sorted without conditional: 49995000
# sorted without conditional: 49995000
# 1 loops, best of 3: 15.3 s per loop
def test5():
g = list((i,j) for i, j in combinations(reversed_sorted_data, 2))
print('reverse sorted without conditional: ', len(g))
%timeit test5()
# reverse sorted without conditional: 49995000
# reverse sorted without conditional: 49995000
# reverse sorted without conditional: 49995000
# reverse sorted without conditional: 49995000
# 1 loops, best of 3: 15 s per loop