Generator function (yield) much faster than iterator class (__next__)
UPDATE (reflecting the current state of knowledge), status: 2017-05-12
The reason for this update is that, at the time I asked this question, I was not aware of some things I have since discovered about how Python3 works "under the hood".
The conclusion from everything that follows is:
If you write your own Python3 code for an iterator and care about speed of execution, you should write it as a generator function and not as an iterator class.
Below is a simple code example demonstrating that the same algorithm (here: a self-made version of Python's range()) runs much faster when expressed as a generator function than when expressed as an iterator class:
def gnrtYieldRange(startWith, endAt, step=1):
    while startWith <= endAt:
        yield startWith
        startWith += step

class iterClassRange:
    def __init__(self, startWith, endAt, step=1):
        self.startWith = startWith - 1
        self.endAt = endAt
        self.step = step
    def __iter__(self):
        return self
    def __next__(self):
        self.startWith += self.step
        if self.startWith <= self.endAt:
            return self.startWith
        else:
            raise StopIteration
N = 10000000
print(" Size of created list N = {} elements (ints 1 to N)".format(N))
from time import time as t
from customRange import gnrtYieldRange as cthnYieldRange
from customRange import cintYieldRange
from customRange import iterClassRange as cthnClassRange
from customRange import cdefClassRange
iterPythnRangeObj = range(1, N+1)
gnrtYieldRangeObj = gnrtYieldRange(1, N)
cthnYieldRangeObj = cthnYieldRange(1, N)
cintYieldRangeObj = cintYieldRange(1, N)
iterClassRangeObj = iterClassRange(1, N)
cthnClassRangeObj = cthnClassRange(1, N)
cdefClassRangeObj = cdefClassRange(1, N)
sEXECs = [
    "liPR = list(iterPythnRangeObj)",
    "lgYR = list(gnrtYieldRangeObj)",
    "lcYR = list(cthnYieldRangeObj)",
    "liGR = list(cintYieldRangeObj)",
    "liCR = list(iterClassRangeObj)",
    "lcCR = list(cthnClassRangeObj)",
    "ldCR = list(cdefClassRangeObj)"
]
sCOMMENTs = [
    "Python3 own range(1, N+1) used here as reference for timings ",
    "self-made range generator function using yield (run as it is) ",
    "self-made range (with yield) run from module created by Cython",
    "Cython-optimized self-made range (using yield) run from module",
    "self-made range as iterator class using __next__() and return ",
    "self-made range (using __next__) from module created by Cython",
    "Cython-optimized self-made range (using __next__) from module "
]
for idx, sEXEC in enumerate(sEXECs):
    s=t();exec(sEXEC);e=t();print("{} takes: {:3.1f} sec.".format(sCOMMENTs[idx], e-s))
print("All created lists are equal:", all([liPR == lgYR, lgYR == lcYR, lcYR == liGR, liGR == liCR, liCR == lcCR, lcCR == ldCR]) )
print("Run on Linux Mint 18.1, used Cython.__version__ == '0.25.2'")
The code above, put into a file and run, prints the following to stdout:
>python3.6 -u "gnrtFunction-fasterThan-iterClass_runMe.py"
Size of created list N = 10000000 elements (ints 1 to N)
Python3 own range(1, N+1) used here as reference for timings takes: 0.2 sec.
self-made range generator function using yield (run as it is) takes: 1.1 sec.
self-made range (with yield) run from module created by Cython takes: 0.5 sec.
Cython-optimized self-made range (using yield) run from module takes: 0.3 sec.
self-made range as iterator class using __next__() and return takes: 3.9 sec.
self-made range (using __next__) from module created by Cython takes: 3.3 sec.
Cython-optimized self-made range (using __next__) from module takes: 0.2 sec.
All created lists are equal: True
Run on Linux Mint 18.1, used Cython.__version__ == '0.25.2'
>Exit code: 0
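As a quick sanity check, both self-made variants can also be verified to produce exactly the same sequence as the built-in range (for the default step of 1, which is what the timings above use):

```python
# Sanity check (not part of the timing script above): the generator
# variant and the iterator-class variant yield identical sequences.
def gnrtYieldRange(startWith, endAt, step=1):
    while startWith <= endAt:
        yield startWith
        startWith += step

class iterClassRange:
    def __init__(self, startWith, endAt, step=1):
        self.startWith = startWith - 1
        self.endAt = endAt
        self.step = step
    def __iter__(self):
        return self
    def __next__(self):
        self.startWith += self.step
        if self.startWith <= self.endAt:
            return self.startWith
        raise StopIteration

assert list(gnrtYieldRange(1, 10)) == list(iterClassRange(1, 10)) == list(range(1, 11))
```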
From the timings above you can see that the generator-function variant of the self-made range() iterator runs faster than the iterator-class variant, and that this behavior, with no code optimization involved, also propagates down to the level of the C code created by Cython.
If you are curious why that is so, you can read the provided answers or play around a bit with the provided code yourself.
Below are the missing code snippets necessary to run the code above:
customRange.pyx - the file from which Cython creates the customRange module:
def gnrtYieldRange(startWith, endAt, step=1):
    while startWith <= endAt:
        yield startWith
        startWith += step

class iterClassRange:
    def __init__(self, startWith, endAt, step=1):
        self.startWith = startWith - 1
        self.endAt = endAt
        self.step = step
    def __iter__(self):
        return self
    def __next__(self):
        self.startWith += self.step
        if self.startWith <= self.endAt:
            return self.startWith
        else:
            raise StopIteration

def cintYieldRange(int startWith, int endAt, int step=1):
    while startWith <= endAt:
        yield startWith
        startWith += step

cdef class cdefClassRange:
    cdef int startWith
    cdef int endAt
    cdef int step

    def __init__(self, int startWith, int endAt, int step=1):
        self.startWith = startWith - 1
        self.endAt = endAt
        self.step = step
    def __iter__(self):
        return self
    def __next__(self):
        self.startWith += self.step
        if self.startWith <= self.endAt:
            return self.startWith
        else:
            raise StopIteration
and the setup file customRange-setup.py used to create the Python customRange module:
import sys
sys.argv += ['build_ext', '--inplace']
from distutils.core import setup
from Cython.Build import cythonize
setup(
    name = 'customRange',
    ext_modules = cythonize("customRange.pyx"),
)
Now some further information that makes it easier to understand the provided answers:
At the time I asked this question I was busy with a quite complex algorithm for generating unique combinations from a non-unique list, which was available in the form of a generator function using yield. My goal was to create, out of this algorithm, a Python extension module written in C in order to make it run faster. For that purpose I rewrote the generator function, which used yield, into an iterator class using __next__() and return. When I compared the speed of both variants of the algorithm I was surprised that the iterator class was twice as slow as the generator function, and I had (wrongly) assumed it had something to do with the way I rewrote the algorithm (you need to know this if you want to better understand what the answers here are about). Therefore I had
originally asked how to make the iterator class version run at the same speed as the generator function and where the speed difference comes from.
Below is some more information on the history of the question:
In the Python script code provided below, the exact same algorithm is implemented both as a Python function using yield and as a class using __next__. The code is ready to run after copy/paste, so you can see for yourself what I am talking about.
The same phenomenon observed for pure Python code propagates into the C code of a Python extension module created by Cython, so it is not limited to Python-level code, because it does not vanish at the C-code level.
The questions are:
Where does the huge difference in speed of execution come from?
Is there anything that can be done to get both code variants to run at comparable speed? Did something go wrong with the class/__next__ implementation compared to the function/yield variant? Both are, to my knowledge, exactly the same code ...
Here is the code (adjusting the numbers in the highlighted lines changes the level of uniqueness of the elements in the list from which the combinations are generated, which has a huge impact on the running time):
def uniqCmboYieldIter(lstItems, lenCmbo):
    dctCounter = {}
    lenLstItems = len(lstItems)
    for idx in range(lenLstItems):
        item = lstItems[idx]
        if item in dctCounter.keys():
            dctCounter[item] += 1
        else:
            dctCounter[item] = 1
        #:if
    #:for
    lstUniqs = sorted(dctCounter.keys())
    lstCntRpts = [dctCounter[item] for item in lstUniqs]
    lenUniqs = len(lstUniqs)
    cmboAsIdxUniqs = [None] * lenCmbo
    multiplicities = [0] * lenUniqs
    idxIntoCmbo, idxIntoUniqs = 0, 0
    while idxIntoCmbo != lenCmbo and idxIntoUniqs != lenUniqs:
        count = min(lstCntRpts[idxIntoUniqs], lenCmbo-idxIntoCmbo)
        cmboAsIdxUniqs[idxIntoCmbo : idxIntoCmbo + count] = [idxIntoUniqs] * count
        multiplicities[idxIntoUniqs] = count
        idxIntoCmbo += count
        idxIntoUniqs += 1
    if idxIntoCmbo != lenCmbo:
        return
    while True:
        yield tuple(lstUniqs[idxUniqs] for idxUniqs in cmboAsIdxUniqs)
        for idxIntoCmbo in reversed(range(lenCmbo)):
            x = cmboAsIdxUniqs[idxIntoCmbo]
            y = x + 1
            if y < lenUniqs and multiplicities[y] < lstCntRpts[y]:
                break
        else:
            return
        for idxIntoCmbo in range(idxIntoCmbo, lenCmbo):
            x = cmboAsIdxUniqs[idxIntoCmbo]
            cmboAsIdxUniqs[idxIntoCmbo] = y
            multiplicities[x] -= 1
            multiplicities[y] += 1
            # print("# multiplicities:", multiplicities)
            while y != lenUniqs and multiplicities[y] == lstCntRpts[y]:
                y += 1
            if y == lenUniqs:
                break
class uniqCmboClassIter:
    # ----------------------------------------------------------------------------------------------
    def __iter__(self):
        return self
    # ----------------------------------------------------------------------------------------------
    def __init__(self, lstItems, lenCmbo):
        dctCounter = {}
        lenLstItems = len(lstItems)
        for idx in range(lenLstItems):
            item = lstItems[idx]
            if item in dctCounter.keys():
                dctCounter[item] += 1
            else:
                dctCounter[item] = 1
            #:if
        #:for
        self.lstUniqs = sorted(dctCounter.keys())
        self.lenUniqs = len(self.lstUniqs)
        self.lstCntRpts = [dctCounter[item] for item in self.lstUniqs]
        self.lenCmbo = lenCmbo
        self.cmboAsIdxUniqs = [None] * lenCmbo
        self.multiplicities = [0] * self.lenUniqs
        self.idxIntoCmbo, self.idxIntoUniqs = 0, 0
        while self.idxIntoCmbo != self.lenCmbo and self.idxIntoUniqs != self.lenUniqs:
            count = min(self.lstCntRpts[self.idxIntoUniqs], self.lenCmbo-self.idxIntoCmbo)
            self.cmboAsIdxUniqs[self.idxIntoCmbo : self.idxIntoCmbo + count] = [self.idxIntoUniqs] * count
            self.multiplicities[self.idxIntoUniqs] = count
            self.idxIntoCmbo += count
            self.idxIntoUniqs += 1
        # print("self.multiplicities:", self.multiplicities)
        # print("self.cmboAsIdxUniqs:", self.cmboAsIdxUniqs)
        if self.idxIntoCmbo != self.lenCmbo:
            return
        self.stopIteration = False
        self.x = None
        self.y = None
        return
    # ----------------------------------------------------------------------------------------------
    def __next__(self):
        if self.stopIteration is True:
            raise StopIteration
        nextCmbo = tuple(self.lstUniqs[idxUniqs] for idxUniqs in self.cmboAsIdxUniqs)
        for self.idxIntoCmbo in reversed(range(self.lenCmbo)):
            self.x = self.cmboAsIdxUniqs[self.idxIntoCmbo]
            self.y = self.x + 1
            if self.y < self.lenUniqs and self.multiplicities[self.y] < self.lstCntRpts[self.y]:
                break
        else:
            self.stopIteration = True
            return nextCmbo
        for self.idxIntoCmbo in range(self.idxIntoCmbo, self.lenCmbo):
            self.x = self.cmboAsIdxUniqs[self.idxIntoCmbo]
            self.cmboAsIdxUniqs[self.idxIntoCmbo] = self.y
            self.multiplicities[self.x] -= 1
            self.multiplicities[self.y] += 1
            # print("# multiplicities:", multiplicities)
            while self.y != self.lenUniqs and self.multiplicities[self.y] == self.lstCntRpts[self.y]:
                self.y += 1
            if self.y == self.lenUniqs:
                break
        return nextCmbo
# ============================================================================================================================================
lstSize = 48 # 48
uniqLevel = 12 # (7 ~60% unique) higher level => more unique items in the generated list
aList = []
from random import randint
for _ in range(lstSize):
    aList.append( ( randint(1,uniqLevel), randint(1,uniqLevel) ) )
lenCmbo = 6
percUnique = 100.0 - 100.0*(lstSize-len(set(aList)))/lstSize
print("======================== lenCmbo:", lenCmbo,
      " sizeOfList:", len(aList),
      " noOfUniqueInList", len(set(aList)),
      " percUnique", int(percUnique) )
import time
from itertools import combinations
# itertools.combinations
# ---
# def uniqCmboYieldIter(lstItems, lenCmbo):
# class uniqCmboClassIter: def __init__(self, lstItems, lenCmbo):
# ---
start_time = time.time()
print("Combos:%9i"%len(list(combinations(aList, lenCmbo))), " ", end='')
duration = time.time() - start_time
print("print(len(list( combinations(aList, lenCmbo)))):", "{:9.5f}".format(duration), "seconds.")
start_time = time.time()
print("Combos:%9i"%len(list(uniqCmboYieldIter(aList, lenCmbo))), " ", end='')
duration = time.time() - start_time
print("print(len(list(uniqCmboYieldIter(aList, lenCmbo)))):", "{:9.5f}".format(duration), "seconds.")
start_time = time.time()
print("Combos:%9i"%len(list(uniqCmboClassIter(aList, lenCmbo))), " ", end='')
duration = time.time() - start_time
print("print(len(list(uniqCmboClassIter(aList, lenCmbo)))):", "{:9.5f}".format(duration), "seconds.")
The timings on my box:
>python3.6 -u "nonRecursiveUniqueCombos_Cg.py"
======================== lenCmbo: 6 sizeOfList: 48 noOfUniqueInList 32 percUnique 66
Combos: 12271512 print(len(list( combinations(aList, lenCmbo)))): 2.04635 seconds.
Combos: 1296058 print(len(list(uniqCmboYieldIter(aList, lenCmbo)))): 3.25447 seconds.
Combos: 1296058 print(len(list(uniqCmboClassIter(aList, lenCmbo)))): 5.97371 seconds.
>Exit code: 0
>python3.6 -u "nonRecursiveUniqueCombos_Cg.py"
======================== lenCmbo: 6 sizeOfList: 48 noOfUniqueInList 22 percUnique 45
Combos: 12271512 print(len(list( combinations(aList, lenCmbo)))): 2.05199 seconds.
Combos: 191072 print(len(list(uniqCmboYieldIter(aList, lenCmbo)))): 0.47343 seconds.
Combos: 191072 print(len(list(uniqCmboClassIter(aList, lenCmbo)))): 0.89860 seconds.
>Exit code: 0
>python3.6 -u "nonRecursiveUniqueCombos_Cg.py"
======================== lenCmbo: 6 sizeOfList: 48 noOfUniqueInList 43 percUnique 89
Combos: 12271512 print(len(list( combinations(aList, lenCmbo)))): 2.17285 seconds.
Combos: 6560701 print(len(list(uniqCmboYieldIter(aList, lenCmbo)))): 16.72573 seconds.
Combos: 6560701 print(len(list(uniqCmboClassIter(aList, lenCmbo)))): 31.17714 seconds.
>Exit code: 0
UPDATE (status 2017-05-07):
At the time of asking the question and offering a bounty it was not known to me that there is a way to easily create C code of an extension module for an iterator object out of Python script code using Cython, and that such C code can be created also from an iterator function using yield.
Considering that the faster generated version of the C extension module was still not fast enough to compete with itertools.combinations, it made more sense to gain deep insight into what exactly causes the slowdown of using an iterator class compared to an iterator function, and how to overcome it, than to find a way to speed up the faster version using Cython. This is especially so because I am a total novice in writing Python extension modules and, after hours of focused work, failed to create working code by adapting the existing C code of itertools.combinations with my own modifications, due to Segmentation Fault errors whose cause I was unable to grasp.
Currently I think that there is still room to speed up the Cython code I use, without needing the much harder path of writing the C code myself.
Below is the Cython code that runs fine, and a speed-optimized Cython code which somehow changes the way the algorithm works (I currently can't see why) and therefore produces wrong results. The idea behind the Cython optimization was to use Python/Cython arrays in the Cython code instead of Python lists. Any hints on how to get a faster-running Python extension module out of the used algorithm in a novice-"safe" way are welcome.
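For background on the list-vs-array switch tried below: an array.array stores plain C values in one contiguous buffer, instead of references to individual Python int objects, which is what lets Cython read and write the elements without Python-object overhead. A minimal sketch:

```python
from array import array

# typecode 'i' means signed C int; the items are stored unboxed in a
# contiguous buffer, unlike a list of Python int objects
multiplicities = array('i', [0] * 5)
multiplicities[2] += 3
assert list(multiplicities) == [0, 0, 3, 0, 0]
assert multiplicities.typecode == 'i'
```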
def subbags_by_loops_with_dict_counter(lstItems, int lenCmbo):
    dctCounter = {}
    cdef int lenLstItems = len(lstItems)
    cdef int idx = 0
    for idx in range(lenLstItems):
        item = lstItems[idx]
        if item in dctCounter.keys():
            dctCounter[item] += 1
        else:
            dctCounter[item] = 1
        #:if
    #:for
    lstUniqs = sorted(dctCounter.keys())
    lstCntRpts = [dctCounter[item] for item in lstUniqs]
    cdef int lenUniqs = len(lstUniqs)
    cmboAsIdxUniqs = [None] * lenCmbo
    multiplicities = [0] * lenUniqs
    cdef int idxIntoCmbo = 0   # initialized: read in the while condition below
    cdef int idxIntoUniqs = 0  # initialized: read in the while condition below
    cdef int count
    while idxIntoCmbo != lenCmbo and idxIntoUniqs != lenUniqs:
        count = min(lstCntRpts[idxIntoUniqs], lenCmbo-idxIntoCmbo)
        cmboAsIdxUniqs[idxIntoCmbo : idxIntoCmbo + count] = [idxIntoUniqs] * count
        multiplicities[idxIntoUniqs] = count
        idxIntoCmbo += count
        idxIntoUniqs += 1
    if idxIntoCmbo != lenCmbo:
        return
    cdef int x
    cdef int y
    while True:
        yield tuple(lstUniqs[idxUniqs] for idxUniqs in cmboAsIdxUniqs)
        for idxIntoCmbo in reversed(range(lenCmbo)):
            x = cmboAsIdxUniqs[idxIntoCmbo]
            y = x + 1
            if y < lenUniqs and multiplicities[y] < lstCntRpts[y]:
                break
        else:
            return
        for idxIntoCmbo in range(idxIntoCmbo, lenCmbo):
            x = cmboAsIdxUniqs[idxIntoCmbo]
            cmboAsIdxUniqs[idxIntoCmbo] = y
            multiplicities[x] -= 1
            multiplicities[y] += 1
            while y != lenUniqs and multiplicities[y] == lstCntRpts[y]:
                y += 1
            if y == lenUniqs:
                break
The following OPTIMIZED CYTHON CODE produces wrong results:
from cpython cimport array  # required for the cdef array.array declarations below
import array

def subbags_loops_dict_cython_optimized(lstItems, int lenCmbo):
    dctCounter = {}
    cdef int lenLstItems = len(lstItems)
    cdef int idx = 0
    for idx in range(lenLstItems):
        item = lstItems[idx]
        if item in dctCounter.keys():
            dctCounter[item] += 1
        else:
            dctCounter[item] = 1
        #:if
    #:for
    lstUniqs = sorted(dctCounter.keys())
    lstCntRpts = [dctCounter[item] for item in lstUniqs]
    cdef int lenUniqs = len(lstUniqs)
    cdef array.array cmboAsIdxUniqs = array.array('i', [])
    array.resize(cmboAsIdxUniqs, lenCmbo)
    # cmboAsIdxUniqs = [None] * lenCmbo
    cdef array.array multiplicities = array.array('i', [])
    array.resize(multiplicities, lenUniqs)
    # multiplicities = [0] * lenUniqs
    cdef int idxIntoCmbo
    cdef int maxIdxCmbo
    cdef int curIdxCmbo
    cdef int idxIntoUniqs
    cdef int count
    while idxIntoCmbo != lenCmbo and idxIntoUniqs != lenUniqs:
        count = min(lstCntRpts[idxIntoUniqs], lenCmbo-idxIntoCmbo)
        maxIdxCmbo = idxIntoCmbo + count
        curIdxCmbo = idxIntoCmbo
        while curIdxCmbo < maxIdxCmbo:
            cmboAsIdxUniqs[curIdxCmbo] = idxIntoUniqs
            curIdxCmbo += 1
        multiplicities[idxIntoUniqs] = count
        idxIntoCmbo += count
        idxIntoUniqs += 1
    # print("multiplicities:", multiplicities)
    # print("cmboAsIdxUniqs:", cmboAsIdxUniqs)
    if idxIntoCmbo != lenCmbo:
        return
    cdef int x
    cdef int y
    while True:
        yield tuple(lstUniqs[idxUniqs] for idxUniqs in cmboAsIdxUniqs)
        for idxIntoCmbo in reversed(range(lenCmbo)):
            x = cmboAsIdxUniqs[idxIntoCmbo]
            y = x + 1
            if y < lenUniqs and multiplicities[y] < lstCntRpts[y]:
                break
        else:
            return
        for idxIntoCmbo in range(idxIntoCmbo, lenCmbo):
            x = cmboAsIdxUniqs[idxIntoCmbo]
            cmboAsIdxUniqs[idxIntoCmbo] = y
            multiplicities[x] -= 1
            multiplicities[y] += 1
            # print("# multiplicities:", multiplicities)
            while y != lenUniqs and multiplicities[y] == lstCntRpts[y]:
                y += 1
            if y == lenUniqs:
                break
The class with __next__ version is the one suitable to be implemented as a Python extension module, because there is no equivalent of yield in C, so it makes sense to find out how it could be improved in order to perform comparably to the function-with-yield variant.
It is already written in C. The performance difference you see is entirely due to properties of the Python implementation that will not apply to the C extension module you plan to write. The optimizations you could apply to the Python class do not carry over to C code.
For example, accessing an instance variable is more expensive in Python code than accessing a local variable, because accessing an instance variable requires several dict lookups. Your C implementation would not need such dict lookups.
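To illustrate with a hypothetical micro-benchmark (exact numbers vary by machine, only the ordering matters): caching the instance attribute in a local variable avoids the repeated lookups:

```python
import timeit

class Acc:
    def __init__(self):
        self.val = 0
    def bump_attr(self, n):
        for _ in range(n):
            self.val += 1        # attribute lookup on every iteration
    def bump_local(self, n):
        val = self.val           # one lookup, then a cheap local variable
        for _ in range(n):
            val += 1
        self.val = val

a, b = Acc(), Acc()
a.bump_attr(100000)
b.bump_local(100000)
assert a.val == b.val == 100000  # same result either way
t_attr = timeit.timeit(lambda: Acc().bump_attr(100000), number=10)
t_local = timeit.timeit(lambda: Acc().bump_local(100000), number=10)
# t_local is typically noticeably smaller than t_attr
```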
When you write a generator function with yield, the overhead of saving off and restoring state is handled by the CPython internals (implemented in C). With __iter__/__next__, you have to manage saving and restoring state yourself on every call. In CPython, Python-level code is slower than C-level builtin code, so the extra Python-level code involved in state management (including things as simple as accessing attributes of self through dict lookups rather than loading local variables, which have only array-indexing overhead) ends up costing you a lot.
If you implement your own iterator-protocol-supporting type in a C extension module, you bypass this overhead; saving and restoring state should be a matter of a few C-level variable accesses (overhead similar to, or less than, what Python generator functions incur, which is to say, very little). Effectively, that is what a generator function is: a C extension type that saves and restores a Python frame on every call to tp_iternext (the equivalent of __next__).
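That frame preservation can actually be observed from Python (gi_frame is a CPython implementation detail, used here purely for illustration):

```python
def counter():
    i = 0
    while True:
        yield i
        i += 1  # the local 'i' lives on in the suspended frame between calls

g = counter()
assert next(g) == 0
assert next(g) == 1
# the generator's suspended frame still holds its local variable:
assert g.gi_frame.f_locals['i'] == 1
```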
I gained some experience when I rewrote some of the recipes from the itertools documentation as C extensions, so I think I may have some insights that could help you.
Generator function vs. iterator class.
When you write pure Python code, it is a trade-off between speed (generator) and features (iterator class).
yield-functions (known as generators) are for speed, and usually they can be written without worrying about internal state. So they are less effort to write, and fast, because Python just manages all the "state".
The reason generators are faster (or at least not slower) is mostly because:
They implement the __next__-slot (usually tp_iternext) directly, in addition to the __next__ method. In that case Python does not have to look up the __next__ method - which is essentially what makes it faster in the following example:
from itertools import islice
def test():
    while True:
        yield 1

class Test(object):
    def __iter__(self):
        return self
    def __next__(self):
        return 1
%timeit list(islice(test(), 1000))
# 173 µs ± 2.15 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit list(islice(Test(), 1000))
# 499 µs ± 14.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
So it is almost 3 times faster, because the generator directly fills the __next__-slot.
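The slot-based lookup can be demonstrated directly: next() dispatches through the type's tp_iternext slot, so an instance-level __next__ attribute is ignored entirely:

```python
class Test(object):
    def __iter__(self):
        return self
    def __next__(self):
        return 1

t = Test()
t.__next__ = lambda: 2    # shadow on the instance; never consulted by next()
assert next(t) == 1       # still dispatches through the type's slot
assert t.__next__() == 2  # only an explicit attribute access sees the shadow
```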
A yield-function and a class both have state, but a yield-function saves and loads its state much faster than a class with attribute access does:
def test():
    i = 0
    while True:
        yield i
        i += 1

class Test(object):
    def __init__(self):
        self.val = 0
    def __iter__(self):
        return self
    def __next__(self):
        current = self.val
        self.val += 1
        return current
%timeit list(islice(test(), 1000))
# 296 µs ± 1.73 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit list(islice(Test(), 1000))
# 1.22 ms ± 3.12 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
This time the class is 4 times slower (compared to almost 3 times when no state was involved). It is a cumulative effect: the more "state" you have, the slower the class variant becomes.
So much for yield vs. class. Note that the actual timings will depend on the kind of operations performed. For example, if the actual code that runs when next is called is slow (i.e. time.sleep(1)), then there is almost no difference between a generator and a class!
Cython
If you want a Cython iterator class that is fast, it has to be a cdef class. Otherwise you do not get a really fast class. The reason is that only a cdef class creates an extension type that directly implements the tp_iternext field! I will use IPython's %%cython to compile the code (so I do not have to include a setup file):
%%cython
def test():
    while True:
        yield 1

class Test(object):
    def __iter__(self):
        return self
    def __next__(self):
        return 1

cdef class Test_cdef(object):
    def __iter__(self):
        return self
    def __next__(self):
        return 1
%timeit list(islice(test(), 1000))
# 113 µs ± 4.5 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit list(islice(Test(), 1000))
# 407 µs ± 16.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit list(islice(Test_cdef(), 1000))
# 62.8 µs ± 2.46 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
The timings already show that the generator and the basic class are faster than their pure-Python equivalents, but their relative performance stayed roughly the same. However, the cdef class variant beats both of them, and that is mainly because the tp_iternext slot is used instead of just implementing the __next__ method. (Check the Cython-generated C code if you do not believe me :))
However, it is only 2 times faster than the Python generator, which is not bad but also not exactly overwhelming. To get really amazing speedups, you need to find a way to express your program without Python objects (the fewer Python objects, the bigger the speedup). For example, if you use a dict to store the items and their multiplicities, you still store Python objects, and any lookup has to be done with Python dict methods - even if you can call them via C API functions instead of having to look up the real methods:
%%cython
cpdef cython_count(items):
    cdef dict res = dict()
    for item in items:
        if item in res:
            res[item] += 1
        else:
            res[item] = 1
    return res
import random
def count(items):
    res = {}
    for item in items:
        if item in res:
            res[item] += 1
        else:
            res[item] = 1
    return res
l = [random.randint(0, 100) for _ in range(10000)]
%timeit cython_count(l)
# 2.06 ms ± 13 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit count(l)
# 3.63 ms ± 21.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
There is a catch here: you are not using collections.Counter, which has optimized C code (at least in python-3) for this kind of operation:
from collections import Counter
%timeit Counter(l)
# 1.17 ms ± 41.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
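As far as correctness goes, Counter is a drop-in replacement here; a quick self-contained check (count as defined above):

```python
from collections import Counter
import random

def count(items):
    # the hand-written counting loop from above
    res = {}
    for item in items:
        if item in res:
            res[item] += 1
        else:
            res[item] = 1
    return res

l = [random.randint(0, 100) for _ in range(10000)]
assert count(l) == dict(Counter(l))  # identical counts, just slower
```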
A short note here: don't use something in some_dict.keys(), because keys() is list-like in Python2 and implements only O(n) contains operations, while something in some_dict is typically O(1) (both Pythons)! Using .keys() makes both versions slower, especially on Python2:
def count2(items):
    res = {}
    for item in items:
        if item in res.keys():  # with "keys()"
            res[item] += 1
        else:
            res[item] = 1
    return res
# Python3
l = [random.randint(0, 100) for _ in range(10000)]
%timeit count(l)
# 3.63 ms ± 29 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit count2(l)
# 5.9 ms ± 20 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# Python2
l = [random.randint(0, 10000) for _ in range(10000)]
%timeit count(l)
# 100 loops, best of 3: 4.59 ms per loop
%timeit count2(l)
# 1 loop, best of 3: 2.65 s per loop <--- WHOOPS!!!
This shows that you can only hope for roughly a 3-4x speedup with Cython (and C extensions) as long as you use Python structures, and that even small mistakes like using ".keys()" can cost you much more in terms of performance if used incorrectly.
Optimizing Cython
What can you do if you want it faster? The answer is relatively easy: create your own data structure based on C types instead of Python types.
That means you have to think about the design:
- Which types do you want to support in uniqComb**? Do you want only integers (the examples say so, but I suppose you want arbitrary Python objects)?
- Do you want introspection from Python (like the current state)? If you want it, it makes sense to keep the multiplicities as Python objects, but if you do not care, you can save them as integer-like objects instead of Python objects.
- Do the objects passed to the uniqComb** functions need to be sortable? You used sorted, but you could also use an OrderedDict and keep the keys in order of appearance instead of by numerical value.
The answers to these questions (these are only the questions I immediately asked myself; there are probably many more!) can help you decide which structure you can use internally. For example, with Cython you can interface to C++ and could use a map containing integer keys and integer values instead of a dict. It is sorted by default, so you do not need to sort them manually yourself, and you would operate on native integers instead of Python objects. But you lose the ability to process arbitrary Python objects in uniqComb, and you need to know how to operate with C++ types in Cython. It could be amazingly fast, though!
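For completeness, the OrderedDict alternative mentioned above would look roughly like this; counting in first-appearance order removes the requirement that the items be sortable:

```python
from collections import OrderedDict

def count_in_order(items):
    # keys are kept in order of first appearance instead of sorted order,
    # so the items no longer need to support comparison
    res = OrderedDict()
    for item in items:
        res[item] = res.get(item, 0) + 1
    return res

counts = count_in_order(['b', 'a', 'b', 'c'])
assert list(counts) == ['b', 'a', 'c']  # first-appearance order preserved
assert counts['b'] == 2
```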
I did not take that path, because I assume you want to support arbitrary orderable Python types, so I stick with the Counter as a starting point, but I will save the multiplicities as integer array.arrays instead of lists. Let's call it the "least invasive" optimization. It actually does not matter much in terms of performance whether you use a list or an array for lstCntRpts and multiplicities, because they are not the bottleneck - but it is a bit faster, saves a bit of memory and, more importantly, it shows how homogeneous arrays can be included with cython:
%%cython
from cpython.list cimport PyList_Size  # (most) C API functions can be used with cython!
from array import array
from collections import Counter

cdef class uniqCmboClassIter:
    cdef list lstUniqs
    cdef Py_ssize_t lenUniqs
    cdef int[:] lstCntRpts  # memoryview
    cdef Py_ssize_t lenCmbo
    cdef list cmboAsIdxUniqs
    cdef int[:] multiplicities  # memoryview
    cdef Py_ssize_t idxIntoCmbo
    cdef Py_ssize_t idxIntoUniqs
    cdef bint stopIteration
    cdef Py_ssize_t x
    cdef Py_ssize_t y

    def __init__(self, lstItems, lenCmbo):
        dctCounter = Counter(lstItems)
        self.lstUniqs = sorted(dctCounter)
        self.lenUniqs = PyList_Size(self.lstUniqs)
        self.lstCntRpts = array('i', [dctCounter[item] for item in self.lstUniqs])
        self.lenCmbo = lenCmbo
        self.cmboAsIdxUniqs = [None] * lenCmbo
        self.multiplicities = array('i', [0] * self.lenUniqs)
        self.idxIntoCmbo, self.idxIntoUniqs = 0, 0
        while self.idxIntoCmbo != self.lenCmbo and self.idxIntoUniqs != self.lenUniqs:
            count = min(self.lstCntRpts[self.idxIntoUniqs], self.lenCmbo-self.idxIntoCmbo)
            self.cmboAsIdxUniqs[self.idxIntoCmbo : self.idxIntoCmbo + count] = [self.idxIntoUniqs] * count
            self.multiplicities[self.idxIntoUniqs] = count
            self.idxIntoCmbo += count
            self.idxIntoUniqs += 1
        # print("self.multiplicities:", self.multiplicities)
        # print("self.cmboAsIdxUniqs:", self.cmboAsIdxUniqs)
        if self.idxIntoCmbo != self.lenCmbo:
            return
        self.stopIteration = False
        self.x = 0
        self.y = 0
        return

    def __iter__(self):
        return self

    def __next__(self):
        if self.stopIteration is True:
            raise StopIteration
        nextCmbo = tuple(self.lstUniqs[idxUniqs] for idxUniqs in self.cmboAsIdxUniqs)
        for self.idxIntoCmbo in reversed(range(self.lenCmbo)):
            self.x = self.cmboAsIdxUniqs[self.idxIntoCmbo]
            self.y = self.x + 1
            if self.y < self.lenUniqs and self.multiplicities[self.y] < self.lstCntRpts[self.y]:
                break
        else:
            self.stopIteration = True
            return nextCmbo
        for self.idxIntoCmbo in range(self.idxIntoCmbo, self.lenCmbo):
            self.x = self.cmboAsIdxUniqs[self.idxIntoCmbo]
            self.cmboAsIdxUniqs[self.idxIntoCmbo] = self.y
            self.multiplicities[self.x] -= 1
            self.multiplicities[self.y] += 1
            # print("# multiplicities:", multiplicities)
            while self.y != self.lenUniqs and self.multiplicities[self.y] == self.lstCntRpts[self.y]:
                self.y += 1
            if self.y == self.lenUniqs:
                break
        return nextCmbo
You did not actually share your timing parameters, but I tried it with some of mine:
from itertools import combinations
import random
import time

def create_values(maximum):
    vals = [random.randint(0, maximum) for _ in range(48)]
    print('length: ', len(vals))
    print('sorted values: ', sorted(vals))
    print('uniques: ', len(set(vals)))
    print('uniques in percent: {:%}'.format(len(set(vals)) / len(vals)))
    return vals

class Timer(object):
    def __init__(self):
        pass
    def __enter__(self):
        self._time = time.time()
    def __exit__(self, *args, **kwargs):
        print(time.time() - self._time)

vals = create_values(maximum=50)  # and 22 and 75 and 120
n = 6

with Timer():
    list(combinations(vals, n))
with Timer():
    list(uniqCmboClassIter(vals, n))
with Timer():
    list(uniqCmboClassIterOriginal(vals, n))
with Timer():
    list(uniqCmboYieldIterOriginal(vals, n))
length: 48
sorted values: [0, 0, 0, 1, 2, 2, 4, 5, 5, 6, 6, 6, 7, 7, 7, 8, 8, 8, 8, 9, 9, 10, 11, 11, 12, 12, 12, 13, 13, 14, 14, 14, 15, 15, 15, 17, 18, 19, 19, 19, 19, 20, 20, 20, 21, 21, 22, 22]
uniques: 21
uniques in percent: 43.750000%
6.250450611114502
0.4217393398284912
4.250436305999756
2.7186365127563477
length: 48
sorted values: [1, 1, 2, 5, 6, 7, 7, 8, 8, 9, 11, 13, 13, 15, 16, 16, 16, 16, 17, 19, 19, 21, 21, 23, 24, 26, 27, 28, 28, 29, 31, 31, 34, 34, 36, 36, 38, 39, 39, 40, 41, 42, 44, 46, 47, 47, 49, 50]
uniques: 33
uniques in percent: 68.750000%
6.2034173011779785
4.343803882598877
42.39261245727539
26.65750527381897
length: 48
sorted values: [4, 4, 7, 9, 10, 14, 14, 17, 19, 21, 23, 24, 24, 26, 34, 36, 40, 42, 43, 43, 45, 46, 46, 52, 53, 58, 59, 59, 61, 63, 66, 68, 71, 72, 72, 75, 76, 80, 82, 82, 83, 84, 86, 86, 89, 92, 97, 99]
uniques: 39
uniques in percent: 81.250000%
6.859697341918945
10.437987327575684
104.12988543510437
65.25306582450867
length: 48
sorted values: [4, 7, 11, 19, 24, 29, 32, 36, 49, 49, 54, 57, 58, 60, 62, 65, 67, 70, 70, 72, 72, 79, 82, 83, 86, 89, 89, 90, 91, 94, 96, 99, 102, 111, 112, 118, 120, 120, 128, 129, 129, 134, 138, 141, 141, 144, 146, 147]
uniques: 41
uniques in percent: 85.416667%
6.484673023223877
13.610010623931885
136.28764533996582
84.73834943771362
It definitely performs better than the original approach; actually just the type declarations alone make it several times faster. There is a lot more that could be optimized (disable bounds checking, use Python C API function calls, use unsigned or smaller integers if you know the "maximum" and "minimum" of your multiplicities, ...) - but the fact that it is not much slower than itertools.combinations even for 80% unique items, and much faster than either original implementation, is good enough for me. :-)
UPDATE(反映最先进的知识水平)状态:2017-05-12
进行此更新的原因是,在我问这个问题时,我并不知道我发现了一些关于 Python3 如何工作的东西 "under the hood"。
所有接下来的结论是:
If you write own Python3 code for an iterator and care about speed of execution you should write it as a generator function and not as an iterator class.
下面是一个简单的代码示例,演示了相同的算法 (此处:Pythons range()
的自制版本) 表示为生成函数运行s 比表示为迭代器 class 快得多:
def gnrtYieldRange(startWith, endAt, step=1):
while startWith <= endAt:
yield startWith
startWith += step
class iterClassRange:
def __init__(self, startWith, endAt, step=1):
self.startWith = startWith - 1
self.endAt = endAt
self.step = step
def __iter__(self):
return self
def __next__(self):
self.startWith += self.step
if self.startWith <= self.endAt:
return self.startWith
else:
raise StopIteration
N = 10000000
print(" Size of created list N = {} elements (ints 1 to N)".format(N))
from time import time as t
from customRange import gnrtYieldRange as cthnYieldRange
from customRange import cintYieldRange
from customRange import iterClassRange as cthnClassRange
from customRange import cdefClassRange
iterPythnRangeObj = range(1, N+1)
gnrtYieldRangeObj = gnrtYieldRange(1, N)
cthnYieldRangeObj = cthnYieldRange(1, N)
cintYieldRangeObj = cintYieldRange(1, N)
iterClassRangeObj = iterClassRange(1, N)
cthnClassRangeObj = cthnClassRange(1, N)
cdefClassRangeObj = cdefClassRange(1, N)
sEXECs = [
"liPR = list(iterPythnRangeObj)",
"lgYR = list(gnrtYieldRangeObj)",
"lcYR = list(cthnYieldRangeObj)",
"liGR = list(cintYieldRangeObj)",
"liCR = list(iterClassRangeObj)",
"lcCR = list(cthnClassRangeObj)",
"ldCR = list(cdefClassRangeObj)"
]
sCOMMENTs = [
"Python3 own range(1, N+1) used here as reference for timings ",
"self-made range generator function using yield (run as it is) ",
"self-made range (with yield) run from module created by Cython",
"Cython-optimized self-made range (using yield) run from module",
"self-made range as iterator class using __next__() and return ",
"self-made range (using __next__) from module created by Cython",
"Cython-optimized self-made range (using __next__) from module "
]
for idx, sEXEC in enumerate(sEXECs):
s=t();exec(sEXEC);e=t();print("{} takes: {:3.1f} sec.".format(sCOMMENTs[idx], e-s))
print("All created lists are equal:", all([liPR == lgYR, lgYR == lcYR, lcYR == liGR, liGR == liCR, liCR == lcCR, lcCR == ldCR]) )
print("Run on Linux Mint 18.1, used Cython.__version__ == '0.25.2'")
上面的代码放入一个文件,运行ned 打印到标准输出:
>python3.6 -u "gnrtFunction-fasterThan-iterClass_runMe.py"
Size of created list N = 10000000 elements (ints 1 to N)
Python3 own range(1, N+1) used here as reference for timings takes: 0.2 sec.
self-made range generator function using yield (run as it is) takes: 1.1 sec.
self-made range (with yield) run from module created by Cython takes: 0.5 sec.
Cython-optimized self-made range (using yield) run from module takes: 0.3 sec.
self-made range as iterator class using __next__() and return takes: 3.9 sec.
self-made range (using __next__) from module created by Cython takes: 3.3 sec.
Cython-optimized self-made range (using __next__) from module takes: 0.2 sec.
All created lists are equal: True
Run on Linux Mint 18.1, used Cython.__version__ == '0.25.2'
>Exit code: 0
从上面的时间可以看出,自制range()
迭代器运行的生成器函数变体比迭代器class变体快,并且没有优化代码涉及此行为也会传播到 Cython 创建的 C 代码的 C 代码级别。
如果您好奇为什么会这样,您可以阅读提供的答案或自己玩一下提供的代码。
下面缺少 运行 上面代码所必需的代码片段:
customRange.pyx
- 文件 Cython 从以下位置创建 customRange
模块:
def gnrtYieldRange(startWith, endAt, step=1):
while startWith <= endAt:
yield startWith
startWith += step
class iterClassRange:
def __init__(self, startWith, endAt, step=1):
self.startWith = startWith - 1
self.endAt = endAt
self.step = step
def __iter__(self):
return self
def __next__(self):
self.startWith += self.step
if self.startWith <= self.endAt:
return self.startWith
else:
raise StopIteration
def cintYieldRange(int startWith, int endAt, int step=1):
while startWith <= endAt:
yield startWith
startWith += step
cdef class cdefClassRange:
cdef int startWith
cdef int endAt
cdef int step
def __init__(self, int startWith, int endAt, int step=1):
self.startWith = startWith - 1
self.endAt = endAt
self.step = step
def __iter__(self):
return self
def __next__(self):
self.startWith += self.step
if self.startWith <= self.endAt:
return self.startWith
else:
raise StopIteration
和用于创建 Python customRange
模块的安装文件 customRange-setup.py
:
import sys
sys.argv += ['build_ext', '--inplace']
from distutils.core import setup
from Cython.Build import cythonize
setup(
name = 'customRange',
ext_modules = cythonize("customRange.pyx"),
)
现在一些进一步的信息可以更容易地理解所提供的答案:
当时我问这个问题我正忙于一个相当复杂的问题
使用 yield
从以生成函数形式提供的非唯一列表生成唯一组合的算法。我的目标是使用此算法创建一个用 C 编写的 Python 模块,以使其 运行 更快。为此,我使用 __next__()
和 return
将使用 yield
的生成器函数重写为迭代器 class。当我比较算法的两种变体的速度时,令我惊讶的是迭代器 class 比生成器函数慢两倍,我曾(错误地)假设它有一些东西与我重写算法的方式有关(如果你想更好地理解这里的答案是什么,你需要知道这个)因此
Originally asked how to make the iterator class version run at the same speed as the generator function and where the speed difference comes from?.
下面是一些关于历史问题的更多信息:
在下面提供的 Python 脚本代码中,使用 Python function
和 yield
并使用 class
和 __next__
。代码在 copy/paste 之后准备好 运行,所以你可以自己看看我在说什么。
对于纯 Python 代码观察到的相同现象传播到 Python 扩展模块的 C 代码中,扩展模块由 Cython 创建,因此它不限于Python级代码,因为它不会在C代码级消失。
问题是:
Where does the huge difference in speed of execution come from? Is there anything that can be done to get both code variants to run at comparable speed? Is there something went wrong with the class/next implementation compared to the function/yield variant? Both are to my knowledge exactly the same code ...
这里的代码(调整突出显示的行中的数字会改变列表中元素的唯一性级别组合是从对 运行ning 时间有巨大影响的内容生成的):
def uniqCmboYieldIter(lstItems, lenCmbo):
dctCounter = {}
lenLstItems = len(lstItems)
for idx in range(lenLstItems):
item = lstItems[idx]
if item in dctCounter.keys():
dctCounter[item] += 1
else:
dctCounter[item] = 1
#:if
#:for
lstUniqs = sorted(dctCounter.keys())
lstCntRpts = [dctCounter[item] for item in lstUniqs]
lenUniqs = len(lstUniqs)
cmboAsIdxUniqs = [None] * lenCmbo
multiplicities = [0] * lenUniqs
idxIntoCmbo, idxIntoUniqs = 0, 0
while idxIntoCmbo != lenCmbo and idxIntoUniqs != lenUniqs:
count = min(lstCntRpts[idxIntoUniqs], lenCmbo-idxIntoCmbo)
cmboAsIdxUniqs[idxIntoCmbo : idxIntoCmbo + count] = [idxIntoUniqs] * count
multiplicities[idxIntoUniqs] = count
idxIntoCmbo += count
idxIntoUniqs += 1
if idxIntoCmbo != lenCmbo:
return
while True:
yield tuple(lstUniqs[idxUniqs] for idxUniqs in cmboAsIdxUniqs)
for idxIntoCmbo in reversed(range(lenCmbo)):
x = cmboAsIdxUniqs[idxIntoCmbo]
y = x + 1
if y < lenUniqs and multiplicities[y] < lstCntRpts[y]:
break
else:
return
for idxIntoCmbo in range(idxIntoCmbo, lenCmbo):
x = cmboAsIdxUniqs[idxIntoCmbo]
cmboAsIdxUniqs[idxIntoCmbo] = y
multiplicities[x] -= 1
multiplicities[y] += 1
# print("# multiplicities:", multiplicities)
while y != lenUniqs and multiplicities[y] == lstCntRpts[y]:
y += 1
if y == lenUniqs:
break
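For cross-checking either variant, a much slower but obviously correct reference can be built by deduplicating the output of itertools.combinations (a sketch; unique_combinations_reference is a made-up name, not part of the original post):

```python
from itertools import combinations

def unique_combinations_reference(items, k):
    # Sort first so equal multisets always yield identical tuples,
    # then deduplicate all k-combinations with a set.
    return sorted(set(combinations(sorted(items), k)))

print(unique_combinations_reference([1, 1, 2, 3], 2))
# [(1, 1), (1, 2), (1, 3), (2, 3)]
```

This is exactly the brute-force behavior that uniqCmboYieldIter reproduces without materializing all len(combinations) tuples first.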
class uniqCmboClassIter:
# ----------------------------------------------------------------------------------------------
def __iter__(self):
return self
# ----------------------------------------------------------------------------------------------
def __init__(self, lstItems, lenCmbo):
dctCounter = {}
lenLstItems = len(lstItems)
for idx in range(lenLstItems):
item = lstItems[idx]
if item in dctCounter.keys():
dctCounter[item] += 1
else:
dctCounter[item] = 1
#:if
#:for
self.lstUniqs = sorted(dctCounter.keys())
self.lenUniqs = len(self.lstUniqs)
self.lstCntRpts = [dctCounter[item] for item in self.lstUniqs]
self.lenCmbo = lenCmbo
self.cmboAsIdxUniqs = [None] * lenCmbo
self.multiplicities = [0] * self.lenUniqs
self.idxIntoCmbo, self.idxIntoUniqs = 0, 0
while self.idxIntoCmbo != self.lenCmbo and self.idxIntoUniqs != self.lenUniqs:
count = min(self.lstCntRpts[self.idxIntoUniqs], self.lenCmbo-self.idxIntoCmbo)
self.cmboAsIdxUniqs[self.idxIntoCmbo : self.idxIntoCmbo + count] = [self.idxIntoUniqs] * count
self.multiplicities[self.idxIntoUniqs] = count
self.idxIntoCmbo += count
self.idxIntoUniqs += 1
# print("self.multiplicities:", self.multiplicities)
# print("self.cmboAsIdxUniqs:", self.cmboAsIdxUniqs)
if self.idxIntoCmbo != self.lenCmbo:
return
self.stopIteration = False
self.x = None
self.y = None
return
# ----------------------------------------------------------------------------------------------
def __next__(self):
if self.stopIteration is True:
raise StopIteration
nextCmbo = tuple(self.lstUniqs[idxUniqs] for idxUniqs in self.cmboAsIdxUniqs)
for self.idxIntoCmbo in reversed(range(self.lenCmbo)):
self.x = self.cmboAsIdxUniqs[self.idxIntoCmbo]
self.y = self.x + 1
if self.y < self.lenUniqs and self.multiplicities[self.y] < self.lstCntRpts[self.y]:
break
else:
self.stopIteration = True
return nextCmbo
for self.idxIntoCmbo in range(self.idxIntoCmbo, self.lenCmbo):
self.x = self.cmboAsIdxUniqs[self.idxIntoCmbo]
self.cmboAsIdxUniqs[self.idxIntoCmbo] = self.y
self.multiplicities[self.x] -= 1
self.multiplicities[self.y] += 1
# print("# multiplicities:", multiplicities)
while self.y != self.lenUniqs and self.multiplicities[self.y] == self.lstCntRpts[self.y]:
self.y += 1
if self.y == self.lenUniqs:
break
return nextCmbo
# ============================================================================================================================================
lstSize = 48 # 48
uniqLevel = 12 # (7 ~60% unique) higher level => more unique items in the generated list
aList = []
from random import randint
for _ in range(lstSize):
aList.append( ( randint(1,uniqLevel), randint(1,uniqLevel) ) )
lenCmbo = 6
percUnique = 100.0 - 100.0*(lstSize-len(set(aList)))/lstSize
print("======================== lenCmbo:", lenCmbo,
" sizeOfList:", len(aList),
" noOfUniqueInList", len(set(aList)),
" percUnique", int(percUnique) )
import time
from itertools import combinations
# itertools.combinations
# ---
# def uniqCmboYieldIter(lstItems, lenCmbo):
# class uniqCmboClassIter: def __init__(self, lstItems, lenCmbo):
# ---
start_time = time.time()
print("Combos:%9i"%len(list(combinations(aList, lenCmbo))), " ", end='')
duration = time.time() - start_time
print("print(len(list( combinations(aList, lenCmbo)))):", "{:9.5f}".format(duration), "seconds.")
start_time = time.time()
print("Combos:%9i"%len(list(uniqCmboYieldIter(aList, lenCmbo))), " ", end='')
duration = time.time() - start_time
print("print(len(list(uniqCmboYieldIter(aList, lenCmbo)))):", "{:9.5f}".format(duration), "seconds.")
start_time = time.time()
print("Combos:%9i"%len(list(uniqCmboClassIter(aList, lenCmbo))), " ", end='')
duration = time.time() - start_time
print("print(len(list(uniqCmboClassIter(aList, lenCmbo)))):", "{:9.5f}".format(duration), "seconds.")
The timings on my box:
>python3.6 -u "nonRecursiveUniqueCombos_Cg.py"
======================== lenCmbo: 6 sizeOfList: 48 noOfUniqueInList 32 percUnique 66
Combos: 12271512 print(len(list( combinations(aList, lenCmbo)))): 2.04635 seconds.
Combos: 1296058 print(len(list(uniqCmboYieldIter(aList, lenCmbo)))): 3.25447 seconds.
Combos: 1296058 print(len(list(uniqCmboClassIter(aList, lenCmbo)))): 5.97371 seconds.
>Exit code: 0
>python3.6 -u "nonRecursiveUniqueCombos_Cg.py"
======================== lenCmbo: 6 sizeOfList: 48 noOfUniqueInList 22 percUnique 45
Combos: 12271512 print(len(list( combinations(aList, lenCmbo)))): 2.05199 seconds.
Combos: 191072 print(len(list(uniqCmboYieldIter(aList, lenCmbo)))): 0.47343 seconds.
Combos: 191072 print(len(list(uniqCmboClassIter(aList, lenCmbo)))): 0.89860 seconds.
>Exit code: 0
>python3.6 -u "nonRecursiveUniqueCombos_Cg.py"
======================== lenCmbo: 6 sizeOfList: 48 noOfUniqueInList 43 percUnique 89
Combos: 12271512 print(len(list( combinations(aList, lenCmbo)))): 2.17285 seconds.
Combos: 6560701 print(len(list(uniqCmboYieldIter(aList, lenCmbo)))): 16.72573 seconds.
Combos: 6560701 print(len(list(uniqCmboClassIter(aList, lenCmbo)))): 31.17714 seconds.
>Exit code: 0
UPDATE (status 2017-05-07):
At the time of asking the question and offering a bounty it was not known to me that there is a way to easily create C code of an extension module for an iterator object out of Python script code using Cython, and that such C code can also be created from an iterator function using yield.
Considering that the faster generated version of the C extension module is still not fast enough to compete with itertools.combinations, it makes more sense to gain deep insight into what exactly causes the slowdown of the iterator class compared to the iterator function and how to overcome it, and to find a way to speed up the faster version using Cython, especially because I am a total novice in writing Python extension modules and, after hours of concentrated work, was unable to create working code by adjusting the existing C code of itertools.combinations with my own modifications, due to Segmentation Fault errors whose cause I could not grasp.
Currently I think there is still room for speeding up the Cython code I use, so I don't need to go the much harder way of writing C code myself.
Below is the Cython code that runs fine, and the speed-optimized Cython code, which somehow changes (I currently can't see why) the way the algorithm works and therefore produces wrong results. The idea behind the Cython optimization was to use Python/Cython arrays instead of Python lists in the Cython code. Any hints on how to get a faster-running Python extension module out of the used algorithm in a novice-"safe" way are welcome.
def subbags_by_loops_with_dict_counter(lstItems, int lenCmbo):
dctCounter = {}
cdef int lenLstItems = len(lstItems)
cdef int idx = 0
for idx in range(lenLstItems):
item = lstItems[idx]
if item in dctCounter.keys():
dctCounter[item] += 1
else:
dctCounter[item] = 1
#:if
#:for
lstUniqs = sorted(dctCounter.keys())
lstCntRpts = [dctCounter[item] for item in lstUniqs]
cdef int lenUniqs = len(lstUniqs)
cmboAsIdxUniqs = [None] * lenCmbo
multiplicities = [0] * lenUniqs
cdef int idxIntoCmbo
cdef int idxIntoUniqs
cdef int count
while idxIntoCmbo != lenCmbo and idxIntoUniqs != lenUniqs:
count = min(lstCntRpts[idxIntoUniqs], lenCmbo-idxIntoCmbo)
cmboAsIdxUniqs[idxIntoCmbo : idxIntoCmbo + count] = [idxIntoUniqs] * count
multiplicities[idxIntoUniqs] = count
idxIntoCmbo += count
idxIntoUniqs += 1
if idxIntoCmbo != lenCmbo:
return
cdef int x
cdef int y
while True:
yield tuple(lstUniqs[idxUniqs] for idxUniqs in cmboAsIdxUniqs)
for idxIntoCmbo in reversed(range(lenCmbo)):
x = cmboAsIdxUniqs[idxIntoCmbo]
y = x + 1
if y < lenUniqs and multiplicities[y] < lstCntRpts[y]:
break
else:
return
for idxIntoCmbo in range(idxIntoCmbo, lenCmbo):
x = cmboAsIdxUniqs[idxIntoCmbo]
cmboAsIdxUniqs[idxIntoCmbo] = y
multiplicities[x] -= 1
multiplicities[y] += 1
while y != lenUniqs and multiplicities[y] == lstCntRpts[y]:
y += 1
if y == lenUniqs:
break
The following OPTIMIZED CYTHON CODE produces wrong results:
def subbags_loops_dict_cython_optimized(lstItems, int lenCmbo):
dctCounter = {}
cdef int lenLstItems = len(lstItems)
cdef int idx = 0
for idx in range(lenLstItems):
item = lstItems[idx]
if item in dctCounter.keys():
dctCounter[item] += 1
else:
dctCounter[item] = 1
#:if
#:for
lstUniqs = sorted(dctCounter.keys())
lstCntRpts = [dctCounter[item] for item in lstUniqs]
cdef int lenUniqs = len(lstUniqs)
cdef array.array cmboAsIdxUniqs = array.array('i', [])
array.resize(cmboAsIdxUniqs, lenCmbo)
# cmboAsIdxUniqs = [None] * lenCmbo
cdef array.array multiplicities = array.array('i', [])
array.resize(multiplicities, lenUniqs)
# multiplicities = [0] * lenUniqs
cdef int idxIntoCmbo
cdef int maxIdxCmbo
cdef int curIdxCmbo
cdef int idxIntoUniqs
cdef int count
while idxIntoCmbo != lenCmbo and idxIntoUniqs != lenUniqs:
count = min(lstCntRpts[idxIntoUniqs], lenCmbo-idxIntoCmbo)
maxIdxCmbo = idxIntoCmbo + count
curIdxCmbo = idxIntoCmbo
while curIdxCmbo < maxIdxCmbo:
cmboAsIdxUniqs[curIdxCmbo] = idxIntoUniqs
curIdxCmbo += 1
multiplicities[idxIntoUniqs] = count
idxIntoCmbo += count
idxIntoUniqs += 1
# print("multiplicities:", multiplicities)
# print("cmboAsIdxUniqs:", cmboAsIdxUniqs)
if idxIntoCmbo != lenCmbo:
return
cdef int x
cdef int y
while True:
yield tuple(lstUniqs[idxUniqs] for idxUniqs in cmboAsIdxUniqs)
for idxIntoCmbo in reversed(range(lenCmbo)):
x = cmboAsIdxUniqs[idxIntoCmbo]
y = x + 1
if y < lenUniqs and multiplicities[y] < lstCntRpts[y]:
break
else:
return
for idxIntoCmbo in range(idxIntoCmbo, lenCmbo):
x = cmboAsIdxUniqs[idxIntoCmbo]
cmboAsIdxUniqs[idxIntoCmbo] = y
multiplicities[x] -= 1
multiplicities[y] += 1
# print("# multiplicities:", multiplicities)
while y != lenUniqs and multiplicities[y] == lstCntRpts[y]:
y += 1
if y == lenUniqs:
break
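One plausible cause of the wrong results (an editorial guess, not part of the original post): resizing an empty array.array enlarges the buffer without guaranteeing zeroed contents, whereas the list version starts from [0] * lenUniqs, so multiplicities slots that are read before ever being written may contain garbage. On the pure-Python side, the safe initialization simply builds the array from an explicit zero-filled list:

```python
from array import array

lenUniqs = 8  # example size; in the post this is len(lstUniqs)

# Well-defined: every slot starts at 0, matching the list-based version.
multiplicities = array('i', [0] * lenUniqs)
print(list(multiplicities))
# [0, 0, 0, 0, 0, 0, 0, 0]
```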
The class-with-__next__ version is the one suitable to be implemented as a Python extension module, because there is no equivalent of yield in C, so it makes sense to find out how it could be improved in order to perform comparably to the function-with-yield variant.
It is already written in C. The performance difference you are seeing is entirely due to properties of the Python implementation that will not apply to the C extension module you plan to write. Optimizations you could apply to your Python class do not carry over to the C code.
For example, in Python code accessing an instance variable is more expensive than accessing a local variable, because accessing an instance variable requires multiple dictionary lookups. Your C implementation would not need such dictionary lookups.
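A quick way to see the cost of instance-attribute access is to hoist the attribute into a local variable inside a hot loop. The class below is a made-up illustration (not from the original post); both methods compute the same result, but the local-variable version avoids the per-iteration attribute lookup:

```python
import timeit

class Acc:
    def __init__(self):
        self.total = 0

    def add_attr(self, n):
        # touches self.total on every iteration (attribute lookup each time)
        for i in range(n):
            self.total += i

    def add_local(self, n):
        # hoist the attribute into a local once, write back once
        total = self.total
        for i in range(n):
            total += i
        self.total = total

a, b = Acc(), Acc()
a.add_attr(10000)
b.add_local(10000)
assert a.total == b.total  # identical results, different cost
print(timeit.timeit(lambda: Acc().add_attr(10000), number=100))
print(timeit.timeit(lambda: Acc().add_local(10000), number=100))
```

On CPython the second timing is typically noticeably smaller; exact numbers depend on the interpreter and machine.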
When you write a generator function using yield, the overhead of saving and restoring state is handled by CPython internals (implemented in C). With __iter__/__next__, you have to manage saving and restoring state on every single call. In CPython, Python-level code is slower than C-level built-in code, so the extra Python-level code involved in state management (including things as simple as accessing attributes of self via dict lookup instead of loading a local variable with only array-indexing overhead) ends up costing you a lot.
If you implement your own iterator-protocol-supporting type in a C extension module, you bypass this overhead; saving and restoring state should be a matter of a few C-level variable accesses (with overhead comparable to, or less than, what a Python generator function incurs, which is to say: very little). Effectively, that is what a generator function is: a C extension type that saves and restores a Python frame on every call to tp_iternext (the equivalent of __next__).
I gained some experience when I rewrote some of the recipes from the itertools documentation as C extensions, so I think I may have some insights that could help you.
Generator function vs. iterator class
When you write pure Python code, it's a trade-off between speed (generator) and features (iterator class).
yield-functions (called generators) are meant for speed, and usually they can be written without worrying about internal state. So they are less effort to write, and fast, because Python just manages all the "state".
The reasons why generators are faster (or at least not slower) are mainly:
- They directly implement the __next__-slot (typically tp_iternext) in addition to the __next__-method. In that case Python doesn't have to look up the __next__ method, which is essentially what makes it faster in the following example:

from itertools import islice

def test():
    while True:
        yield 1

class Test(object):
    def __iter__(self):
        return self

    def __next__(self):
        return 1

%timeit list(islice(test(), 1000))
# 173 µs ± 2.15 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit list(islice(Test(), 1000))
# 499 µs ± 14.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
So it is almost 3 times faster because the generator fills the __next__-slot directly.
- A yield-function and a class both keep state, but the yield-function saves and loads its state much faster than a class using attribute access:

def test():
    i = 0
    while True:
        yield i
        i += 1

class Test(object):
    def __init__(self):
        self.val = 0

    def __iter__(self):
        return self

    def __next__(self):
        current = self.val
        self.val += 1
        return current

%timeit list(islice(test(), 1000))
# 296 µs ± 1.73 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit list(islice(Test(), 1000))
# 1.22 ms ± 3.12 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
This time the class is already 4 times slower (compared to almost 3 times when no state was involved). This is a cumulative effect: the more "state" you have, the slower the class variant will be.
So much for yield vs. class. Note that the actual timing will depend on the kind of operations. For example, if the actual code that runs when next is called is slow (i.e. time.sleep(1)), then there is almost no difference between a generator and a class!
Cython
If you want a Cython iterator class that is fast, it has to be a cdef class. Otherwise you don't get a really fast class. The reason is that only a cdef class creates an extension type that directly implements the tp_iternext field! I'll use IPython's %%cython to compile the code (so I don't have to include the setup):
%%cython
def test():
while True:
yield 1
class Test(object):
def __iter__(self):
return self
def __next__(self):
return 1
cdef class Test_cdef(object):
def __iter__(self):
return self
def __next__(self):
return 1
%timeit list(islice(test(), 1000))
# 113 µs ± 4.5 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit list(islice(Test(), 1000))
# 407 µs ± 16.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit list(islice(Test_cdef(), 1000))
# 62.8 µs ± 2.46 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
The timings already show that the generator and the basic class are faster than the pure-Python equivalents, but their relative performance stayed roughly the same. However, the cdef class variant beats both of them, and that is mainly because the tp_iternext slot is used instead of just implementing the __next__ method. (Inspect the Cython-generated C code if you don't trust me :))
However, it is only 2 times faster than the Python generator, which isn't bad but it's not exactly overwhelming either. To get really astonishing speedups, you need to find a way to express your program without Python objects (the fewer Python objects, the bigger the speedup). For example, if you use a dictionary to store item and multiplicity, you still store Python objects and any lookup has to be done with Python dictionary methods, even if you can call them via C API functions instead of having to look up the real methods:
%%cython
cpdef cython_count(items):
cdef dict res = dict()
for item in items:
if item in res:
res[item] += 1
else:
res[item] = 1
return res
import random
def count(items):
res = {}
for item in items:
if item in res:
res[item] += 1
else:
res[item] = 1
return res
l = [random.randint(0, 100) for _ in range(10000)]
%timeit cython_count(l)
# 2.06 ms ± 13 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit count(l)
# 3.63 ms ± 21.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
There is one catch here: you are not using collections.Counter, which has optimized C code (at least in Python 3) for this kind of operation:
from collections import Counter
%timeit Counter(l)
# 1.17 ms ± 41.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
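Counter is a drop-in replacement for the hand-written counting loop; a quick equivalence check (a sketch of my own, not from the original answer):

```python
from collections import Counter

def count(items):
    # the hand-written loop from the answer above
    res = {}
    for item in items:
        if item in res:
            res[item] += 1
        else:
            res[item] = 1
    return res

data = [1, 1, 2, 3, 3, 3]
assert count(data) == dict(Counter(data))   # same mapping item -> multiplicity
print(Counter(data).most_common(1))
# [(3, 3)]
```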
A short note here: don't use something in some_dict.keys(), because in Python 2 keys() returns a list-like object with only O(n) containment, while something in some_dict is typically O(1) (in both Pythons)! Dropping the .keys() call makes both versions faster, especially on Python 2:
def count2(items):
res = {}
for item in items:
if item in res.keys(): # with "keys()"
res[item] += 1
else:
res[item] = 1
return res
# Python3
l = [random.randint(0, 100) for _ in range(10000)]
%timeit count(l)
# 3.63 ms ± 29 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit count2(l)
# 5.9 ms ± 20 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# Python2
l = [random.randint(0, 10000) for _ in range(10000)]
%timeit count(l)
# 100 loops, best of 3: 4.59 ms per loop
%timeit count2(l)
# 1 loop, best of 3: 2.65 s per loop <--- WHOOPS!!!
This shows that you can only hope for roughly 3-4x speedups with Cython (and C extensions) as long as you use Python structures, but even minor mistakes like using ".keys()" can cost you much more in terms of performance if used incorrectly.
Optimizing Cython
What can you do if you want it faster? The answer is relatively easy: create your own data structure based on C types instead of Python types.
That means you have to think about the design:
- Which types do you want to support in uniqComb**? Do you want integers (the examples suggest so, but I suppose you want arbitrary Python objects)?
- Do you want introspection from Python (like the current state)? If you want it, it makes sense to keep the multiplicities as a Python object, but if you don't care, you can save them as integer-like objects instead of Python objects.
- Do the objects you pass to the uniqComb** function need to be sortable? You used sorted, but you could also use an OrderedDict and keep the keys in order of appearance instead of by numerical value.
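The third question, sorted order vs. order of first appearance, can be illustrated in plain Python (my own example, not from the original answer):

```python
from collections import Counter, OrderedDict

items = ['b', 'a', 'b', 'c']

# keys by value (what sorted() over the Counter gives)
by_value = sorted(Counter(items))
# keys by order of first appearance (what an OrderedDict preserves)
by_appearance = list(OrderedDict.fromkeys(items))

print(by_value)       # ['a', 'b', 'c']
print(by_appearance)  # ['b', 'a', 'c']
```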
The answers to these questions (and these are only the questions I immediately asked myself, there are probably many more!) can help you decide which structure you can use internally. For example, with Cython you can interface with C++, and you could use a map containing integer keys and integer values instead of a dictionary. It is sorted by default, so you don't need to sort it manually yourself, and you operate on native integers instead of Python objects. But you lose the ability to process arbitrary Python objects in uniqComb, and you need to know how to operate with C++ types in Cython. It could be amazingly fast, though!
I didn't go down that path because I assume you want to support arbitrary orderable Python types, and I stuck with the Counter as a starting point, but I'll save the multiplicities as integer array.arrays instead of lists. Let's call it the "least invasive" optimization. It actually doesn't matter much for performance whether you use a list or an array for lstCntRpts and multiplicities, because they aren't the bottleneck, but it is a bit faster, saves a bit of memory, and more importantly it shows how to include homogeneous arrays with Cython:
%%cython
from cpython.list cimport PyList_Size # (most) C API functions can be used with cython!
from array import array
from collections import Counter
cdef class uniqCmboClassIter:
cdef list lstUniqs
cdef Py_ssize_t lenUniqs
cdef int[:] lstCntRpts # memoryview
cdef Py_ssize_t lenCmbo
cdef list cmboAsIdxUniqs
cdef int[:] multiplicities # memoryview
cdef Py_ssize_t idxIntoCmbo
cdef Py_ssize_t idxIntoUniqs
cdef bint stopIteration
cdef Py_ssize_t x
cdef Py_ssize_t y
def __init__(self, lstItems, lenCmbo):
dctCounter = Counter(lstItems)
self.lstUniqs = sorted(dctCounter)
self.lenUniqs = PyList_Size(self.lstUniqs)
self.lstCntRpts = array('i', [dctCounter[item] for item in self.lstUniqs])
self.lenCmbo = lenCmbo
self.cmboAsIdxUniqs = [None] * lenCmbo
self.multiplicities = array('i', [0] * self.lenUniqs)
self.idxIntoCmbo, self.idxIntoUniqs = 0, 0
while self.idxIntoCmbo != self.lenCmbo and self.idxIntoUniqs != self.lenUniqs:
count = min(self.lstCntRpts[self.idxIntoUniqs], self.lenCmbo-self.idxIntoCmbo)
self.cmboAsIdxUniqs[self.idxIntoCmbo : self.idxIntoCmbo + count] = [self.idxIntoUniqs] * count
self.multiplicities[self.idxIntoUniqs] = count
self.idxIntoCmbo += count
self.idxIntoUniqs += 1
# print("self.multiplicities:", self.multiplicities)
# print("self.cmboAsIdxUniqs:", self.cmboAsIdxUniqs)
if self.idxIntoCmbo != self.lenCmbo:
return
self.stopIteration = False
self.x = 0
self.y = 0
return
def __iter__(self):
return self
def __next__(self):
if self.stopIteration is True:
raise StopIteration
nextCmbo = tuple(self.lstUniqs[idxUniqs] for idxUniqs in self.cmboAsIdxUniqs)
for self.idxIntoCmbo in reversed(range(self.lenCmbo)):
self.x = self.cmboAsIdxUniqs[self.idxIntoCmbo]
self.y = self.x + 1
if self.y < self.lenUniqs and self.multiplicities[self.y] < self.lstCntRpts[self.y]:
break
else:
self.stopIteration = True
return nextCmbo
for self.idxIntoCmbo in range(self.idxIntoCmbo, self.lenCmbo):
self.x = self.cmboAsIdxUniqs[self.idxIntoCmbo]
self.cmboAsIdxUniqs[self.idxIntoCmbo] = self.y
self.multiplicities[self.x] -= 1
self.multiplicities[self.y] += 1
# print("# multiplicities:", multiplicities)
while self.y != self.lenUniqs and self.multiplicities[self.y] == self.lstCntRpts[self.y]:
self.y += 1
if self.y == self.lenUniqs:
break
return nextCmbo
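The int[:] declarations above are Cython typed memoryviews; on the pure-Python side they correspond to typed, zero-copy views over an array.array buffer (a minimal sketch of my own, not from the original answer):

```python
from array import array

counts = array('i', [0] * 5)   # homogeneous C-int storage
view = memoryview(counts)      # zero-copy typed view, analogous to Cython's int[:]

view[2] = 7                    # writes through to the underlying array buffer
print(counts[2])
# 7
print(view.format)
# i
```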
You actually didn't share the parameters of your timings, but I tried it with some of mine:
from itertools import combinations
import random
import time
def create_values(maximum):
vals = [random.randint(0, maximum) for _ in range(48)]
print('length: ', len(vals))
print('sorted values: ', sorted(vals))
print('uniques: ', len(set(vals)))
print('uniques in percent: {:%}'.format(len(set(vals)) / len(vals)))
return vals
class Timer(object):
def __init__(self):
pass
def __enter__(self):
self._time = time.time()
def __exit__(self, *args, **kwargs):
print(time.time() - self._time)
vals = create_values(maximum=50) # and 22 and 75 and 120
n = 6
with Timer():
list(combinations(vals, n))
with Timer():
list(uniqCmboClassIter(vals, n))
with Timer():
list(uniqCmboClassIterOriginal(vals, n))
with Timer():
list(uniqCmboYieldIterOriginal(vals, n))
length:  48
sorted values:  [0, 0, 0, 1, 2, 2, 4, 5, 5, 6, 6, 6, 7, 7, 7, 8, 8, 8, 8, 9, 9, 10, 11, 11, 12, 12, 12, 13, 13, 14, 14, 14, 15, 15, 15, 17, 18, 19, 19, 19, 19, 20, 20, 20, 21, 21, 22, 22]
uniques:  21
uniques in percent: 43.750000%
6.250450611114502
0.4217393398284912
4.250436305999756
2.7186365127563477

length:  48
sorted values:  [1, 1, 2, 5, 6, 7, 7, 8, 8, 9, 11, 13, 13, 15, 16, 16, 16, 16, 17, 19, 19, 21, 21, 23, 24, 26, 27, 28, 28, 29, 31, 31, 34, 34, 36, 36, 38, 39, 39, 40, 41, 42, 44, 46, 47, 47, 49, 50]
uniques:  33
uniques in percent: 68.750000%
6.2034173011779785
4.343803882598877
42.39261245727539
26.65750527381897

length:  48
sorted values:  [4, 4, 7, 9, 10, 14, 14, 17, 19, 21, 23, 24, 24, 26, 34, 36, 40, 42, 43, 43, 45, 46, 46, 52, 53, 58, 59, 59, 61, 63, 66, 68, 71, 72, 72, 75, 76, 80, 82, 82, 83, 84, 86, 86, 89, 92, 97, 99]
uniques:  39
uniques in percent: 81.250000%
6.859697341918945
10.437987327575684
104.12988543510437
65.25306582450867

length:  48
sorted values:  [4, 7, 11, 19, 24, 29, 32, 36, 49, 49, 54, 57, 58, 60, 62, 65, 67, 70, 70, 72, 72, 79, 82, 83, 86, 89, 89, 90, 91, 94, 96, 99, 102, 111, 112, 118, 120, 120, 128, 129, 129, 134, 138, 141, 141, 144, 146, 147]
uniques:  41
uniques in percent: 85.416667%
6.484673023223877
13.610010623931885
136.28764533996582
84.73834943771362
It definitely performs much better than the original approaches; in fact it is several times faster with just the type declarations. There is probably a lot more that could be optimized (disable bounds checking, use Python C API function calls, use unsigned or smaller integer types if you know the "maximum" and "minimum" of your multiplicities, ...), but the fact that it isn't much slower than itertools.combinations even for 80% unique items, and much faster than any of the original implementations, is good enough for me. :-)