Python incurs an overhead every 98 executions?
I have a big DataFrame and I just want to assign a constant to a new column. On the first executions (iterations 1 to 97) everything is fine and the code runs fast. Then memory spikes at the 98th iteration, RAM spikes again at the 196th iteration (98 iterations later), and the loop keeps rocketing memory at every i where i is a multiple of 98...
I guess this magic number 98 may vary from PC to PC. You may have to change the DataFrame size to reproduce the issue.
Here is my code.
Edit: I don't think this is garbage collection, because gc.isenabled() at the end of the code returns False:
import gc
import pandas as pd
import numpy as np

n = 2000000
data = pd.DataFrame({'a': range(n)})

for i in range(1, 100):
    data['col_' + str(i)] = np.random.choice(['a', 'b'], n)

gc.disable()
for i in range(1, 600):
    data['test_{}'.format(i)] = i
    print(str(i))  # slow at every i that is a multiple of ~98

gc.isenabled()
> False
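A small timing sketch to pinpoint the slow iterations yourself (the 0.1 s threshold is an arbitrary cutoff, and the exact period will vary by machine):

import time
import pandas as pd

n = 2000000
data = pd.DataFrame({'a': range(n)})
for i in range(1, 300):
    t0 = time.perf_counter()
    data['test_{}'.format(i)] = i
    elapsed = time.perf_counter() - t0
    if elapsed > 0.1:  # flag only the unusually slow iterations
        print(i, round(elapsed, 3))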
Here is my memory usage; the peaks appear at iterations i*98 (where i is an integer).
I'm on Windows 10, Python 3.6.1 | Anaconda 4.4.0 | pandas 0.24.2, with 16 GB of RAM and an 8-core CPU.
First, I'd like to confirm that I see the same behavior on Ubuntu with 16 GB of RAM and GC disabled, so this is definitely not a GC or Windows memory-management issue.
Second, on my system it slows down after every 99 iterations: after the 99th, after the 198th, after the 297th, and so on. In any case, my swap file is rather limited, so when RAM + swap fills up, the script crashes with the following stack trace:
294
295
296
297
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/base.py", line 2657, in get_loc
return self._engine.get_loc(key)
File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 132, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1601, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1608, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'test_298'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/pandas/core/internals/managers.py", line 1053, in set
loc = self.items.get_loc(item)
File "/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/base.py", line 2659, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 132, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1601, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1608, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'test_298'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "py-memory-test.py", line 12, in <module>
data['test_{}'.format(i)] = i
File "/usr/local/lib/python3.6/dist-packages/pandas/core/frame.py", line 3370, in __setitem__
self._set_item(key, value)
File "/usr/local/lib/python3.6/dist-packages/pandas/core/frame.py", line 3446, in _set_item
NDFrame._set_item(self, key, value)
File "/usr/local/lib/python3.6/dist-packages/pandas/core/generic.py", line 3172, in _set_item
self._data.set(key, value)
File "/usr/local/lib/python3.6/dist-packages/pandas/core/internals/managers.py", line 1056, in set
self.insert(len(self.items), item, value)
File "/usr/local/lib/python3.6/dist-packages/pandas/core/internals/managers.py", line 1184, in insert
self._consolidate_inplace()
File "/usr/local/lib/python3.6/dist-packages/pandas/core/internals/managers.py", line 929, in _consolidate_inplace
self.blocks = tuple(_consolidate(self.blocks))
File "/usr/local/lib/python3.6/dist-packages/pandas/core/internals/managers.py", line 1899, in _consolidate
_can_consolidate=_can_consolidate)
File "/usr/local/lib/python3.6/dist-packages/pandas/core/internals/blocks.py", line 3149, in _merge_blocks
new_values = new_values[argsort]
MemoryError
So it seems that pandas sometimes performs some kind of merging/consolidation/repacking on insert. Let's look at the insert function in core/internals/managers.py, which contains the following lines:
def insert(self, loc, item, value, allow_duplicates=False):
    ...
    self._known_consolidated = False

    if len(self.blocks) > 100:
        self._consolidate_inplace()
I guess this is exactly what we were looking for! Every insert creates a new block, and when the number of blocks exceeds a limit, extra work (consolidation) is performed. The gap between the 100-block limit in the code and our empirical number of roughly 98-99 is probably explained by some extra DataFrame metadata that also takes up a bit of room.
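You can watch this happen with a minimal sketch (it peeks at the private _data attribute, which is the BlockManager in pandas 0.x and not a public API): the block count climbs by one per inserted column and then collapses once consolidation fires.

import pandas as pd

df = pd.DataFrame({'a': range(1000)})
for i in range(1, 120):
    df['test_{}'.format(i)] = i
    # each insert appends one new block; past the 100-block threshold,
    # consolidation merges same-dtype blocks back into a few big ones
    print(i, len(df._data.blocks))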
UPD: to verify this hypothesis I tried changing 100 -> 1000000, and it worked fine: no performance gaps and no MemoryError. However, there is no public API to change this parameter at run-time; it is simply hard-coded.
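A practical workaround (a sketch, assuming the new columns can be built up front) is to construct them in a single DataFrame and concatenate once, so the BlockManager never accumulates hundreds of blocks:

import pandas as pd
import numpy as np

n = 2000000  # illustrative; shrink this if memory is tight
data = pd.DataFrame({'a': range(n)})

# build all constant columns at once; same-dtype columns end up
# consolidated together instead of as 599 separate blocks
new_cols = pd.DataFrame(
    {'test_{}'.format(i): np.full(n, i) for i in range(1, 600)},
    index=data.index)
data = pd.concat([data, new_cols], axis=1)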
UPD2: filed an issue with pandas, since a MemoryError doesn't look appropriate for such a simple program.