Why does my loop require more memory on each iteration?
I am trying to decrease the memory requirements of my Python 3 code. Right now each iteration of the for loop requires more memory than the previous one.
I wrote a small piece of code that has the same behaviour as my project:
import numpy as np
from multiprocessing import Pool
from itertools import repeat

def simulation(steps, y):  # the function that starts the parallel execution of f()
    pool = Pool(processes=8, maxtasksperchild=int(steps/8))
    results = pool.starmap(f, zip(range(steps), repeat(y)), chunksize=int(steps/8))
    pool.close()
    return results

def f(steps, y):  # steps is used as a counter. My code doesn't need it.
    a, b = np.random.random(2)
    return y*a, y*b

def main():
    steps = 2**20  # amount of times a random sample is taken
    y = np.ones(5)  # dummy variable to show that the next iteration of the code depends on the previous one
    total_results = np.zeros((0, 2))
    for i in range(5):
        results = simulation(steps, y[i-1])
        y[i] = results[0][0]
        total_results = np.vstack((total_results, results))
    print(total_results, y)

if __name__ == "__main__":
    main()
For each iteration of the for loop the processes in simulation() each have a memory usage equal to the total memory used by my code.
Does Python clone my entire environment each time the parallel processes are run, including the variables not required by f()? How can I prevent this behaviour?
Ideally I would like my code to copy only the memory it requires to execute f(), while I can keep the results in memory.
While the script does use quite a bit of memory even with the "smaller" example values, the answer to
Does Python clone my entire environment each time the parallel
processes are run, including the variables not required by f()? How
can I prevent this behaviour?
is that it does in a way clone the environment by forking a new process, but if copy-on-write semantics are available, no actual physical memory needs to be copied until it is written to. For example, on this system
% uname -a
Linux mypc 4.2.0-27-generic #32-Ubuntu SMP Fri Jan 22 04:49:08 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
COW seems to be available and in use, but on other systems this might not be the case. On Windows it is strictly different, since a new Python interpreter is executed from the .exe instead of forking. Since you mention using htop, you are using some flavour of UNIX or a UNIX-like system, and you get COW semantics.
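A quick way to check which behaviour you are getting (not part of the original answer; it just uses the standard multiprocessing API):

import multiprocessing as mp

# 'fork' on most UNIX-likes: child processes inherit the parent's memory with COW semantics.
# 'spawn' on Windows: a fresh Python interpreter is started and nothing is inherited.
print(mp.get_start_method())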
For each iteration of the for loop the processes in simulation() each
have a memory usage equal to the total memory used by my code.
The spawned processes will display almost identical RSS, but this can be misleading, because they mostly occupy the same actual physical memory mapped into multiple processes, as long as writes do not occur. With Pool.map the story is a bit more complicated, since it "chops the iterable into a number of chunks which it submits to the process pool as separate tasks". This submitting happens over IPC, and the submitted data will be copied. In your example the IPC and the 2**20 function calls also dominate the CPU usage. Replacing the mapping in simulation with a single vectorized multiplication took the script's runtime from around 150 s down to 0.66 s.
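The answer does not show the exact rewrite, but a minimal sketch of such a vectorized replacement (assuming y is the scalar passed in from the loop, as in the question's code) could look like this:

def simulation(steps, y):
    # draw all `steps` pairs (a, b) at once and scale them by y: no pool, no IPC
    return y * np.random.random((steps, 2))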
We can observe COW with a (slightly) simplified example that allocates a large array and passes it to a spawned process for read-only processing:
import numpy as np
from multiprocessing import Process, Condition, Event
from time import sleep
import psutil

def read_arr(arr, done, stop):
    with done:
        S = np.sum(arr)
        print(S)
        done.notify()
    while not stop.is_set():
        sleep(1)

def main():
    # Create a large array
    print('Available before A (MiB):', psutil.virtual_memory().available / 1024 ** 2)
    input("Press Enter...")
    A = np.random.random(2**28)
    print('Available before Process (MiB):', psutil.virtual_memory().available / 1024 ** 2)
    input("Press Enter...")
    done = Condition()
    stop = Event()
    p = Process(target=read_arr, args=(A, done, stop))
    with done:
        p.start()
        done.wait()
    print('Available with Process (MiB):', psutil.virtual_memory().available / 1024 ** 2)
    input("Press Enter...")
    stop.set()
    p.join()

if __name__ == '__main__':
    main()
Output on this machine:
% python3 test.py
Available before A (MiB): 7779.25
Press Enter...
Available before Process (MiB): 5726.125
Press Enter...
134221579.355
Available with Process (MiB): 5720.79296875
Press Enter...
Now, if we replace the function read_arr with a function that modifies the array:
def mutate_arr(arr, done, stop):
    with done:
        arr[::4096] = 1
        S = np.sum(arr)
        print(S)
        done.notify()
    while not stop.is_set():
        sleep(1)
The results are really different:
Available before A (MiB): 7626.12109375
Press Enter...
Available before Process (MiB): 5571.82421875
Press Enter...
134247509.654
Available with Process (MiB): 3518.453125
Press Enter...
The for loop does indeed require more memory after each iteration, but that is to be expected: it stacks total_results from the mapping, so it has to allocate space for a new array that holds both the old results and the new ones, and then free the now-unused array of old results.
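One way to avoid that repeated reallocation (not spelled out in the answer, so treat this as a sketch) is to allocate total_results once and fill a slice of it per iteration:

steps, iterations = 2**20, 5
y = np.ones(iterations)
total_results = np.empty((iterations * steps, 2))    # allocate once up front
for i in range(iterations):
    results = simulation(steps, y[i-1])
    y[i] = results[0][0]
    total_results[i*steps:(i+1)*steps] = results     # write into a slice instead of vstack-ing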
Perhaps you should know the difference between a thread and a process in an operating system. See this: What is the difference between a process and a thread.
In the for loop you have processes, not threads. Threads share the address space of the process that created them; processes each have their own address space.
You can print the process id with os.getpid().
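For instance (a minimal illustration of that suggestion, not code from the question), printing the pid inside the worker shows that the pool workers are separate processes, each with its own address space:

import os
from multiprocessing import Pool

def f(x):
    return os.getpid()  # each worker reports its own process id

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        print("parent pid:", os.getpid())
        print("worker pids:", sorted(set(pool.map(f, range(100)))))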