Shared-memory and multiprocessing

Question

Following this question and its answers, I think I understand why this Python code:

import multiprocessing
import time

big_list = [
    {j: 0 for j in range(200000)}
    for i in range(60)
]

def worker():
    for dic in big_list:
        for key in dic:
            pass
        print(".")
        time.sleep(0.2)

w = multiprocessing.Process(target=worker)
w.start()

time.sleep(3600)

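keeps using more and more memory while it runs: as the child process loops, it updates the reference counts of the shared-memory objects, which triggers the "copy-on-write" mechanism (I can watch free memory shrink with cat /proc/meminfo | grep MemFree).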

What I don't understand, however, is why the same thing happens when the iteration takes place in the parent rather than in the child:

def worker():
    time.sleep(3600)

w = multiprocessing.Process(target=worker)
w.start()

for dic in big_list:
    for key in dic:
        pass
    print(".")
    time.sleep(0.2)

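The child doesn't even need to know that big_list exists.

In this small example I can work around the problem by putting del big_list in the child function, but sometimes the variable references aren't as accessible as this one, so things get complicated.

Why does this mechanism kick in, and how can I properly avoid it?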

Answer 1

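After a fork(), both parent and child "see" the same address space. The first time either one changes the memory at a common address, the copy-on-write (COW) mechanism has to clone the page containing that address. So, as far as creating COW pages is concerned, it doesn't matter whether the mutation happens in the child or in the parent.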
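To tie this back to the loops in the question: in CPython, merely taking another reference to an object updates its reference count, and that count is stored in the object's own memory. The sketch below only illustrates that a plain "read" is a write as far as the page is concerned; the names value and alias are made up for the illustration, and the exact counts printed depend on the interpreter.

```python
import sys

# Hypothetical object standing in for any entry of big_list.
value = object()

before = sys.getrefcount(value)

# Binding another name to the object does not change its contents,
# but CPython increments the reference count stored in the object.
alias = value

after = sys.getrefcount(value)

# The count went up by one, so the memory page holding the object was written to.
print(before, after)
```

After a fork(), a write like this is enough for the kernel to copy the page, whichever process performs it.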

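In your second code snippet, you left out the most important part: exactly where big_list was created. Since you say you can get away with del big_list in the child, big_list probably existed before you forked the worker process. If so, then, as above, it doesn't matter to your symptom whether big_list is modified in the parent or in the child.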

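To avoid this, create big_list after creating your child process. Then the address space it lives in won't be shared. Or, in Python 3.4 or later, use multiprocessing.set_start_method('spawn'). Then fork() won't be used to create child processes, and no address space is shared at all (which is always the case on Windows, which doesn't have fork()).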
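A minimal sketch of both suggestions together, assuming Python 3.4 or later. It uses multiprocessing.get_context('spawn') instead of set_start_method('spawn') so that the choice stays local to this script; build_big_list and main are hypothetical helper names, and the sizes passed in main are deliberately much smaller than in the question.

```python
import multiprocessing
import time


def worker(seconds):
    # The child only sleeps. Under "spawn" it starts from a fresh
    # interpreter and inherits none of the parent's pages.
    time.sleep(seconds)


def build_big_list(n_dicts=60, n_keys=200000):
    # Same shape as big_list in the question.
    return [{j: 0 for j in range(n_keys)} for _ in range(n_dicts)]


def main():
    ctx = multiprocessing.get_context("spawn")
    w = ctx.Process(target=worker, args=(1,))
    w.start()

    # Built only after the child exists. Under "fork" this ordering is
    # what keeps the data out of the shared address space.
    big_list = build_big_list(n_dicts=3, n_keys=1000)
    for dic in big_list:
        for key in dic:
            pass

    w.join()


if __name__ == "__main__":
    # "spawn" re-imports this module in the child, so the process
    # creation has to sit behind this guard.
    main()
```

Building the data inside main() rather than at module level also matters with spawn: anything at module level would be rebuilt when the child re-imports the script.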
