How does numpy avoid copy on access on child process from gc reference counting

Question

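On POSIX systems, after fork(), data should only be copied to the child process once it is written to (copy-on-write). But because Python stores the reference count in the object header, every time a list is iterated in the child process, the list's objects are copied into the child's memory.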
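To make the "reading is really writing" point concrete, here is a minimal sketch using only the standard library. The helper name `refcounts_around_loop` is hypothetical, and the exact numbers returned by `sys.getrefcount` are a CPython implementation detail; only the difference between the two readings matters here.

```python
import sys


def refcounts_around_loop(items):
    """Return the first element's refcount before and during iteration."""
    before = sys.getrefcount(items[0])
    during = before
    for x in items:
        # While the loop variable x is bound to items[0], CPython has
        # incremented the count stored in that object's header.
        during = sys.getrefcount(items[0])
        break
    return before, during


# Fresh object() instances are used because small ints and other
# immortal objects do not carry a meaningful refcount on recent CPython.
items = [object() for _ in range(3)]
before, during = refcounts_around_loop(items)
print(before, during)
```

The count changes merely because the loop variable points at the element. That change is a write to the memory page holding the element's header, which is what forces the kernel to copy the page in a forked child.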

Testing with lists and other data structures, I can confirm this behavior, and a core developer has corroborated it as well: https://github.com/python/cpython/pull/3705#issuecomment-420201071

But when I tested with numpy arrays, this did not happen.

import ctypes
import os

import numpy as np
import psutil


def sharing_with_numpy():
    ppid = os.getpid()
    print(f'\nSystem used memory: {int(psutil.virtual_memory().used / (1024 * 1024))} MB')
    big_data = np.array([[item, item] for item in list(range(10000000))])
    print(f'\nSystem used memory: {int(psutil.virtual_memory().used / (1024 * 1024))} MB')
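    # Read the refcount straight from the array's object header (CPython detail).
    print(ctypes.c_long.from_address(id(big_data)).value)
    # Each row view keeps a reference to big_data, so the count rises by two.
    ref1 = big_data[0]
    ref2 = big_data[0]
    print(ctypes.c_long.from_address(id(big_data)).value)

    print(f'\nSystem used memory: {int(psutil.virtual_memory().used / (1024 * 1024))} MB')
    # Only the original parent forks, producing five children.
    for i in range(5):
        if ppid == os.getpid():
            os.fork()
    # The parent and every child iterate over the whole array.
    for x in big_data:
        pass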
    print(f'\nSystem used memory: {int(psutil.virtual_memory().used / (1024 * 1024))} MB')


if __name__ == "__main__":
    sharing_with_numpy()

Output:

System used memory: 163 MB # before array allocation
System used memory: 318 MB # after array allocation
1 # reference count of the array
3 # reference count of the array
System used memory: 318 MB # before fork()
System used memory: 324 MB # after fork() and loop to reference array
System used memory: 328 MB # after fork() and loop to reference array
System used memory: 329 MB # after fork() and loop to reference array
System used memory: 331 MB # after fork() and loop to reference array
System used memory: 340 MB # after fork() and loop to reference array
System used memory: 342 MB # after fork() and loop to reference array

As you can see, memory usage increases, but only by a small amount, which suggests the whole array is not being copied.

I have been trying to understand what is happening, but without luck. Can you explain it? Thanks.

Answer 1

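A numpy array has an object header, which contains a pointer to the underlying data; that data is allocated separately. The data itself carries no reference counts, so merely reading it does not modify it.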
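The split between header and data can be observed from Python. The following is a small sketch, assuming CPython (where `id()` returns the object's address) and a hypothetical 4x2 array; it is an illustration, not a description of numpy's internals beyond what the public attributes expose.

```python
import numpy as np

# Hypothetical small array; np.zeros allocates and owns its own buffer.
arr = np.zeros((4, 2), dtype=np.int64)

header_address = id(arr)          # CPython detail: address of the ndarray object
buffer_address = arr.ctypes.data  # address of the first byte of the data buffer

# Iterating over a 2-D array builds a new, small view object per row.
rows = [row for row in arr]

print(header_address != buffer_address)                 # separate allocations
print(all(row.base is arr for row in rows))             # views refer back to arr
print(all(np.shares_memory(row, arr) for row in rows))  # no data was copied
```

Each row view holds a reference to `arr`, so the count in the array's header does change during iteration. That touches only the small header, while the data buffer is read and never written.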

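Since numpy arrays are bulk-allocated blocks, the larger allocations for the backing data store do not come from the same pool the object headers come from (they are typically bulk-allocated directly from the OS, via mmap [*NIX] or VirtualAlloc [Windows], rather than from a memory heap that is subdivided across many allocations). Because they do not share a page with anything that is reference counted (they are raw C types, not Python ints or the like with their own object headers), those pages are never written to, and therefore never copied.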
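The difference between raw C values and boxed Python objects can be sketched as follows. The variable names are hypothetical, and the size reported by `sys.getsizeof` for an int depends on the CPython build, so the comparison below is only relative.

```python
import sys

import numpy as np

n = 1000
arr = np.arange(n, dtype=np.int64)

# tolist() builds n separate Python int objects, each with its own header.
as_list = arr.tolist()

# Bytes per element inside the numpy buffer: just the C int64 payload.
raw_bytes_per_item = arr.nbytes // n

# Bytes for one boxed Python int: payload plus an object header that
# includes the reference count.
boxed_bytes_per_item = sys.getsizeof(as_list[-1])

print(raw_bytes_per_item, boxed_bytes_per_item)
```

The numpy buffer has no room for a per-element reference count, so nothing in it needs updating when an element is read. In the list, every element carries a header that is written on access.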

