Python concurrent.futures imports libraries multiple times (executes top-level code multiple times)

Question

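For the following script (Python 3.6, Windows, Anaconda), I noticed that the libraries are imported as many times as there are worker processes, and print("Hello") is executed the same number of times.

I thought the extra processes would be started only for the func1 calls, not for the whole program. The real func1 is a heavy CPU-bound task that will be executed millions of times.

Is this the right choice of framework for such a task?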

import datetime

import pandas as pd
import numpy as np
from concurrent.futures import ProcessPoolExecutor

print("Hello")

def func1(x):
    return x


if __name__ == '__main__':
    print(datetime.datetime.now())    
    print('test start')

    with ProcessPoolExecutor() as executor:
        results = executor.map(func1, np.arange(1,1000))
        for r in results:
            print(r)

    print('test end')
    print(datetime.datetime.now())

Answer 1

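concurrent.futures.ProcessPoolExecutor uses the multiprocessing module to do its multiprocessing.

And, as explained in the Programming guidelines, this means you have to put any top-level code that you don't want to run in every process inside your __main__ block: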

Make sure that the main module can be safely imported by a new Python interpreter without causing unintended side effects (such a starting a new process).

... one should protect the “entry point” of the program by using if __name__ == '__main__':…

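Note that this is only necessary when using the spawn or forkserver start methods. But if you're on Windows, spawn is the default. And, at any rate, it never hurts to do this, and it usually makes the code clearer, so it's worth doing anyway.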
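To see which start method applies on a given machine, the multiprocessing module can be queried directly. The following is a minimal sketch; the helper name describe_start_methods is made up for illustration and is not part of any library.

```python
import multiprocessing


def describe_start_methods():
    """Return the default start method and all methods this platform supports."""
    return {
        "default": multiprocessing.get_start_method(),
        "available": multiprocessing.get_all_start_methods(),
    }


if __name__ == "__main__":
    info = describe_start_methods()
    # On Windows this is expected to report 'spawn' as the default.
    print(info)
```

From Python 3.7 onward, ProcessPoolExecutor also accepts an mp_context argument for choosing a start method explicitly; that argument is not available in the Python 3.6 used in the question.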

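You probably don't want to protect your imports this way. After all, the cost of running import pandas as pd once per core may seem nontrivial, but it only happens at startup, and the cost of running a heavy CPU-bound function millions of times will completely swamp it. (If it doesn't, you probably didn't want to use multiprocessing in the first place…) The same usually goes for your def and class statements (especially if they don't capture any closure variables or anything). Only setup code that is incorrect to run multiple times (like the print("Hello") in your example) needs to be protected.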

The examples in the concurrent.futures documentation (and in PEP 3148) all handle this with the "main function" idiom:

def main():
    # all of your top-level code goes here
    pass

if __name__ == '__main__':
    main()

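This has the added benefit of turning your top-level globals into locals, so you can't accidentally rely on sharing them. That is a particular problem with multiprocessing: with fork, each child inherits the parent's globals as they were when the child was forked, but with spawn they are rebuilt by re-importing the module, so the same code may work when tested on one platform and then fail when deployed on another.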
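Applied to the script from the question, the idiom might look like the sketch below. It is an illustration rather than a definitive rewrite: pandas and numpy are left out to keep it self-contained, func1 is still a placeholder for the real CPU-bound work, and the chunksize value of 100 is an arbitrary assumption that would need tuning.

```python
import datetime
from concurrent.futures import ProcessPoolExecutor


def func1(x):
    # Stand-in for the real CPU-bound work; safe to define at top level.
    return x


def main():
    # Side effects live here, so they run once, in the parent process only.
    print("Hello")
    print(datetime.datetime.now())

    with ProcessPoolExecutor() as executor:
        # chunksize sends tasks to the workers in batches instead of one by one.
        results = list(executor.map(func1, range(1, 1000), chunksize=100))

    print(results[:5])
    print(datetime.datetime.now())
    return results


if __name__ == "__main__":
    main()
```

With many small tasks, passing a chunksize larger than the default of 1 to executor.map tends to reduce the per-task communication overhead between the parent and the workers.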

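If you're wondering why this happens:

With the fork start method, multiprocessing creates each new child process by cloning the parent Python interpreter and then starting the pool-servicing function right where you (or concurrent.futures) created the pool. So top-level code doesn't get re-run.

With the spawn start method, multiprocessing creates each new child process by starting a clean new Python interpreter, importing your code, and then starting the pool-servicing function. So top-level code gets re-run as part of the import.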
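The difference between the two start methods can be observed with a small experiment. The sketch below records, at top level, the PID of whichever process executed the module, and then asks a worker for its own PID. The names IMPORT_PID, report and classify are hypothetical, and the script is assumed to be saved to a file and run directly, since spawn needs an importable main module.

```python
import multiprocessing
import os

# Top-level code: records the PID of whichever process executed this line.
IMPORT_PID = os.getpid()


def report(_):
    # Runs in a worker: return the worker's PID and the PID recorded at import.
    return os.getpid(), IMPORT_PID


def classify(worker_pid, import_pid):
    """Say whether the worker itself executed the top-level code."""
    if worker_pid == import_pid:
        return "top-level code re-ran in the worker"
    return "top-level state came from another process"


if __name__ == "__main__":
    available = multiprocessing.get_all_start_methods()
    # Only fork and spawn are compared; fork does not exist on Windows.
    for method in ("fork", "spawn"):
        if method not in available:
            continue
        ctx = multiprocessing.get_context(method)
        with ctx.Pool(1) as pool:
            worker_pid, import_pid = pool.map(report, range(1))[0]
        print(method, "->", classify(worker_pid, import_pid))
```

Under fork the worker should report the parent's PID as IMPORT_PID, while under spawn it should report its own, because the module was imported again inside the worker.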

python

cpu

python-multithreading

concurrent.futures

python-multiprocessing