Python parallel processing - Different behaviors between Linux and Windows

Question

I am trying to make my code parallel, but I ran into something strange that I cannot explain.

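Let me define the context. I have a very heavy computation to do: reading multiple files and running machine learning analysis on them, with a lot of math involved. My code runs fine sequentially on both Windows and Linux, but when I try to use multiprocessing, everything breaks. Below is an example, which I first developed on Windows: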

from multiprocessing.dummy import Pool as ThreadPool 

def ppp(element):
    window,day = element
    print(window,day)
    time.sleep(5)
    return

if __name__ == '__main__':
    #%% Reading datasets
    print('START')
    start_time = current_milli_time()
    tree = pd.read_csv('datan\days.csv')
    days = list(tree.columns)
    # to be able to run this code uncomment the following line and comment the previous two
    # days = ['0808', '0810', '0812', '0813', '0814', '0817', '0818', '0827', '0828', '0829']
    windows = [1000]
    processes_args = list(itertools.product(windows, days))

    pool = ThreadPool(8) 
    results = pool.map_async(ppp, processes_args)
    pool.close() 
    pool.join() 
    print('END', current_milli_time()-start_time, 'ms')

When I run this code on Windows, the output looks like this:

START
100010001000 1000 1000100010001000      081008120808
08130814
0818
082708171000
1000    
  08290828

END 5036 ms

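A messy set of prints, all printed within 125 ms. The behavior is the same on Linux. However, I noticed that if I apply this method on Linux and look at 'htop', what I see is a set of threads that are randomly selected for execution but never execute in parallel. So, after some Google searches, I came up with this new code: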

from multiprocessing import Pool as ProcessPool

def ppp(element):
    window,day = element
    print(window,day)
    time.sleep(5)
    return

if __name__ == '__main__':
    #%% Reading datasets
    print('START')
    start_time = current_milli_time()
    tree = pd.read_csv('datan\days.csv')
    days = list(tree.columns)
    # to be able to run this code uncomment the following line and comment the previous two
    # days = ['0808', '0810', '0812', '0813', '0814', '0817', '0818', '0827', '0828', '0829']
    windows = [1000]
    processes_args = list(itertools.product(windows, days))

    pool = ProcessPool(8) 
    results = pool.map_async(ppp, processes_args)
    pool.close() 
    pool.join() 
    print('END', current_milli_time()-start_time, 'ms')

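As you can see, I changed the import statement, which basically creates a process pool instead of a thread pool. That solved the problem on Linux: in the real scenario I have 8 processors running at 100%, with 8 processes running in the system. The output looks like the previous one. However, when I use this code on Windows, the whole run takes more than 10 seconds and, moreover, I do not get any of the prints from ppp, only those from the main.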

I really tried to search for an explanation, but I do not understand why this happens. For example, one related question talks about safe code on Windows, and its answer suggests moving to threading, which, as a side effect, makes the code concurrent rather than parallel. Another related question is similar. All these questions describe fork() and spawning processes, but I personally think that is not the point of my question. The Python documentation still explains that Windows does not have a fork() method (https://docs.python.org/2/library/multiprocessing.html#programming-guidelines).

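To conclude, I am now convinced that I cannot do parallel processing on Windows, but I suspect that the conclusion I am drawing from all these discussions is wrong. So my question should be: is it possible to run processes or threads in parallel (on different CPUs) on Windows?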

EDIT: added if __name__ == '__main__' to both examples.

EDIT2: to be able to run the code, this function and these imports are needed:

import time
import itertools    
current_milli_time = lambda: int(round(time.time() * 1000))

Answer 1

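Windows has no fork(), so the multiprocessing module emulates it by starting a fresh interpreter and passing state to it with pickle/unpickle. When the child starts, the main module is re-imported and any code at global scope is executed again. The docs state: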

Instead one should protect the “entry point” of the program by using if __name__ == '__main__'
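To see the re-import for yourself, here is a minimal sketch that forces the 'spawn' start method (the only one available on Windows), so the same behavior can be reproduced on Linux too. The function name work and the printed message are made up for illustration.

```python
import multiprocessing as mp
import os

# Module-level code: under 'spawn' this line runs once in the parent and
# again in every child, because each child re-imports the main module.
print(f"module loaded in process {os.getpid()}")


def work(x):
    """Trivial worker used only to give the pool something to do."""
    return x * 2


if __name__ == '__main__':
    # Guarded block: runs in the parent only, so the pool is created once.
    ctx = mp.get_context('spawn')
    with ctx.Pool(2) as pool:
        print(pool.map(work, [1, 2, 3]))
```

Without the guard, each child would try to create its own pool while being imported, which is the failure mode the docs warn about.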

Also, you should use the AsyncResult returned by pool.map_async, or simply use pool.map.
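A minimal sketch of both options, reusing the ppp name from the question. The return value, the shortened sleep, the run helper and the 30 second timeout are assumptions added for illustration. The point is that get() waits for the workers and re-raises any exception raised inside them, whereas an AsyncResult that is never read hides such errors.

```python
import time
from multiprocessing import Pool


def ppp(element):
    """Worker: unpack one (window, day) pair, simulate work, return a value."""
    window, day = element
    time.sleep(0.1)  # stand-in for the real computation
    return f"{window}-{day}"


def run(args, workers=2):
    """Run ppp over args in a process pool using both calling styles."""
    with Pool(workers) as pool:
        # Option 1: map() blocks and returns the results directly.
        direct = pool.map(ppp, args)

        # Option 2: map_async() returns an AsyncResult; get() waits for
        # completion and re-raises any exception raised in a worker.
        async_result = pool.map_async(ppp, args)
        collected = async_result.get(timeout=30)
    return direct, collected


if __name__ == '__main__':
    args = [(1000, '0808'), (1000, '0810'), (1000, '0812')]
    direct, collected = run(args)
    print(direct)
    print(collected)
```

Returning values from the worker instead of printing also sidesteps the case where child-process output never reaches the console you are looking at.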

Answer 2

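You can do parallel processing under Windows (I have a script running right now that does heavy calculations and uses 100% of all 8 cores), but the way it works is by creating parallel processes, not threads (because of the GIL, threads will not run in parallel except for I/O operations). A few key points: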

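Use concurrent.futures.ProcessPoolExecutor() (note that it is a process pool, not a thread pool). See https://docs.python.org/3/library/concurrent.futures.html. In short, you put the code you want to parallelize into a function and then call executor.map() to do the splitting.
Note that on Windows every parallel process starts from scratch, so you will probably need if __name__ == '__main__': in a few places to separate what you do in the main process from what you do in the others. Data that you load in the main script and hand to the workers is copied to the child processes, so it must be serializable (picklable, in Python jargon).
To use the cores efficiently, avoid writing data to objects shared by all processes (for example, a progress counter or a common data structure). Otherwise the synchronization between processes will kill performance. So monitor execution from the Task Manager instead.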
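The points above can be sketched roughly as follows. The function names heavy and run_parallel and the input values are made up for illustration, not taken from my real script. Each worker only returns its result, and the main process does the aggregation, so nothing is shared between processes.

```python
import os
from concurrent.futures import ProcessPoolExecutor


def heavy(n):
    """CPU-bound stand-in for the real computation: sum of squares below n."""
    return sum(i * i for i in range(n))


def run_parallel(inputs, workers=None):
    """Map heavy() over inputs in worker processes; results keep input order."""
    with ProcessPoolExecutor(max_workers=workers) as executor:
        return list(executor.map(heavy, inputs))


if __name__ == '__main__':
    # Only the main process enters this block; on Windows the workers
    # re-import the module and skip it.
    inputs = [10, 100, 1000]
    print(run_parallel(inputs, workers=os.cpu_count()))
```

Arguments and return values travel between processes by pickling, so keep them small and picklable.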
