是否可以在 Python 中跟踪多处理程序的 "operations per second"?

Is it possible to track "operations per second" of a multiprocessing program in Python?

我正在编写一个程序,使用 Python 中的 'map' 方法下载数千张图像。有点像这样,

def download_image(image):
    save_dir = "[PATH TO SAVE IMAGES]"
    image_url = image['url']
    image_name = image['name']

    image_data = requests.get(image_url).content
    with open(os.path.join(save_dir, f"{image_name}.jpg"), 'wb') as f:
        f.write(image_data)

from multiprocessing import Pool

pool = Pool(8)

downloads = pool.map(download_image, images)

pool.close()
pool.join()

我想跟踪程序的“每秒下载量”,以 (1) 好奇心和 (2) 优化所需的进程数。已经有一段时间了,但我记得听说由于 Python 的 multiprocessing 模块的进程独立运行,所以很难完成这样的事情。

我(在写这篇文章时)的一个想法是简单地计算程序从 'Pool' 创建到 'Pool' 关闭的运行时间,然后将这段时间除以下载的图像数量。这种方法有些东西似乎没有吸引力,但如果没有更好的选择,我想它必须这样做。

即使您似乎在朝另一个方向(线程)前进,我想我还是会回答原来的问题:

我要冒险了,假设你不需要 downloads 的输出,因为你不需要 return 函数 download_image 的任何东西反正。如果您需要,可以很容易地更改此示例,例如将结果附加到列表中。我还将假设顺序也不重要,因为您没有保留结果。考虑到这些,我建议使用 imap_unordered 而不是 map,这样每次池中的一名工作人员完成任务时,您都可以有效地获得“消息”:

from multiprocessing import Pool
from time import time

def download_image(image):
    save_dir = "[PATH TO SAVE IMAGES]"
    image_url = image['url']
    image_name = image['name']

    image_data = requests.get(image_url).content
    with open(os.path.join(save_dir, f"{image_name}.jpg"), 'wb') as f:
        f.write(image_data)

if __name__ == "__main__":
#   Get in the habit of never calling anything that could create a child process
#such as creating a "Pool" or simply calling "multiprocessing.Process" without
#guarding execution by "if __name__ == '__main__':". This is necessary when using
#Windows, it is needed in MacOS with python 3.8 and above, and is highly encouraged
#everywhere else
    pool = Pool(8) #  <- child processes are created here which can't be allowed
                   #     to happen when this file is `import`ed (which is what
                   #     `if __name__ == "__main__":` protects against).
    completed = 0
    t = time()
    for result in pool.imap_unordered(download_image, images):
        #`result` is unused in this case, but could easily be put to some use
        completed += 1
        if time() >= t+60: #once a minute
            rate = completed / (time() - t)
            print(f'{rate} operations per second')
            t = time()
            completed = 0
    print("done")

    pool.close()
    pool.join()