Using threading/multiprocessing in Python to download images concurrently

Question

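I have a list of search queries that I use to build a dataset:

classes = [...]. The list contains 100 search queries.

Basically, I split the list into 4 chunks of 25 queries each.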

def divide_chunks(l, n):
    # Yield successive n-sized slices of the list passed in as l
    for i in range(0, len(l), n):
        yield l[i:i + n]

classes = list(divide_chunks(classes, 25))

Below, I wrote a function that iterates over the queries in one chunk and downloads each of them:

def download_chunk(n):
    for label in classes[n]:
        try:
            downloader.download(label, limit=1000, output_dir='dataset', adult_filter_off=True, force_replace=False, verbose=True)
        except Exception:
            # Skip a query that fails so the rest of the chunk still downloads
            pass

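However, I want to run all 4 chunks concurrently. In other words, I want 4 separate iterations running at the same time. I tried both threading and multiprocessing, but neither of them works: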

process_1 = Process(target=download_chunk(0))
process_1.start()
process_2 = Process(target=download_chunk(1))
process_2.start()
process_3 = Process(target=download_chunk(2))
process_3.start()
process_4 = Process(target=download_chunk(3))
process_4.start()

process_1.join()
process_2.join()
process_3.join()
process_4.join()

###########################################################

thread_1 = threading.Thread(target=download_chunk(0)).start()
thread_2 = threading.Thread(target=download_chunk(1)).start()
thread_3 = threading.Thread(target=download_chunk(2)).start()
thread_4 = threading.Thread(target=download_chunk(3)).start()

Answer 1

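You are calling download_chunk outside of the thread/process. Writing target=download_chunk(0) runs the function immediately in the main program and passes its return value (None) as the target, so the chunks are processed one after another. You need to pass the function and its arguments separately so that execution is deferred:

For example: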

Process(target=download_chunk, args=(0,))

Refer to the multiprocessing docs for more information about using the multiprocessing.Process class.
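The same fix applies to the threading attempt in the question. The following is only a minimal sketch: download_chunk_stub is a hypothetical stand-in that records its argument rather than calling the real downloader, and run_chunks_in_threads is an illustrative helper name, not something from the question.

```python
import threading


def download_chunk_stub(n, log):
    # Hypothetical stand-in for download_chunk: record the chunk index
    # and the name of the thread that handled it.
    log.append((n, threading.current_thread().name))


def run_chunks_in_threads(num_chunks=4):
    log = []
    # Pass the function object and its arguments separately so the call
    # happens inside each thread, not while the Thread is being built.
    threads = [
        threading.Thread(target=download_chunk_stub, args=(n, log), name=f"chunk-{n}")
        for n in range(num_chunks)
    ]
    for t in threads:
        t.start()
    # Keep the Thread objects so they can be joined afterwards.
    for t in threads:
        t.join()
    return sorted(log)


if __name__ == "__main__":
    print(run_chunks_in_threads())
```

Note that threading.Thread(...).start() returns None, so the Thread object has to be stored in a variable before start() is called if it needs to be joined later.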

For this use case, I would suggest using multiprocessing.Pool:

from multiprocessing import Pool

if __name__ == '__main__':
    with Pool(4) as pool:
        pool.map(download_chunk, range(4))

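The pool takes care of creating, starting, and later joining the 4 processes. It distributes the arguments from the iterable, range(4) in this case, across those worker processes, so download_chunk is called once for each argument.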

More info about multiprocessing.Pool can be found in the docs.
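Since downloading is mostly waiting on the network, a thread-based pool may be enough. As a hedged sketch of that variant, the example below swaps in multiprocessing.pool.ThreadPool, which offers the same map interface as Pool but uses threads instead of processes. download_chunk_stub and run_with_thread_pool are hypothetical names used only for illustration; the stub returns a string instead of downloading anything.

```python
from multiprocessing.pool import ThreadPool


def download_chunk_stub(n):
    # Hypothetical stand-in for download_chunk: report which chunk ran.
    return f"chunk {n} done"


def run_with_thread_pool(num_chunks=4):
    # One worker thread per chunk; map blocks until every call has
    # finished and returns the results in the order of the inputs.
    with ThreadPool(num_chunks) as pool:
        return pool.map(download_chunk_stub, range(num_chunks))


if __name__ == "__main__":
    print(run_with_thread_pool())
```

Because map preserves input order, the results line up with range(num_chunks) regardless of which worker finishes first.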
