Using threading/multiprocessing in Python to download images concurrently
I have a list of search queries for building a dataset: classes = [...]. There are 100 search queries in this list.
Basically, I split the list into 4 chunks of 25 queries each.
def divide_chunks(l, n):
    for i in range(0, len(l), n):
        yield l[i:i + n]

classes = list(divide_chunks(classes, 25))
Below, I created a function that downloads the queries from each chunk iteratively:
def download_chunk(n):
    for label in classes[n]:
        try:
            downloader.download(label, limit=1000, output_dir='dataset', adult_filter_off=True, force_replace=False, verbose=True)
        except:
            pass
However, I want to run all 4 chunks concurrently. In other words, I want to run 4 separate iterating operations at the same time. I tried both the Threading and Multiprocessing approaches, but neither of them works:
process_1 = Process(target=download_chunk(0))
process_1.start()
process_2 = Process(target=download_chunk(1))
process_2.start()
process_3 = Process(target=download_chunk(2))
process_3.start()
process_4 = Process(target=download_chunk(3))
process_4.start()
process_1.join()
process_2.join()
process_3.join()
process_4.join()
###########################################################
thread_1 = threading.Thread(target=download_chunk(0)).start()
thread_2 = threading.Thread(target=download_chunk(1)).start()
thread_3 = threading.Thread(target=download_chunk(2)).start()
thread_4 = threading.Thread(target=download_chunk(3)).start()
You are running download_chunk outside of the thread/process. You need to supply the function and its arguments separately so that execution is deferred:
For example:
Process(target=download_chunk, args=(0,))
Refer to the multiprocessing docs for more information about using the multiprocessing.Process class.
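A minimal sketch of what the corrected version of the original four-process setup could look like, assuming download_chunk and classes are defined as above:

from multiprocessing import Process

if __name__ == '__main__':
    # Pass the function and its argument separately; download_chunk is not called here.
    processes = [Process(target=download_chunk, args=(i,)) for i in range(4)]
    for p in processes:
        p.start()   # each process now runs download_chunk(i) concurrently
    for p in processes:
        p.join()    # wait for all four chunks to finish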
For this use case, I'd recommend using multiprocessing.Pool:
from multiprocessing import Pool

if __name__ == '__main__':
    with Pool(4) as pool:
        pool.map(download_chunk, range(4))
It handles creating, starting, and later joining the 4 processes. Each process calls download_chunk with one of the arguments supplied in the iterable, in this case range(4).
More info about multiprocessing.Pool can be found in the docs.
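Since the downloads are network I/O rather than CPU-bound work, a thread-based pool would do just as well; a sketch using concurrent.futures (not part of the original answer, shown only as an alternative) might look like this:

from concurrent.futures import ThreadPoolExecutor

if __name__ == '__main__':
    # Threads are enough here because the work is waiting on network I/O, not the CPU.
    with ThreadPoolExecutor(max_workers=4) as executor:
        executor.map(download_chunk, range(4))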