Uploading multiple files to S3 in parallel using boto

http://ls.pwd.io/2013/06/parallel-s3-uploads-using-boto-and-threads-in-python/

I tried the second solution mentioned in the link for uploading multiple files to S3. The code in the link never calls `join` on the threads, which means the main program can terminate even while the threads are still running. With this approach the whole program finishes much faster, but there is no guarantee that the files were uploaded correctly. Is this true? What I'm mainly wondering is whether the main program merely exits sooner, and what side effects this approach would have.
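For reference, the difference in question is roughly the following (a minimal sketch assuming the classic boto API; the bucket name and file list are placeholders):

import boto
from threading import Thread

def upload(bucket, path):
    # each thread pushes one file; this call blocks until the upload finishes
    key = bucket.new_key(path)
    key.set_contents_from_filename(path)

conn = boto.connect_s3()
bucket = conn.get_bucket('my-bucket')  # placeholder bucket name

threads = []
for path in ['file1.txt', 'file2.txt', 'file3.txt']:  # placeholder files
    t = Thread(target=upload, args=(bucket, path))
    threads.append(t)
    t.start()

# the linked code skips this join loop, so the main thread can exit
# while uploads are still in flight, with no guarantee they completed
for t in threads:
    t.join()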

Just playing around with this a bit, I noticed that multiprocessing takes a while to tear down a pool, but otherwise there isn't much to it.

The test code is:

from time import time, sleep
from multiprocessing.pool import Pool, ThreadPool
from threading import Thread


N_WORKER_JOBS = 10


def worker(x):
    # stand-in for an I/O-bound job such as an S3 upload
    # print("working on", x)
    sleep(0.1)


def mp_proc(fn, n):
    # process pool: time pool creation and teardown separately from the work
    start = time()
    with Pool(N_WORKER_JOBS) as pool:
        t1 = time() - start
        pool.map(fn, range(n))
        start = time()
    t2 = time() - start  # exiting the with-block tears the pool down
    print(f'Pool creation took {t1*1000:.2f}ms, teardown {t2*1000:.2f}ms')


def mp_threads(fn, n):
    # same timing as above, but with a thread pool instead of processes
    start = time()
    with ThreadPool(N_WORKER_JOBS) as pool:
        t1 = time() - start
        pool.map(fn, range(n))
        start = time()
    t2 = time() - start
    print(f'ThreadPool creation took {t1*1000:.2f}ms, teardown {t2*1000:.2f}ms')


def threads(fn, n):
    # plain threads, explicitly joined so all work finishes before returning
    threads = []
    for i in range(n):
        t = Thread(target=fn, args=(i,))
        threads.append(t)
        t.start()
    for t in threads:
        t.join()


# on spawn-based platforms this driver loop would need an
# `if __name__ == "__main__":` guard; with fork on Linux it's fine
for test in [mp_proc, mp_threads, threads]:
    times = []
    for _ in range(7):
        start = time()
        test(worker, 10)
        times.append(time() - start)

    times = ', '.join(f'{t*1000:.2f}' for t in times)
    print(f'{test.__name__} took {times}ms')

I get the following times (Python 3.7.3, Linux 5.0.8):

  • mp_proc ~220ms
  • mp_threads ~200ms
  • threads ~100ms

However, the teardown times are all ~100ms, which brings everything back into line: roughly 100ms of actual work plus ~100ms of teardown accounts for the pool timings.

I looked into it with logging and the source code, and it seems to be because `_handle_workers` only checks every 100ms (it does some status checks and then sleeps for 0.1 seconds).
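One way to see this polling delay in isolation (a quick sketch, under the assumption that `close()`/`join()` goes through the same handler thread as the with-block's teardown) is to time the shutdown on its own:

from time import time, sleep
from multiprocessing.pool import ThreadPool

def worker(x):
    sleep(0.1)

pool = ThreadPool(10)
pool.map(worker, range(10))

start = time()
pool.close()  # no more work; _handle_workers exits on its next 100ms wake-up
pool.join()   # blocks until the handler and worker threads have finished
print(f'teardown took {(time() - start)*1000:.2f}ms')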

Armed with that knowledge, I can change the code to sleep for 0.095 seconds, and then everything is within 10% of each other. Also, since the teardown only happens once per pool, it's easy to arrange for it not to happen in an inner loop.
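Concretely, the tweak described above is just shaving the worker's sleep to sit under the handler's polling interval (a sketch of the change, applied to the `worker` above; not necessarily the exact code used):

def worker(x):
    # sleep just under _handle_workers' 100ms polling interval, so the
    # handler thread notices the finished work on its very next check
    sleep(0.095)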