How to download multiple large files concurrently in Python?
I am trying to download a series of WARC files from the CommonCrawl database, each roughly 25 MB. This is my script:
import json
import urllib.request
from urllib.error import HTTPError

from src.Util import rooted

with open(rooted('data/alexa.txt'), 'r') as alexa:
    for i, url in enumerate(alexa):
        if i % 1000 == 0:  # sample every 1000th URL
            try:
                # Query the CommonCrawl index for captures of this URL
                request = 'http://index.commoncrawl.org/CC-MAIN-2018-13-index?url={search}*&output=json' \
                    .format(search=url.rstrip())
                page = urllib.request.urlopen(request)
                for line in page:
                    result = json.loads(line)
                    # Blocks until the WARC file is fully downloaded
                    urllib.request.urlretrieve('https://commoncrawl.s3.amazonaws.com/%s' % result['filename'],
                                               rooted('data/warc/%s' % ''.join(c for c in result['url'] if c.isalnum())))
            except HTTPError:
                pass
What this currently does is request a link to the WARC file through the CommonCrawl REST API and then start the download into the 'data/warc' folder.
The problem is that each urllib.request.urlretrieve() call blocks until the file has finished downloading before the next download request is issued. Is there any way to return from the urllib.request.urlretrieve() call as soon as the download has been dispatched and let the file download in the background, or to somehow spin up a new thread for each request and download all the files concurrently?
Thanks
Use threads, futures even :)
from concurrent.futures import ThreadPoolExecutor

jobs = []
with ThreadPoolExecutor(max_workers=100) as executor:
    for line in page:
        result = json.loads(line)
        future = executor.submit(urllib.request.urlretrieve,
                                 'https://commoncrawl.s3.amazonaws.com/%s' % result['filename'],
                                 rooted('data/warc/%s' % ''.join(c for c in result['url'] if c.isalnum())))
        jobs.append(future)
...
for f in jobs:
    print(f.result())  # blocks until that download finishes; re-raises any exception
Read more here: https://docs.python.org/3/library/concurrent.futures.html
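Putting it together with the loop from the question, a minimal sketch might look like the following. It assumes the same rooted() helper and index query as the original script; the fetch() helper name and the max_workers value are hypothetical choices, not anything prescribed by concurrent.futures:

import json
import urllib.request
from urllib.error import HTTPError
from concurrent.futures import ThreadPoolExecutor, as_completed

from src.Util import rooted  # same helper as in the question

def fetch(filename, page_url):
    # Hypothetical helper: runs inside a worker thread, so urlretrieve
    # only blocks its own thread while the other downloads proceed.
    return urllib.request.urlretrieve(
        'https://commoncrawl.s3.amazonaws.com/%s' % filename,
        rooted('data/warc/%s' % ''.join(c for c in page_url if c.isalnum())))

jobs = []
with ThreadPoolExecutor(max_workers=16) as executor:  # arbitrary pool size; tune to your bandwidth
    with open(rooted('data/alexa.txt'), 'r') as alexa:
        for i, url in enumerate(alexa):
            if i % 1000 != 0:
                continue
            try:
                request = ('http://index.commoncrawl.org/CC-MAIN-2018-13-index'
                           '?url={search}*&output=json').format(search=url.rstrip())
                page = urllib.request.urlopen(request)
            except HTTPError:
                continue
            for line in page:
                result = json.loads(line)
                jobs.append(executor.submit(fetch, result['filename'], result['url']))

    # as_completed yields each future as its download finishes
    for f in as_completed(jobs):
        try:
            f.result()
        except HTTPError:
            pass  # one failed download should not abort the rest

Because urlretrieve spends nearly all of its time waiting on the network, threads parallelize well here despite the GIL; the pool size simply caps how many simultaneous connections you open against the S3 bucket.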