如何在 Python 2 中下载大文件
How to download large files in Python 2
我正在尝试使用 mechanize 模块下载大文件(大约 1GB),但一直没有成功。我一直在搜索类似的线程,但我只找到了那些文件可以公开访问并且无需登录即可获取文件的线程。但这不是我的情况,因为文件位于私人部分,我需要在下载前登录。这是我到目前为止所做的。
import mechanize
g_form_id = ""
def is_form_found(form1):
return "id" in form1.attrs and form1.attrs['id'] == g_form_id
def select_form_with_id_using_br(br1, id1):
global g_form_id
g_form_id = id1
try:
br1.select_form(predicate=is_form_found)
except mechanize.FormNotFoundError:
print "form not found, id: " + g_form_id
exit()
url_to_login = "https://example.com/"
url_to_file = "https://example.com/download/files/filename=fname.exe"
local_filename = "fname.exe"
br = mechanize.Browser()
br.set_handle_robots(False) # ignore robots
br.set_handle_refresh(False) # can sometimes hang without this
br.addheaders = [('User-agent', 'Firefox')]
response = br.open(url_to_login)
# Find login form
select_form_with_id_using_br(br, 'login-form')
# Fill in data
br.form['email'] = 'email@domain.com'
br.form['password'] = 'password'
br.set_all_readonly(False) # allow everything to be written to
br.submit()
# Try to download file
br.retrieve(url_to_file, local_filename)
但我在下载 512MB 时遇到错误:
Traceback (most recent call last):
File "dl.py", line 34, in <module>
br.retrieve(br.retrieve(url_to_file, local_filename)
File "C:\Python27\lib\site-packages\mechanize\_opener.py", line 277, in retrieve
block = fp.read(bs)
File "C:\Python27\lib\site-packages\mechanize\_response.py", line 199, in read
self.__cache.write(data)
MemoryError: out of memory
你有什么解决办法吗?
谢谢
分块尝试downloading/writing。好像文件占用了你所有的记忆。
如果服务器支持,您应该为您的请求指定范围header。
您可以使用 bs4 and requests 登录然后写入 streamed 内容。需要一些表单字段,包括绝对必要的 _token_
字段:
from bs4 import BeautifulSoup
import requests
from urlparse import urljoin
data = {'email': 'email@domain.com', 'password': 'password'}
base = "https://support.codasip.com"
with requests.Session() as s:
# update headers
s.headers.update({'User-agent': 'Firefox'})
# use bs4 to parse the from fields
soup = BeautifulSoup(s.get(base).content)
form = soup.select_one("#frm-loginForm")
# works as it is a relative path. Not always the case.
action = form["action"]
# Get rest of the fields, ignore password and email.
for inp in form.find_all("input", {"name":True,"value":True}):
name, value = inp["name"], inp["value"]
if name not in data:
data[name] = value
# login
s.post(urljoin(base, action), data=data)
# get protected url
with open(local_filename, "wb") as f:
for chk in s.get(url_to_file, stream=True).iter_content(1024):
f.write(chk)
我正在尝试使用 mechanize 模块下载大文件(大约 1GB),但一直没有成功。我一直在搜索类似的线程,但我只找到了那些文件可以公开访问并且无需登录即可获取文件的线程。但这不是我的情况,因为文件位于私人部分,我需要在下载前登录。这是我到目前为止所做的。
import mechanize
g_form_id = ""
def is_form_found(form1):
return "id" in form1.attrs and form1.attrs['id'] == g_form_id
def select_form_with_id_using_br(br1, id1):
global g_form_id
g_form_id = id1
try:
br1.select_form(predicate=is_form_found)
except mechanize.FormNotFoundError:
print "form not found, id: " + g_form_id
exit()
url_to_login = "https://example.com/"
url_to_file = "https://example.com/download/files/filename=fname.exe"
local_filename = "fname.exe"
br = mechanize.Browser()
br.set_handle_robots(False) # ignore robots
br.set_handle_refresh(False) # can sometimes hang without this
br.addheaders = [('User-agent', 'Firefox')]
response = br.open(url_to_login)
# Find login form
select_form_with_id_using_br(br, 'login-form')
# Fill in data
br.form['email'] = 'email@domain.com'
br.form['password'] = 'password'
br.set_all_readonly(False) # allow everything to be written to
br.submit()
# Try to download file
br.retrieve(url_to_file, local_filename)
但我在下载 512MB 时遇到错误:
Traceback (most recent call last):
File "dl.py", line 34, in <module>
br.retrieve(br.retrieve(url_to_file, local_filename)
File "C:\Python27\lib\site-packages\mechanize\_opener.py", line 277, in retrieve
block = fp.read(bs)
File "C:\Python27\lib\site-packages\mechanize\_response.py", line 199, in read
self.__cache.write(data)
MemoryError: out of memory
你有什么解决办法吗? 谢谢
分块尝试downloading/writing。好像文件占用了你所有的记忆。
如果服务器支持,您应该为您的请求指定范围header。
您可以使用 bs4 and requests 登录然后写入 streamed 内容。需要一些表单字段,包括绝对必要的 _token_
字段:
from bs4 import BeautifulSoup
import requests
from urlparse import urljoin
data = {'email': 'email@domain.com', 'password': 'password'}
base = "https://support.codasip.com"
with requests.Session() as s:
# update headers
s.headers.update({'User-agent': 'Firefox'})
# use bs4 to parse the from fields
soup = BeautifulSoup(s.get(base).content)
form = soup.select_one("#frm-loginForm")
# works as it is a relative path. Not always the case.
action = form["action"]
# Get rest of the fields, ignore password and email.
for inp in form.find_all("input", {"name":True,"value":True}):
name, value = inp["name"], inp["value"]
if name not in data:
data[name] = value
# login
s.post(urljoin(base, action), data=data)
# get protected url
with open(local_filename, "wb") as f:
for chk in s.get(url_to_file, stream=True).iter_content(1024):
f.write(chk)