Python requests or urllib read timeout: URL encoding issue?
I am trying to download a file from Python. I have tried both urllib and requests, and both give me a timeout error. The file is at: http://www.prociv.pt/cnos/HAI/Setembro/Incêndios%20Rurais%20-%20Histórico%20do%20Dia%2029SET.pdf
Using requests:
r = requests.get('http://www.prociv.pt/cnos/HAI/Setembro/Incêndios%20Rurais%20-%20Histórico%20do%20Dia%2029SET.pdf', timeout=60.0)
Using urllib:
urllib.urlretrieve('http://www.prociv.pt/cnos/HAI/Setembro/Incêndios%20Rurais%20-%20Histórico%20do%20Dia%2029SET.pdf', 'the.pdf')
I have tried several spellings of the URL:
- http://www.prociv.pt/cnos/HAI/Setembro/Incêndios Rurais - Histórico do Dia 29SET.pdf
- http://www.prociv.pt/cnos/HAI/Setembro/Inc%C3%AAndios%20Rurais%20-%20Hist%C3%B3rico%20do%20Dia%2029SET.pdf
- http://www.prociv.pt/cnos/HAI/Setembro/Incêndios%20Rurais%20-%20Histórico%20do%20Dia%2029SET.pdf
I can download the file with a browser, and also with cURL using the following syntax:
curl http://www.prociv.pt/cnos/HAI/Setembro/Inc%C3%AAndios%20Rurais%20-%20Hist%C3%B3rico%20do%20Dia%2029SET.pdf
So I suspect this is an encoding issue, but I can't seem to get it to work. Any suggestions?
EDIT: Clarity.
It looks like the server responds differently depending on the client's User-Agent. If you specify a custom User-Agent header, the server responds with the PDF:
import requests
import shutil

url = 'http://www.prociv.pt/cnos/HAI/Setembro/Inc%C3%AAndios%20Rurais%20-%20Hist%C3%B3rico%20do%20Dia%2028SET.pdf'
headers = {'User-Agent': 'curl'}  # wink-wink

# stream=True defers fetching the body until it is read
response = requests.get(url, headers=headers, stream=True)
if response.status_code == 200:
    with open('result.pdf', 'wb') as output:
        # have urllib3 undo any gzip/deflate Content-Encoding while reading
        response.raw.decode_content = True
        shutil.copyfileobj(response.raw, output)
Demo:
>>> import requests
>>> url = 'http://www.prociv.pt/cnos/HAI/Setembro/Inc%C3%AAndios%20Rurais%20-%20Hist%C3%B3rico%20do%20Dia%2028SET.pdf'
>>> headers = {'User-Agent': 'curl'} # wink-wink
>>> response = requests.get(url, headers=headers, stream=True)
>>> response.headers['content-type']
'application/pdf'
>>> response.headers['content-length']
'466191'
>>> response.raw.read(100)
'%PDF-1.5\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n<</Type/Catalog/Pages 2 0 R/Lang(pt-PT) /StructTreeRoot 37 0 R/MarkInfo<</'
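Since the question also tried urllib, here is a rough standard-library equivalent of the same idea. It is only a sketch and assumes Python 3, where the relevant module is `urllib.request`. The helper names `build_request` and `download` are made up for illustration, and the `'curl'` User-Agent value is simply the one used above.

```python
import shutil
import urllib.request


def build_request(url, user_agent="curl"):
    """Return a Request object that carries a custom User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def download(url, dest, timeout=60.0):
    """Stream the response body to dest and return its Content-Type header.

    urlopen raises urllib.error.HTTPError for error statuses, so there is
    no explicit status check here.
    """
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        with open(dest, "wb") as output:
            shutil.copyfileobj(response, output)
        return response.headers.get("Content-Type")
```

Calling `download(url, 'result.pdf')` with the percent-encoded URL from the snippet above should write the file, provided the server still treats that User-Agent the same way. Pass the already percent-encoded form of the URL, since `urllib.request` expects an ASCII URL.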
My guess is that someone abused a Python script to download too many files from that server, and you are being tar-pitted based solely on the User-Agent header.
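As a sanity check on the encoding theory from the question, the three URL spellings that were tried can be reduced to one percent-encoded form. This is a minimal sketch assuming Python 3. `normalize` is a hypothetical helper, not part of requests or urllib, and it is not a claim about how either library prepares URLs internally.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def normalize(url):
    """Percent-encode the path as UTF-8, leaving existing %XX escapes alone."""
    parts = urlsplit(url)
    # '%' is kept safe so that an existing '%20' is not re-encoded to '%2520'.
    path = quote(parts.path, safe="/%")
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


base = "http://www.prociv.pt/cnos/HAI/Setembro/"
variants = [
    base + "Incêndios Rurais - Histórico do Dia 29SET.pdf",
    base + "Inc%C3%AAndios%20Rurais%20-%20Hist%C3%B3rico%20do%20Dia%2029SET.pdf",
    base + "Incêndios%20Rurais%20-%20Histórico%20do%20Dia%2029SET.pdf",
]

# A set collapses identical results, so one element means one canonical URL.
normalized = {normalize(u) for u in variants}
```

Here `normalized` ends up with a single element, the same percent-encoded URL that worked with cURL. That is consistent with the User-Agent, rather than the URL spelling, deciding how the server responds.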