requests.get returns an HTTPSConnectionPool error in Python
The code below should return 200, but it fails for some domains.
import requests
url1 = 'https://www.pontofrio.com.br/'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '
'AppleWebKit/537.11 (KHTML, like Gecko) '
'Chrome/23.0.1271.64 Safari/537.11',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
'Accept-Encoding': 'none',
'Accept-Language': 'en-US,en;q=0.8',
'Connection': 'keep-alive'}
response = requests.get(url1, headers, timeout=10)
print(response.status_code)
Output:
Traceback (most recent call last):
File "C:\Python34\lib\site-packages\urllib3\connectionpool.py", line 384, in _make_request
six.raise_from(e, None)
File "<string>", line 2, in raise_from
File "C:\Python34\lib\site-packages\urllib3\connectionpool.py", line 380, in _make_request
httplib_response = conn.getresponse()
File "C:\Python34\lib\http\client.py", line 1148, in getresponse
response.begin()
File "C:\Python34\lib\http\client.py", line 352, in begin
version, status, reason = self._read_status()
File "C:\Python34\lib\http\client.py", line 314, in _read_status
line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
File "C:\Python34\lib\socket.py", line 371, in readinto
return self._sock.recv_into(b)
File "C:\Python34\lib\site-packages\urllib3\contrib\pyopenssl.py", line 309, in recv_into
return self.recv_into(*args, **kwargs)
File "C:\Python34\lib\site-packages\urllib3\contrib\pyopenssl.py", line 307, in recv_into
raise timeout('The read operation timed out')
socket.timeout: The read operation timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Python34\lib\site-packages\requests\adapters.py", line 449, in send
timeout=timeout
File "C:\Python34\lib\site-packages\urllib3\connectionpool.py", line 638, in urlopen
_stacktrace=sys.exc_info()[2])
File "C:\Python34\lib\site-packages\urllib3\util\retry.py", line 367, in increment
raise six.reraise(type(error), error, _stacktrace)
File "C:\Python34\lib\site-packages\urllib3\packages\six.py", line 686, in reraise
raise value
File "C:\Python34\lib\site-packages\urllib3\connectionpool.py", line 600, in urlopen
chunked=chunked)
File "C:\Python34\lib\site-packages\urllib3\connectionpool.py", line 386, in _make_request
self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
File "C:\Python34\lib\site-packages\urllib3\connectionpool.py", line 306, in _raise_timeout
raise ReadTimeoutError(self, url, "Read timed out. (read timeout=%s)" % timeout_value)
urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='www.pontofrio.com.br', port=443): Read timed out. (read timeout=10)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "c:/teste.py", line 219, in <module>
url = montaurl(dominio)
File "c:/teste.py", line 81, in montaurl
response = requests.get(url1, headers, timeout=10)
File "C:\Python34\lib\site-packages\requests\api.py", line 75, in get
return request('get', url, params=params, **kwargs)
File "C:\Python34\lib\site-packages\requests\api.py", line 60, in request
return session.request(method=method, url=url, **kwargs)
File "C:\Python34\lib\site-packages\requests\sessions.py", line 533, in request
resp = self.send(prep, **send_kwargs)
File "C:\Python34\lib\site-packages\requests\sessions.py", line 646, in send
r = adapter.send(request, **kwargs)
File "C:\Python34\lib\site-packages\requests\adapters.py", line 529, in send
raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='www.pontofrio.com.br', port=443): Read timed out. (read timeout=10)
Domains that work:
Domains that do not work:
- casasbahia.com.br
- extra.com.br
- boticario.com.br
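I believe it is some kind of block on the pontofrio server. How can I work around this?
I tried accessing the page with wget, without success. The problem seems to be that the server only responds to HTTP/2 requests.
Testing with curl:
This one times out: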
$ curl --http1.1 -A "Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/81.0" "https://www.pontofrio.com.br/"
# times out
This one succeeds (note the --http2 parameter):
$ curl --http2 -A "Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/81.0" "https://www.pontofrio.com.br/"
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml">
...
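The curl comparison above can be complemented with a check from Python's standard library. The following is only a sketch: it asks the server which protocol it selects during TLS ALPN negotiation when both h2 and http/1.1 are offered. The helper names make_alpn_context and negotiated_protocol are made up for this illustration, and the result only shows what the server negotiates, not whether it would actually answer an HTTP/1.1 request.

```python
import socket
import ssl


def make_alpn_context(protocols=("h2", "http/1.1")):
    """Build a default TLS context that advertises the given ALPN protocols."""
    context = ssl.create_default_context()
    context.set_alpn_protocols(list(protocols))
    return context


def negotiated_protocol(host, port=443, timeout=10):
    """Return the ALPN protocol the server selects, or None if it selects none."""
    context = make_alpn_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.selected_alpn_protocol()


# Example call (requires network access; the result depends on the server):
# print(negotiated_protocol("www.pontofrio.com.br"))
```

A return value of "h2" would mean the server agreed to speak HTTP/2 on that connection.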
Unfortunately, the requests module does not support HTTP/2. However, you can use the httpx module, which has experimental HTTP/2 support:
import httpx
import asyncio

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:88.0) Gecko/20100101 Firefox/88.0",
}

async def get_text(url):
    async with httpx.AsyncClient(http2=True, headers=headers) as client:
        r = await client.get(url)
        return r.text

txt = asyncio.run(get_text("https://www.pontofrio.com.br/"))
print(txt)
Prints:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml">
...
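To install the httpx module with HTTP/2 support, use for example pip install httpx[http2]
There appear to be several problems. The first is the way the headers are set. The call below does not actually pass the custom headers to requests.get: the second positional parameter of requests.get is params, so the dictionary is sent as query-string parameters instead.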
response = requests.get(url1, headers, timeout=10)
This can be tested against httpbin:
import requests
url1 = 'https://httpbin.org/headers'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '
'AppleWebKit/537.11 (KHTML, like Gecko) '
'Chrome/23.0.1271.64 Safari/537.11',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
'Accept-Encoding': 'none',
'Accept-Language': 'en-US,en;q=0.8',
'Connection': 'keep-alive'
}
response = requests.get(url1, headers, timeout=10)
print(response.text)
print(response.status_code)
Output:
{
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate",
"Host": "httpbin.org",
"User-Agent": "python-requests/2.25.1",
"X-Amzn-Trace-Id": "Root=1-608a0391-3f1cfa79444ac04865ad9111"
}
}
200
To set the headers parameter correctly:
response = requests.get(url1, headers=headers, timeout=10)
Let's test it:
import requests
url1 = 'https://httpbin.org/headers'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '
'AppleWebKit/537.11 (KHTML, like Gecko) '
'Chrome/23.0.1271.64 Safari/537.11',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
'Accept-Encoding': 'none',
'Accept-Language': 'en-US,en;q=0.8',
'Connection': 'keep-alive'
}
response = requests.get(url1, headers=headers, timeout=10)
print(response.text)
print(response.status_code)
This is the output:
{
"headers": {
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Charset": "ISO-8859-1,utf-8;q=0.7,*;q=0.3",
"Accept-Encoding": "none",
"Accept-Language": "en-US,en;q=0.8",
"Host": "httpbin.org",
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
"X-Amzn-Trace-Id": "Root=1-608a0533-40c8281f5faa85d1050c6b6a"
}
}
200
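The difference between the positional and the keyword call can also be checked without any network access. The sketch below prepares both requests without sending them; the URL https://example.com/ and the short headers dictionary are placeholders chosen for the illustration.

```python
import inspect

import requests

headers = {"User-Agent": "demo-agent", "Accept": "text/html"}

# The second positional parameter of requests.get is `params`, not `headers`.
print(list(inspect.signature(requests.get).parameters))

# Prepare (but do not send) both variants to see where the dict ends up.
as_params = requests.Request("GET", "https://example.com/", params=headers).prepare()
as_headers = requests.Request("GET", "https://example.com/", headers=headers).prepare()

print(as_params.url)       # the dict is encoded into the query string
print(as_params.headers)   # the custom headers are not set
print(as_headers.url)      # the URL is left unchanged
print(as_headers.headers)  # the custom headers are present
```

Passing the dictionary as the second positional argument is equivalent to the params variant, which is why httpbin reported the default python-requests User-Agent.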
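Finally, the order of the headers, and the 'Connection': 'keep-alive' header in particular, was causing the problem. After I reordered the headers and removed the Connection header, it started working for all URLs.
This is the code I used to test: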
import requests
urls = ['https://www.pontofrio.com.br/',
'https://www.casasbahia.com.br',
'https://www.extra.com.br',
'https://www.boticario.com.br']
headers = {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language': 'en-US,en;q=0.9',
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4491.0 Safari/537.36'}
for url1 in urls:
    print("Trying url: %s" % url1)
    response = requests.get(url1, headers=headers, timeout=10)
    print(response.status_code)
And the output:
Trying url: https://www.pontofrio.com.br/
200
Trying url: https://www.casasbahia.com.br
200
Trying url: https://www.extra.com.br
200
Trying url: https://www.boticario.com.br
200
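A loop like the one above stops at the first domain that times out, because requests.get raises an exception. The following sketch, which is not part of the original answer, catches the timeout so the remaining URLs are still tried. The function name status_or_none and its get parameter (which exists only to make the function easy to test) are assumptions for this illustration.

```python
import requests


def status_or_none(url, headers=None, timeout=10, get=requests.get):
    """Return the HTTP status code, or None if the request times out."""
    try:
        response = get(url, headers=headers, timeout=timeout)
    except requests.exceptions.Timeout as exc:
        # ReadTimeout and ConnectTimeout are both subclasses of Timeout.
        print("Timed out fetching %s: %s" % (url, exc))
        return None
    return response.status_code
```

Inside the loop, the call would then be status_or_none(url1, headers=headers) in place of the direct requests.get call. This only keeps the script running; it does not make a blocked request succeed.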