requests.get returns an HTTPSConnectionPool error in Python

Question

The code below should return 200, but it fails for some domains.

import requests
url1 = 'https://www.pontofrio.com.br/'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '
                         'AppleWebKit/537.11 (KHTML, like Gecko) '
                         'Chrome/23.0.1271.64 Safari/537.11',
           'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
           'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
           'Accept-Encoding': 'none',
           'Accept-Language': 'en-US,en;q=0.8',
           'Connection': 'keep-alive'}
response = requests.get(url1, headers, timeout=10)
print(response.status_code)

Output:

Traceback (most recent call last):
  File "C:\Python34\lib\site-packages\urllib3\connectionpool.py", line 384, in _make_request
    six.raise_from(e, None)
  File "<string>", line 2, in raise_from
  File "C:\Python34\lib\site-packages\urllib3\connectionpool.py", line 380, in _make_request
    httplib_response = conn.getresponse()
  File "C:\Python34\lib\http\client.py", line 1148, in getresponse
    response.begin()
  File "C:\Python34\lib\http\client.py", line 352, in begin
    version, status, reason = self._read_status()
  File "C:\Python34\lib\http\client.py", line 314, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "C:\Python34\lib\socket.py", line 371, in readinto
    return self._sock.recv_into(b)
  File "C:\Python34\lib\site-packages\urllib3\contrib\pyopenssl.py", line 309, in recv_into
    return self.recv_into(*args, **kwargs)
  File "C:\Python34\lib\site-packages\urllib3\contrib\pyopenssl.py", line 307, in recv_into
    raise timeout('The read operation timed out')
socket.timeout: The read operation timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Python34\lib\site-packages\requests\adapters.py", line 449, in send
    timeout=timeout
  File "C:\Python34\lib\site-packages\urllib3\connectionpool.py", line 638, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "C:\Python34\lib\site-packages\urllib3\util\retry.py", line 367, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "C:\Python34\lib\site-packages\urllib3\packages\six.py", line 686, in reraise
    raise value
  File "C:\Python34\lib\site-packages\urllib3\connectionpool.py", line 600, in urlopen
    chunked=chunked)
  File "C:\Python34\lib\site-packages\urllib3\connectionpool.py", line 386, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "C:\Python34\lib\site-packages\urllib3\connectionpool.py", line 306, in _raise_timeout
    raise ReadTimeoutError(self, url, "Read timed out. (read timeout=%s)" % timeout_value)
urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='www.pontofrio.com.br', port=443): Read timed out. (read timeout=10)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "c:/teste.py", line 219, in <module>
    url = montaurl(dominio)
  File "c:/teste.py", line 81, in montaurl
    response = requests.get(url1, headers, timeout=10)
  File "C:\Python34\lib\site-packages\requests\api.py", line 75, in get
    return request('get', url, params=params, **kwargs)
  File "C:\Python34\lib\site-packages\requests\api.py", line 60, in request
    return session.request(method=method, url=url, **kwargs)
  File "C:\Python34\lib\site-packages\requests\sessions.py", line 533, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\Python34\lib\site-packages\requests\sessions.py", line 646, in send
    r = adapter.send(request, **kwargs)
  File "C:\Python34\lib\site-packages\requests\adapters.py", line 529, in send
    raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='www.pontofrio.com.br', port=443): Read timed out. (read timeout=10)

Domains that work:

https://www.pichau.com.br/

Domains that fail:

casasbahia.com.br
extra.com.br
boticario.com.br

I believe it is some kind of block on the pontofrio server. How can I solve this?

Answer 1

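I tried to access the page with wget, without success. The problem seems to be that the server only responds to HTTP/2 requests.

Testing with curl:

This one times out: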

$ curl --http1.1 -A "Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/81.0" "https://www.pontofrio.com.br/"

# times out

This one succeeds (note the --http2 parameter):

$ curl --http2 -A "Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/81.0" "https://www.pontofrio.com.br/"

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml">

...
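
Another way to see which protocol the server is willing to negotiate, without curl, is to look at the ALPN result of the TLS handshake. The following is a minimal sketch using only the standard library; the helper names build_alpn_context and negotiated_protocol are made up for this illustration. It only shows what the server selects when both h2 and http/1.1 are offered, so it does not by itself prove that HTTP/1.1 requests are ignored.

```python
import socket
import ssl


def build_alpn_context(protocols=("h2", "http/1.1")):
    """Create a default TLS client context that offers the given ALPN protocols."""
    context = ssl.create_default_context()
    context.set_alpn_protocols(list(protocols))
    return context


def negotiated_protocol(host, port=443, timeout=10):
    """Return the ALPN protocol chosen by the server, e.g. 'h2', or None if none was chosen."""
    context = build_alpn_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.selected_alpn_protocol()


# Example call (needs network access):
# negotiated_protocol("www.pontofrio.com.br")
```

A result of h2 would be consistent with the curl test above, since it means the server agrees to speak HTTP/2 on that connection.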

Unfortunately, the requests module does not support HTTP/2. However, you can use the httpx module, which has experimental HTTP/2 support:

import httpx
import asyncio
    
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:88.0) Gecko/20100101 Firefox/88.0",
}

async def get_text(url):
    async with httpx.AsyncClient(http2=True, headers=headers) as client:
        r = await client.get(url)
        return r.text


txt = asyncio.run(get_text("https://www.pontofrio.com.br/"))
print(txt)

Prints:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml">


...

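To install the httpx module with HTTP/2 support, use, for example, pip install httpx[http2] (in shells that expand square brackets, such as zsh, quote the argument: pip install 'httpx[http2]').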
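
If the surrounding code is synchronous, as in the question, the same idea should also work with httpx.Client instead of AsyncClient. This is a sketch, assuming httpx is installed with the http2 extra; fetch_status and HEADERS are names made up here, and it is untested against the sites in the question.

```python
HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:88.0) Gecko/20100101 Firefox/88.0",
}


def fetch_status(url, timeout=10):
    """Return (status_code, http_version) for url, allowing HTTP/2 when the server offers it."""
    # Imported inside the function so the sketch can be loaded before httpx is installed.
    import httpx

    with httpx.Client(http2=True, headers=HEADERS, timeout=timeout) as client:
        response = client.get(url)
        return response.status_code, response.http_version


# Example call (needs network access):
# fetch_status("https://www.pontofrio.com.br/")
```

The second element of the returned tuple is the protocol version string reported by httpx, which makes it easy to confirm whether HTTP/2 was really used for the request.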

Answer 2

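There seem to be a few problems. The first is how the headers are set. The call below does not actually pass the custom headers to requests.get, because the second positional argument of requests.get is params, not headers: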

response = requests.get(url1, headers, timeout=10)

This can be tested against httpbin:

import requests    
url1 = 'https://httpbin.org/headers'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) ' 
                'AppleWebKit/537.11 (KHTML, like Gecko) '
                'Chrome/23.0.1271.64 Safari/537.11',
                'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
                'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
                'Accept-Encoding': 'none',
                'Accept-Language': 'en-US,en;q=0.8',
                'Connection': 'keep-alive'
}
response = requests.get(url1, headers, timeout=10)
print(response.text)
print(response.status_code)

Output:

{
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.25.1", 
    "X-Amzn-Trace-Id": "Root=1-608a0391-3f1cfa79444ac04865ad9111"
  }
}

200
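
The default python-requests User-Agent in the output above follows from the signature of requests.get, which is get(url, params=None, **kwargs): a dict passed as the second positional argument is bound to params (the query string), not to headers. The sketch below uses a simplified stand-in function with the same parameter layout, not the real library, just to show where the argument ends up.

```python
def get(url, params=None, **kwargs):
    """Simplified stand-in that mirrors the parameter layout of requests.get."""
    return {"url": url, "params": params, "headers": kwargs.get("headers")}


headers = {"User-Agent": "Mozilla/5.0"}

# Positional: the dict is bound to params, so no custom headers are set.
positional = get("https://httpbin.org/headers", headers, timeout=10)

# Keyword: the dict is bound to headers, as intended.
keyword = get("https://httpbin.org/headers", headers=headers, timeout=10)

print(positional["params"])   # the headers dict ended up here
print(positional["headers"])  # None
print(keyword["headers"])     # the headers dict
```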

To set the headers argument correctly:

response = requests.get(url1, headers=headers, timeout=10)

Let's test it:

import requests    
url1 = 'https://httpbin.org/headers'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) ' 
                'AppleWebKit/537.11 (KHTML, like Gecko) '
                'Chrome/23.0.1271.64 Safari/537.11',
                'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
                'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
                'Accept-Encoding': 'none',
                'Accept-Language': 'en-US,en;q=0.8',
                'Connection': 'keep-alive'
}
response = requests.get(url1, headers=headers, timeout=10)
print(response.text)
print(response.status_code)

Here is the output:

{
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", 
    "Accept-Charset": "ISO-8859-1,utf-8;q=0.7,*;q=0.3", 
    "Accept-Encoding": "none", 
    "Accept-Language": "en-US,en;q=0.8", 
    "Host": "httpbin.org", 
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11", 
    "X-Amzn-Trace-Id": "Root=1-608a0533-40c8281f5faa85d1050c6b6a"
  }
}

200

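Finally, the order of the headers, and the 'Connection': 'keep-alive' header in particular, was causing the problem. After I reordered the headers and removed the Connection header, it started working for all the URLs.

Here is the code I used to test: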

import requests    
urls = ['https://www.pontofrio.com.br/', 
        'https://www.casasbahia.com.br', 
        'https://www.extra.com.br', 
        'https://www.boticario.com.br']
headers = {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
                  'Accept-Encoding': 'gzip, deflate, br',
                  'Accept-Language': 'en-US,en;q=0.9',
                  'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4491.0 Safari/537.36'}
for url1 in urls:
    print("Trying url: %s"% url1)
    response = requests.get(url1, headers=headers, timeout=10)
    print(response.status_code)

And the output:

Trying url: https://www.pontofrio.com.br/
200
Trying url: https://www.casasbahia.com.br
200
Trying url: https://www.extra.com.br
200
Trying url: https://www.boticario.com.br
200
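
When checking many domains, a single timeout like the one in the question aborts the whole loop. A sketch of a more tolerant loop follows; check_urls and the fetch callable are hypothetical names for this example. It relies on requests.exceptions.RequestException being a subclass of OSError (IOError), so catching OSError covers requests timeouts as well as plain socket errors.

```python
def check_urls(urls, fetch, timeout=10):
    """Call fetch(url, timeout) for every URL and collect status codes or error names."""
    results = {}
    for url in urls:
        try:
            results[url] = fetch(url, timeout)
        except OSError as exc:
            # Record the failure and continue with the next domain.
            results[url] = type(exc).__name__
    return results


# With requests installed, fetch could be (headers as defined in the test code above):
#   lambda url, timeout: requests.get(url, headers=headers, timeout=timeout).status_code
```

Passing the fetch function in as an argument keeps the loop independent of the HTTP client, so the same helper could be used with requests or with an HTTP/2-capable client.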
