What's the meaning of pool_connections in requests.adapters.HTTPAdapter?

When initializing a requests Session, two HTTPAdapter objects will be created and mounted to http and https.

HTTPAdapter is defined like this:

class requests.adapters.HTTPAdapter(pool_connections=10, pool_maxsize=10,
                                    max_retries=0, pool_block=False)

While I understand the meaning of pool_maxsize (which is the number of sessions a pool can save), I don't understand what pool_connections means or what it does. The doc says:

Parameters: 
pool_connections – The number of urllib3 connection pools to cache.

但是 "to cache" 是什么意思?使用多个连接池有什么意义?

Requests uses urllib3 to manage its connections and other features.

Re-using connections is an important factor in keeping recurring HTTP requests performant. The urllib3 README explains:

Why do I want to reuse connections?

Performance. When you normally do a urllib call, a separate socket connection is created with each request. By reusing existing sockets (supported since HTTP 1.1), the requests will take up less resources on the server's end, and also provide a faster response time at the client's end. [...]

To answer your question, "pool_maxsize" is the number of connections to keep around per host (this is useful for multi-threaded applications), whereas "pool_connections" is the number of host-pools to keep around. For example, if you're connecting to 100 different hosts, and pool_connections=10, then only the latest 10 hosts' connections will be re-used.
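
As a minimal sketch of how you'd tune both knobs on a Session (the numbers here are arbitrary, just for illustration):

import requests
from requests.adapters import HTTPAdapter

s = requests.Session()
# Cache pools for up to 100 distinct hosts; within each host's pool,
# keep up to 10 reusable connections (relevant for multi-threaded use).
s.mount('https://', HTTPAdapter(pool_connections=100, pool_maxsize=10))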

I wrote an article about this. Pasted here:

Requests' secret: pool_connections and pool_maxsize

Requests is one of the most well-known Python third-party libraries among Python programmers, if not the most well-known. Thanks to its simple API and high performance, people tend to use requests rather than urllib2, which the standard library provides for HTTP requests. However, people who use requests every day may not know its internals; today I'd like to introduce two of them: pool_connections and pool_maxsize.

Let's start with Session:

import requests

s = requests.Session()
s.get('https://www.google.com')

Pretty simple. You may know that requests' Session can persist cookies. Cool. But do you know Session has a mount method?

mount(prefix, adapter)
Registers a connection adapter to a prefix.
Adapters are sorted in descending order by key length.

No? Well, in fact you've already used this method when you initialized a Session object:

class Session(SessionRedirectMixin):

    def __init__(self):
        ...
        # Default connection adapters.
        self.adapters = OrderedDict()
        self.mount('https://', HTTPAdapter())
        self.mount('http://', HTTPAdapter())

Here comes the interesting part. If you've read Ian Cordasco's article Retries in Requests, you should know that HTTPAdapter can be used to provide retry functionality. But what is an HTTPAdapter really? Quote from the doc:

class requests.adapters.HTTPAdapter(pool_connections=10, pool_maxsize=10, max_retries=0, pool_block=False)

The built-in HTTP Adapter for urllib3.

Provides a general-case interface for Requests sessions to contact HTTP and HTTPS urls by implementing the Transport Adapter interface. This class will usually be created by the Session class under the covers.

Parameters:
* pool_connections – The number of urllib3 connection pools to cache.
* pool_maxsize – The maximum number of connections to save in the pool.
* max_retries (int) – The maximum number of retries each connection should attempt. Note, this applies only to failed DNS lookups, socket connections and connection timeouts, never to requests where data has made it to the server. By default, Requests does not retry failed connections. If you need granular control over the conditions under which we retry a request, import urllib3's Retry class and pass that instead.
* pool_block – Whether the connection pool should block for connections.

Usage:

>>> import requests
>>> s = requests.Session()
>>> a = requests.adapters.HTTPAdapter(max_retries=3)
>>> s.mount('http://', a)

If the documentation above confuses you, here's my explanation: what an HTTP Adapter does is simply provide different configurations for different requests according to the target url. Remember the code above?

self.mount('https://', HTTPAdapter())
self.mount('http://', HTTPAdapter())

It creates two HTTPAdapter objects with the default arguments pool_connections=10, pool_maxsize=10, max_retries=0, pool_block=False, and mounts them to https:// and http:// respectively, which means the first HTTPAdapter() will be used when you try to send a request to http://xxx, while the second HTTPAdapter() will be used for requests to https://xxx. Though the two configurations are identical in this case, requests to http and https are still handled separately. We'll see what that means later.
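
As a quick illustrative sketch (the retry counts here are made up), you could mount differently configured adapters per scheme, and requests will pick one based on the URL prefix:

import requests
from requests.adapters import HTTPAdapter

s = requests.Session()
s.mount('http://', HTTPAdapter(max_retries=1))   # used for http://... requests
s.mount('https://', HTTPAdapter(max_retries=3))  # used for https://... requests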

As I said, the main purpose of this article is to explain pool_connections and pool_maxsize.

Let's look at pool_connections first. Yesterday I asked a question on Stack Overflow because I wasn't sure whether my understanding was correct; the answer eliminated my uncertainty. As we all know, HTTP is based on the TCP protocol. An HTTP connection is also a TCP connection, which is identified by a tuple of five values:

(<protocol>, <src addr>, <src port>, <dest addr>, <dest port>)

Say you've established an HTTP/TCP connection with www.example.com. Assuming the server supports Keep-Alive, the next time you send a request to www.example.com/a or www.example.com/b, you could just use the same connection, since none of the five values change. In fact, requests' Session automatically does this for you and will reuse connections as long as it can.

The question is, what determines whether you can reuse an old connection or not? Yes, pool_connections!

pool_connections – The number of urllib3 connection pools to cache.

I know, I know, I don't want to introduce so many terminologies either; this is the last one, I promise. For easy understanding, one connection pool corresponds to one host. That's what it is.
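
Under the hood, this cache of host pools is urllib3's PoolManager; HTTPAdapter passes pool_connections to it as num_pools. A minimal sketch using urllib3 directly (the hosts are just examples):

import urllib3

# num_pools=1: only the most recently used host's pool is cached.
http = urllib3.PoolManager(num_pools=1)
http.request('GET', 'https://www.example.com/')
http.request('GET', 'https://www.example.org/')  # evicts the example.com pool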

Here's an example (unrelated lines are omitted):

s = requests.Session()
s.mount('https://', HTTPAdapter(pool_connections=1))
s.get('https://www.baidu.com')
s.get('https://www.zhihu.com')
s.get('https://www.baidu.com')

"""output
INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): www.baidu.com
DEBUG:requests.packages.urllib3.connectionpool:"GET / HTTP/1.1" 200 None
INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): www.zhihu.com
DEBUG:requests.packages.urllib3.connectionpool:"GET / HTTP/1.1" 200 2621
INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): www.baidu.com
DEBUG:requests.packages.urllib3.connectionpool:"GET / HTTP/1.1" 200 None
"""

HTTPAdapter(pool_connections=1) is mounted to https://, which means only one connection pool persists at a time. After calling s.get('https://www.baidu.com'), the cached connection pool is connectionpool('https://www.baidu.com'). Now s.get('https://www.zhihu.com') comes along, and the session finds that it cannot use the previously cached connection because it's not the same host (one connection pool corresponds to one host, remember?). Therefore the session has to create a new connection pool, or connection if you like. Since pool_connections=1, the session cannot hold two connection pools at the same time, so it discards the old connectionpool('https://www.baidu.com') and keeps the new connectionpool('https://www.zhihu.com'). The same goes for the next get. That's why we see three Starting new HTTPS connection entries in the logging.

What if we set pool_connections to 2:

s = requests.Session()
s.mount('https://', HTTPAdapter(pool_connections=2))
s.get('https://www.baidu.com')
s.get('https://www.zhihu.com')
s.get('https://www.baidu.com')
"""output
INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): www.baidu.com
DEBUG:requests.packages.urllib3.connectionpool:"GET / HTTP/1.1" 200 None
INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): www.zhihu.com
DEBUG:requests.packages.urllib3.connectionpool:"GET / HTTP/1.1" 200 2623
DEBUG:requests.packages.urllib3.connectionpool:"GET / HTTP/1.1" 200 None
"""

Great, now we only created connections twice and saved one connection-establishing time.

Finally, pool_maxsize.

First of all, you should only care about pool_maxsize if you use a Session in a multithreaded environment, e.g. making concurrent requests from multiple threads using the same Session.

Actually, pool_maxsize is an argument for initializing urllib3's HTTPConnectionPool, which is exactly the connection pool we mentioned above. HTTPConnectionPool is a container for a collection of connections to a specific host, and pool_maxsize is the number of connections to save that can be reused. If you're running your code in one thread, it is neither possible nor needed to create multiple connections to the same host, because the requests library is blocking, so HTTP requests are always sent one after another.
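
To see that container in isolation, here's a minimal urllib3 sketch (the host and size are arbitrary):

import urllib3

# A pool of connections to one specific host; up to 2 idle connections
# are saved for reuse.
pool = urllib3.HTTPConnectionPool('www.example.com', maxsize=2)
r = pool.request('GET', '/')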

Things are different when there are multiple threads.

import requests
from requests.adapters import HTTPAdapter
from threading import Thread

def thread_get(url):
    s.get(url)

s = requests.Session()
s.mount('https://', HTTPAdapter(pool_connections=1, pool_maxsize=2))
t1 = Thread(target=thread_get, args=('https://www.zhihu.com',))
t2 = Thread(target=thread_get, args=('https://www.zhihu.com/question/36612174',))
t1.start();t2.start()
t1.join();t2.join()
t3 = Thread(target=thread_get, args=('https://www.zhihu.com/question/39420364',))
t4 = Thread(target=thread_get, args=('https://www.zhihu.com/question/21362402',))
t3.start();t4.start()
"""output
INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): www.zhihu.com
INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (2): www.zhihu.com
DEBUG:requests.packages.urllib3.connectionpool:"GET /question/36612174 HTTP/1.1" 200 21906
DEBUG:requests.packages.urllib3.connectionpool:"GET / HTTP/1.1" 200 2606
DEBUG:requests.packages.urllib3.connectionpool:"GET /question/21362402 HTTP/1.1" 200 57556
DEBUG:requests.packages.urllib3.connectionpool:"GET /question/39420364 HTTP/1.1" 200 28739
"""

See? It established two connections for the same host www.zhihu.com; as I said, this can only happen in a multithreaded environment. In this case, we created a connection pool with pool_maxsize=2, and there were never more than two connections at the same time, so it was enough. We can see that the requests from t3 and t4 did not create new connections; they reused the old ones.

What if the size isn't enough?

s = requests.Session()
s.mount('https://', HTTPAdapter(pool_connections=1, pool_maxsize=1))
t1 = Thread(target=thread_get, args=('https://www.zhihu.com',))
t2 = Thread(target=thread_get, args=('https://www.zhihu.com/question/36612174',))
t1.start()
t2.start()
t1.join();t2.join()
t3 = Thread(target=thread_get, args=('https://www.zhihu.com/question/39420364',))
t4 = Thread(target=thread_get, args=('https://www.zhihu.com/question/21362402',))
t3.start();t4.start()
t3.join();t4.join()
"""output
INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): www.zhihu.com
INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (2): www.zhihu.com
DEBUG:requests.packages.urllib3.connectionpool:"GET /question/36612174 HTTP/1.1" 200 21906
DEBUG:requests.packages.urllib3.connectionpool:"GET / HTTP/1.1" 200 2606
WARNING:requests.packages.urllib3.connectionpool:Connection pool is full, discarding connection: www.zhihu.com
INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (3): www.zhihu.com
DEBUG:requests.packages.urllib3.connectionpool:"GET /question/39420364 HTTP/1.1" 200 28739
DEBUG:requests.packages.urllib3.connectionpool:"GET /question/21362402 HTTP/1.1" 200 57556
WARNING:requests.packages.urllib3.connectionpool:Connection pool is full, discarding connection: www.zhihu.com
"""

Now, with pool_maxsize=1, the warning came as expected:

Connection pool is full, discarding connection: www.zhihu.com

We can also notice that since only one connection can be saved in this pool, a new connection is created again for t3 or t4. Obviously this is very inefficient in general. That's why urllib3's documentation says:

If you’re planning on using such a pool in a multithreaded environment, you should set the maxsize of the pool to a higher number, such as the number of threads.
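
Translated to requests, a hedged sketch of that advice (the thread count is arbitrary) looks like:

import requests
from requests.adapters import HTTPAdapter

NUM_THREADS = 4

s = requests.Session()
# Size the per-host pool to match the number of threads sharing the Session.
s.mount('https://', HTTPAdapter(pool_connections=1, pool_maxsize=NUM_THREADS))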

Last but not least, HTTPAdapter instances mounted to different prefixes are independent.

s = requests.Session()
s.mount('https://', HTTPAdapter(pool_connections=1, pool_maxsize=2))
s.mount('https://baidu.com', HTTPAdapter(pool_connections=1, pool_maxsize=1))
t1 = Thread(target=thread_get, args=('https://www.zhihu.com',))
t2 = Thread(target=thread_get, args=('https://www.zhihu.com/question/36612174',))
t1.start();t2.start()
t1.join();t2.join()
t3 = Thread(target=thread_get, args=('https://www.zhihu.com/question/39420364',))
t4 = Thread(target=thread_get, args=('https://www.zhihu.com/question/21362402',))
t3.start();t4.start()
t3.join();t4.join()
"""output
INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): www.zhihu.com
INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (2): www.zhihu.com
DEBUG:requests.packages.urllib3.connectionpool:"GET /question/36612174 HTTP/1.1" 200 21906
DEBUG:requests.packages.urllib3.connectionpool:"GET / HTTP/1.1" 200 2623
DEBUG:requests.packages.urllib3.connectionpool:"GET /question/39420364 HTTP/1.1" 200 28739
DEBUG:requests.packages.urllib3.connectionpool:"GET /question/21362402 HTTP/1.1" 200 57669
"""

The code above is easy to understand, so I won't explain it.

I guess that's all. Hope this article helps you understand requests better. BTW, I created a gist here which contains all of the testing code used in this article. Just download it and play with it :)

Appendix

  1. For https, requests uses urllib3's HTTPSConnectionPool, but it's pretty much the same as HTTPConnectionPool, so I don't differentiate them in this article.
  2. Session's mount method ensures the longest prefix gets matched first. Its implementation is pretty interesting, so I posted it here.

    def mount(self, prefix, adapter):
        """Registers a connection adapter to a prefix.
        Adapters are sorted in descending order by key length."""
        self.adapters[prefix] = adapter
        keys_to_move = [k for k in self.adapters if len(k) < len(prefix)]
        for key in keys_to_move:
            self.adapters[key] = self.adapters.pop(key)
    

Note that self.adapters is an OrderedDict.
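
Here's a small sketch of that longest-prefix matching in action; Session.get_adapter is the lookup requests performs internally (the host is just an example):

import requests
from requests.adapters import HTTPAdapter

s = requests.Session()
special = HTTPAdapter(pool_maxsize=4)
s.mount('https://www.example.com', special)

# The longer prefix beats the default 'https://' adapter.
assert s.get_adapter('https://www.example.com/a') is special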

Thanks to @laike9m for the existing Q&A and article, but the existing answers fail to mention the subtleties of pool_maxsize and its relation to multi-threaded code.

Summary

  • pool_connections is the number of connections from one (host, port, scheme) endpoint that can be kept alive in the pool at a given time. If you want to keep around a maximum of n open TCP connections in a pool for reuse with a Session, you want pool_connections=n.
  • pool_maxsize is effectively irrelevant for users of requests, because the default value of pool_block (in requests.adapters.HTTPAdapter) is False rather than True.

Details

As correctly pointed out here, pool_connections is the maximum number of open connections given the adapter's prefix. It's best illustrated through example:

>>> import requests
>>> from requests.adapters import HTTPAdapter
>>> 
>>> from urllib3 import add_stderr_logger
>>> 
>>> add_stderr_logger()  # Turn on requests.packages.urllib3 logging
2018-12-21 20:44:03,979 DEBUG Added a stderr logging handler to logger: urllib3
<StreamHandler <stderr> (NOTSET)>
>>> 
>>> s = requests.Session()
>>> s.mount('https://', HTTPAdapter(pool_connections=1))
>>> 
>>> # 4 consecutive requests to (github.com, 443, https)
... # A new HTTPS (TCP) connection will be established only on the first conn.
... s.get('https://github.com/requests/requests/blob/master/requests/adapters.py')
2018-12-21 20:44:03,982 DEBUG Starting new HTTPS connection (1): github.com:443
2018-12-21 20:44:04,381 DEBUG https://github.com:443 "GET /requests/requests/blob/master/requests/adapters.py HTTP/1.1" 200 None
<Response [200]>
>>> s.get('https://github.com/requests/requests/blob/master/requests/packages.py')
2018-12-21 20:44:04,548 DEBUG https://github.com:443 "GET /requests/requests/blob/master/requests/packages.py HTTP/1.1" 200 None
<Response [200]>
>>> s.get('https://github.com/urllib3/urllib3/blob/master/src/urllib3/__init__.py')
2018-12-21 20:44:04,881 DEBUG https://github.com:443 "GET /urllib3/urllib3/blob/master/src/urllib3/__init__.py HTTP/1.1" 200 None
<Response [200]>
>>> s.get('https://github.com/python/cpython/blob/master/Lib/logging/__init__.py')
2018-12-21 20:44:06,533 DEBUG https://github.com:443 "GET /python/cpython/blob/master/Lib/logging/__init__.py HTTP/1.1" 200 None
<Response [200]>

Above, the maximum number of connections is 1; it is to (github.com, 443, https). If you want to request a resource from a new (host, port, scheme) triple, the Session will internally dump the existing connection to make room for a new one:

>>> s.get('https://www.rfc-editor.org/info/rfc4045')
2018-12-21 20:46:11,340 DEBUG Starting new HTTPS connection (1): www.rfc-editor.org:443
2018-12-21 20:46:12,185 DEBUG https://www.rfc-editor.org:443 "GET /info/rfc4045 HTTP/1.1" 200 6707
<Response [200]>
>>> s.get('https://www.rfc-editor.org/info/rfc4046')
2018-12-21 20:46:12,667 DEBUG https://www.rfc-editor.org:443 "GET /info/rfc4046 HTTP/1.1" 200 6862
<Response [200]>
>>> s.get('https://www.rfc-editor.org/info/rfc4047')
2018-12-21 20:46:13,837 DEBUG https://www.rfc-editor.org:443 "GET /info/rfc4047 HTTP/1.1" 200 6762
<Response [200]>

You can increase the number to pool_connections=2 and then cycle through 3 unique host combinations, and you'll see the same thing happening. (One other thing to notice is that the session will retain and send back cookies in this same manner.)
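
For instance, a sketch of that experiment (the hosts are placeholders):

import requests
from requests.adapters import HTTPAdapter

s = requests.Session()
s.mount('https://', HTTPAdapter(pool_connections=2))

# 3 unique hosts but only 2 cached pools: evictions recur on every lap,
# so a "Starting new HTTPS connection" line is logged each time.
for url in ('https://example.com', 'https://example.org', 'https://example.net') * 2:
    s.get(url)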

Now for pool_maxsize, which is passed to urllib3.poolmanager.PoolManager and ultimately to urllib3.connectionpool.HTTPSConnectionPool. The docstring for maxsize is:

Number of connections to save that can be reused. More than 1 is useful in multithreaded situations. If block is set to False, more connections will be created but they will not be saved once they've been used.

顺便说一句,block=FalseHTTPAdapter 的默认值,尽管 HTTPConnectionPool 的默认值是 True。这意味着 pool_maxsizeHTTPAdapter.

几乎没有影响
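
If you do want pool_maxsize to act as a hard cap, a hedged sketch would be to opt in to blocking yourself (the numbers are arbitrary):

import requests
from requests.adapters import HTTPAdapter

s = requests.Session()
# With pool_block=True, at most 4 connections per host are created;
# additional threads wait for a free connection instead of opening more.
s.mount('https://', HTTPAdapter(pool_maxsize=4, pool_block=True))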

Additionally, requests.Session() is not thread-safe; you shouldn't use the same session instance from multiple threads. (See here and here.) If you really want to, the safer way to go would be to lend each thread its own localized session instance, then use that session to make requests over multiple URLs, via threading.local():

import threading
import requests

local = threading.local()  # values will be different for separate threads.

vars(local)  # initially empty; a blank class with no attrs.


def get_or_make_session(**adapter_kwargs):
    # `local` will effectively vary based on the thread that is calling it
    print('get_or_make_session() called from id:', threading.get_ident())

    if not hasattr(local, 'session'):
        session = requests.Session()
        adapter = requests.adapters.HTTPAdapter(**adapter_kwargs)
        session.mount('http://', adapter)
        session.mount('https://', adapter)
        local.session = session
    return local.session
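
A hypothetical usage sketch (the worker count and URL are placeholders):

from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Each worker thread lazily creates, then reuses, its own Session.
    session = get_or_make_session(pool_connections=1, pool_maxsize=1)
    return session.get(url).status_code

with ThreadPoolExecutor(max_workers=4) as executor:
    statuses = list(executor.map(fetch, ['https://www.example.com'] * 8))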