How should I investigate a memory leak when using Google Cloud Datastore Python libraries?

Question

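I have a web application that uses Google Datastore and runs out of memory after handling enough requests.

I have narrowed this down to a Datastore query. A minimal PoC is provided below; a slightly longer version that includes memory measurement is on GitHub.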

from google.cloud import datastore
from google.oauth2 import service_account

def test_datastore(entity_type: str) -> None:
    # A new client is created on every call; this is what the loop below exercises.
    creds = service_account.Credentials.from_service_account_file("/path/to/creds")
    client = datastore.Client(credentials=creds, project="my-project")
    query = client.query(kind=entity_type, namespace="my-namespace")
    query.keys_only()
    # Fetch at most one key; memory grows even when nothing is returned.
    for result in query.fetch(1):
        print(f"[+] Got a result: {result}")

for n in range(100):
    test_datastore("my-entity-type")

Profiling the process RSS shows growth of roughly 1 MiB per iteration. This happens even when no results are returned. Here is the output from my GitHub gist:

[+] Iteration 0, memory usage 38.9 MiB bytes
[+] Iteration 1, memory usage 45.9 MiB bytes
[+] Iteration 2, memory usage 46.8 MiB bytes
[+] Iteration 3, memory usage 47.6 MiB bytes
..
[+] Iteration 98, memory usage 136.3 MiB bytes
[+] Iteration 99, memory usage 137.1 MiB bytes
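
A per-iteration RSS measurement of the kind that produces output like the above can be sketched as follows. This is not the code from the gist; it is a minimal, Unix-only sketch using only the standard library, and the names rss_bytes and measure are hypothetical. On Linux it reads the current RSS from /proc/self/statm; elsewhere on Unix it falls back to the peak RSS from the resource module, which can only ever increase.

```python
import os
import sys


def rss_bytes() -> int:
    """Return the resident set size in bytes (current on Linux, peak on other Unix)."""
    try:
        # Second field of /proc/self/statm is the resident page count.
        with open("/proc/self/statm") as f:
            resident_pages = int(f.read().split()[1])
        return resident_pages * os.sysconf("SC_PAGE_SIZE")
    except FileNotFoundError:
        import resource  # not available on Windows

        peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux/BSD.
        return peak if sys.platform == "darwin" else peak * 1024


def measure(workload, iterations: int) -> list:
    """Run workload repeatedly and record RSS after each call."""
    samples = []
    for n in range(iterations):
        workload()
        samples.append(rss_bytes())
        print(f"[+] Iteration {n}, memory usage {samples[-1] / 2**20:.1f} MiB")
    return samples
```

Calling measure(lambda: test_datastore("my-entity-type"), 100) would wrap the PoC above. Because RSS is read from the operating system, it includes memory allocated by native extensions, not just Python objects.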

At the same time, however, Python's mprof shows a flat graph (run as mprof run python datastore_test.py):

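Question

Is there something wrong with the way I'm calling Datastore, or is this likely an underlying problem in a library?

The environment is Python 3.7.4 on Windows 10 (also tested with Python 3.8 on Debian in Docker), with google-cloud-datastore==1.11.0 and grpcio==1.28.1.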

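Edit 1

To clarify, this is not the typical Python allocator behaviour, where memory is requested from the OS but not immediately released from internal arenas/pools. Below is a graph from Kubernetes, where my affected application runs:

This shows:

Linear memory growth until around 2 GiB, at which point the application effectively crashed because it ran out of memory (technically Kubernetes evicted the pod, but that is not relevant here).
The same web application running, but with no interaction with GCP Storage or Datastore.
Interaction with GCP Storage only added (very slight growth over time, possibly normal).
Interaction with GCP Datastore only added (much larger memory growth, roughly 512 MiB per hour). The Datastore query is the same as the one in this post.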

Edit 2

To be absolutely sure about Python's own memory usage, I checked the state of the garbage collector using gc. Before exiting, the program reports:

gc: done, 15966 unreachable, 0 uncollectable, 0.0156s elapsed

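I also forced garbage collection manually with gc.collect() during each iteration of the loop, which made no difference.

Since there are no uncollectable objects, it seems unlikely that the leak comes from objects allocated through Python's internal memory management. It is therefore more likely that an external C library is leaking memory.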
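
One way to cross-check this reasoning is to measure how much memory Python's own allocator hands out across the loop and compare it with the RSS growth. The following is a minimal sketch using the standard-library tracemalloc module; python_heap_growth is a hypothetical helper name and is not part of the original investigation. If the traced Python allocations stay flat while RSS keeps climbing, the growth is probably happening outside Python's allocator, for example in a C extension.

```python
import gc
import tracemalloc


def python_heap_growth(workload, iterations: int = 20) -> int:
    """Return the change in bytes of traced Python allocations across the loop."""
    tracemalloc.start()
    # Warm-up call so one-off caches and imports are not counted as growth.
    workload()
    gc.collect()
    before, _peak = tracemalloc.get_traced_memory()
    for _ in range(iterations):
        workload()
        # Collect cycles so only memory that is genuinely retained is counted.
        gc.collect()
    after, _peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return after - before
```

Note that tracemalloc only sees allocations made through Python's memory APIs, so memory allocated directly with malloc inside a native library will not appear in this figure.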

Possibly related

There is an open grpc issue that I cannot be sure is related, but it has a number of similarities to my problem.

Answer 1

I have narrowed the memory leak down to the creation of the datastore.Client object.

With the following proof-of-concept code, memory usage does not grow:

from google.cloud import datastore
from google.oauth2 import service_account

def test_datastore(client, entity_type: str) -> None:
    # The client is passed in rather than created on every call.
    query = client.query(kind=entity_type, namespace="my-namespace")
    query.keys_only()
    for result in query.fetch(1):
        print(f"[+] Got a result: {result}")

# Create the credentials and the client once, outside the loop.
creds = service_account.Credentials.from_service_account_file("/path/to/creds")
client = datastore.Client(credentials=creds, project="my-project")

for n in range(100):
    test_datastore(client, "my-entity-type")

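This makes sense for a small script, where the client object can be created once and shared safely between requests.

In many other applications it is harder (or impossible) to pass the client object around safely. I would expect the library to free memory when the client goes out of scope; otherwise this problem could arise in any long-running program.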
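
Where sharing is feasible, one possible way to avoid creating a client per request is a lazily initialised, process-wide holder. The following is only a sketch: SharedClient is a hypothetical helper, not part of google-cloud-datastore, and it assumes the client is safe to share between the threads of your application, which you should verify for your library version and deployment model.

```python
import threading
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class SharedClient(Generic[T]):
    """Create a client on first use and return the same instance afterwards."""

    def __init__(self, factory: Callable[[], T]) -> None:
        self._factory = factory
        self._lock = threading.Lock()
        self._client: Optional[T] = None

    def get(self) -> T:
        # Double-checked locking: the factory runs at most once, even when
        # several threads ask for the client at the same time.
        if self._client is None:
            with self._lock:
                if self._client is None:
                    self._client = self._factory()
        return self._client


# Hypothetical usage with the client from the PoC above:
#   shared = SharedClient(
#       lambda: datastore.Client(credentials=creds, project="my-project"))
#   test_datastore(shared.get(), "my-entity-type")
```

This only works around the growth by constructing the client once; it does not address why memory is retained when a client is discarded.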

Edit 1

I have narrowed this down further to grpc. The environment variable GOOGLE_CLOUD_DISABLE_GRPC can be set (to any value) to disable grpc.
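
If setting the variable in the deployment environment is inconvenient, it can also be set from Python. This is a minimal sketch; it assumes the library reads the variable when it is first imported, so the assignment is placed before any google.cloud import. That ordering requirement is an assumption worth checking against your installed version, and grpc_disabled is a hypothetical helper used only for a sanity check.

```python
import os

# Assumption: the variable is read at import time, so set it before
# google.cloud.datastore is imported anywhere in the process.
os.environ["GOOGLE_CLOUD_DISABLE_GRPC"] = "true"  # any non-empty value


def grpc_disabled() -> bool:
    """Report whether the opt-out variable is set to a non-empty value."""
    return bool(os.environ.get("GOOGLE_CLOUD_DISABLE_GRPC"))


# Only import the library after the variable is in place, e.g.:
#   from google.cloud import datastore
```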

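Once this is set, my application in Kubernetes looks like this:

Further investigation with valgrind suggests the leak may be related to OpenSSL usage in grpc, which I have documented in this ticket on the bug tracker.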

python

memory-leaks

google-cloud-datastore