Memory leak in reading files from Google Cloud Storage at Google App Engine (python)

Below is part of the Python code running on Google App Engine. It uses the cloudstorage client to fetch a file from Google Cloud Storage.

The problem is that every time the code reads a large file (around 10 MB), the memory used by the instance grows linearly. Before long, the process is terminated with "Exceeded soft private memory limit of 128 MB with 134 MB after servicing 40 requests total".

class ReadGSFile(webapp2.RequestHandler):
    def get(self):
        import cloudstorage as gcs

        self.response.headers['Content-Type'] = "file type"
        read_path = "path/to/file"

        with gcs.open(read_path, 'r') as fp:
            # Stream the file to the response in 1 MB chunks.
            buf = fp.read(1000000)
            while buf:
                self.response.out.write(buf)
                buf = fp.read(1000000)
            fp.close()  # redundant: the with block already closes the file

If I comment out the following line, the memory usage of the instance does change (it stops growing), so the problem seems to lie in webapp2.

  self.response.out.write(buf)

Presumably webapp2 releases the memory once the response has finished, but in my code it does not.
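
One way to confirm the per-request growth is to log the instance's memory footprint at the end of the handler. Below is a minimal diagnostic sketch using the App Engine runtime API (the handler name and the logging call are added here for illustration only; they are not part of my original code):

import logging

import webapp2
from google.appengine.api import runtime

class ReadGSFileWithLogging(webapp2.RequestHandler):
    def get(self):
        # ... same chunked read-and-write loop as in the handler above ...
        # memory_usage() reports the instance's current and averaged memory use in MB.
        logging.info("Memory after request: %s", runtime.memory_usage())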

Try clearing the context cache:

from google.appengine.ext import ndb

context = ndb.get_context()
context.clear_cache()  # drop every entity held in the in-context cache

See the documentation here:

With executing long-running queries in background tasks, it's possible for the in-context cache to consume large amounts of memory. This is because the cache keeps a copy of every entity that is retrieved or stored in the current context. To avoid memory exceptions in long-running tasks, you can disable the cache or set a policy that excludes whichever entities are consuming the most memory.
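
A minimal sketch of the two alternatives the documentation mentions, either disabling the in-context cache entirely or excluding a memory-heavy kind through a policy ('LargeEntity' is a placeholder kind name, not something from the code above):

from google.appengine.ext import ndb

context = ndb.get_context()

# Option 1: turn off the in-context cache completely.
context.set_cache_policy(False)

# Option 2: cache everything except a memory-heavy kind.
context.set_cache_policy(lambda key: key.kind() != 'LargeEntity')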

You can also try clearing the webapp2 response object buffer. Insert this line before the while loop:

self.response.clear()

The response buffers all output in memory, then sends the final output when the handler exits. webapp2 does not support streaming data to the client. The clear() method erases the contents of the output buffer, leaving it empty.

Check this link
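
Put together with the handler from the question, the suggestion looks roughly like this (only the clear() call is new; the rest is the original loop):

import cloudstorage as gcs
import webapp2

class ReadGSFile(webapp2.RequestHandler):
    def get(self):
        self.response.headers['Content-Type'] = "file type"
        read_path = "path/to/file"

        with gcs.open(read_path, 'r') as fp:
            self.response.clear()  # empty the response output buffer first
            buf = fp.read(1000000)
            while buf:
                self.response.out.write(buf)
                buf = fp.read(1000000)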

Following the suggestion made by user voscausa in the comments above, I changed how I serve the file downloads: I now use the Blobstore to serve files from Google Cloud Storage. The memory leak problem is solved now.

Reference: https://cloud.google.com/appengine/docs/python/blobstore/#Python_Using_the_Blobstore_API_with_Google_Cloud_Storage

import urllib2

from google.appengine.ext import blobstore
from google.appengine.ext.webapp import blobstore_handlers

class GCSServingHandler(blobstore_handlers.BlobstoreDownloadHandler):
  def get(self):
    read_path = "path/to/gcs file/"  # do not include the leading "/gs/" here; it is added below
    blob_key = blobstore.create_gs_key("/gs/" + read_path)

    f_name = "file name"
    f_type = "file type" # Such as 'text/plain'

    self.response.headers['Content-Type'] = f_type
    self.response.headers['Content-Disposition'] = "attachment; filename=\"%s\";"%f_name
    self.response.headers['Content-Disposition'] += " filename*=utf-8''" + urllib2.quote(f_name.encode("utf8"))

    self.send_blob(blob_key)
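
For completeness, a minimal sketch of how this handler could be wired into a webapp2 application (the '/download' route is a placeholder, not part of the original code):

import webapp2

# Map a URL to the download handler defined above.
app = webapp2.WSGIApplication([
    ('/download', GCSServingHandler),
], debug=True)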

I ran into a similar problem. In my code I downloaded many 1-10 MB files one after another, did some processing on all of them, and then posted the results to the cloud.

I was seeing a severe memory leak and could not get through more than 50-100 downloads in a row.

Not wanting to rewrite the download code to use the Blobstore, I tried one last experiment: calling garbage collection manually after every download:

import gc
gc.collect()  # force a full garbage collection after each download
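
In context it looks roughly like this (file_paths and download_and_process are stand-ins for my actual download and processing code, not real APIs):

import gc

for path in file_paths:         # list of GCS object paths (placeholder)
    download_and_process(path)  # stand-in for the real download + processing step
    gc.collect()                # force a full collection after every file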

The code has now been running for several minutes without any "Exceeded soft private memory limit" errors, and the instance's memory footprint appears to grow at a much slower rate.

Admittedly this may just be luck and the footprint may still be creeping up, but it also drops back at times, and the instance has already handled 2000 requests.