TensorFlow model serving on Google AI Platform online prediction too slow with instance batches

Question

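I'm trying to deploy a TensorFlow model to Google AI Platform for online prediction. I'm running into latency and throughput issues.

For a single image, the model runs in less than 1 second on my machine (with only an Intel Core i7-4790K CPU). I deployed it to AI Platform on a machine with 8 cores and an NVIDIA T4 GPU.

When running the model on AI Platform with that configuration, sending a single image takes a little less than a second. If I start sending many requests, each with one image, the model eventually blocks and stops responding. So I now send a batch of images with each request instead (from 2 to 10, depending on external factors).

The problem is that I expected the batched requests to take almost constant time. When sending 1 image, CPU utilization is about 10% and GPU utilization about 12%. So I expected a batch of 9 images to use ~100% of the hardware and respond in the same ~1 second, but this is not the case. A batch of 7 to 10 images takes anywhere from 15 to 50 seconds to process.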
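One way to test that expectation is to time the same predict call at several batch sizes and compare each latency with the single-image latency. The following is only a minimal sketch, not part of my actual setup: `predict` stands for whatever function sends a batch to the model, `make_instance` builds one input, and `fake_serial_predict` is a hypothetical stand-in that simulates a backend handling instances one after another.

```python
import statistics
import time
from typing import Callable, Dict, Iterable, List


def latency_by_batch_size(
    predict: Callable[[List[object]], object],
    make_instance: Callable[[], object],
    batch_sizes: Iterable[int],
    repeats: int = 3,
) -> Dict[int, float]:
    """Return the median wall-clock latency (seconds) for each batch size."""
    results = {}
    for size in batch_sizes:
        batch = [make_instance() for _ in range(size)]
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            predict(batch)
            timings.append(time.perf_counter() - start)
        results[size] = statistics.median(timings)
    return results


def scaling_ratios(latencies: Dict[int, float]) -> Dict[int, float]:
    """Divide each latency by the latency of the smallest batch size."""
    base = latencies[min(latencies)]
    return {size: value / base for size, value in latencies.items()}


def fake_serial_predict(batch: List[object]) -> None:
    # Hypothetical stand-in: sleeps 10 ms per instance, like a serial backend.
    time.sleep(0.01 * len(batch))


if __name__ == "__main__":
    measured = latency_by_batch_size(fake_serial_predict, lambda: b"image", [1, 3, 9])
    for size, ratio in scaling_ratios(measured).items():
        print(f"batch={size} latency={measured[size]:.3f}s ratio={ratio:.1f}x")
```

A ratio that stays close to 1 would mean the batch is processed in parallel; a ratio that grows with the batch size points to instances being handled one at a time.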

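I have already tried to optimize my model. I was using map_fn and replaced it with manual loops, switched from Float 32 to Float 16, and tried to vectorize the operations as much as possible, but the situation is still the same.

What am I missing here?

I'm using the latest AI Platform runtime for online prediction (Python 3.7, TensorFlow 2.1, CUDA 10.1).

The model is a large version of YOLOv4 (~250MB in SavedModel format). I've built a few postprocessing algorithms in TensorFlow that operate on the output of the model.

Last but not least, I also tried debugging with TensorBoard, and it turns out that the YOLOv4 part of the TensorFlow Graph is taking ~90% of the processing time. I expected this particular part of the model to be highly parallel.

Thanks in advance for any help with this. Please ask me for any information you may need to better understand the problem.

Update 2020-07-13: As suggested in a comment below, I also tried running the model on CPU, but it is really slow and suffers from the same problem as on GPU. It doesn't seem to process the images from a single request in parallel.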

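Also, I think I'm running into problems with TensorFlow Serving because of the rate and number of requests. I used the tensorflow/serving:latest-gpu Docker image locally to test this further. The model answers 3 times faster on my machine (GeForce GTX 1650) than on AI Platform, but the response times are really inconsistent. I get the following response times (<amount of images> <response time in milliseconds>):

Then, after running for a minute, I start getting delays and errors: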

3 27578
3 28563
3 31867
3 18855
{
  message: 'Request failed with status code 504',
  response: {
    data: { error: 'Timed out waiting for notification' },
    status: 504
  }
}
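If the request rate is indeed what overwhelms the server, as suspected above, one client-side mitigation is to cap the number of requests in flight at any moment. This is a rough sketch under that assumption, not what I actually ran: `send_request` is a hypothetical callable that posts one payload to the model server and returns its response, and `fake_send` only simulates it.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def send_all_bounded(
    send_request: Callable[[T], R],
    payloads: Sequence[T],
    max_in_flight: int = 2,
) -> List[R]:
    """Send every payload with at most `max_in_flight` requests active at once.

    Results are returned in the same order as `payloads`. If a request
    raises, the exception is re-raised when its result is collected.
    """
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        return list(pool.map(send_request, payloads))


if __name__ == "__main__":
    def fake_send(batch: List[bytes]) -> int:
        # Hypothetical stand-in for an HTTP call to the model server.
        time.sleep(0.05)
        return len(batch)

    batches = [[b"img"] * n for n in (3, 2, 5, 4)]
    print(send_all_bounded(fake_send, batches, max_in_flight=2))
```

The pool size acts as the concurrency limit, so requests beyond it wait on the client instead of piling up in the server's queue.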

Answer 1

From the Google Cloud documentation:

If you use a simple model and a small set of input instances, you'll find that there is a considerable difference between how long it takes to finish identical prediction requests using online versus batch prediction. It might take a batch job several minutes to complete predictions that are returned almost instantly by an online request. This is a side-effect of the different infrastructure used by the two methods of prediction. AI Platform Prediction allocates and initializes resources for a batch prediction job when you send the request. Online prediction is typically ready to process at the time of request.

As the quote says, this comes down to the difference in node allocation, specifically:

Node allocation for online prediction:

Keeps at least one node ready over a period of several minutes, to handle requests even when there are none to handle. The ready state ensures that the service can serve each prediction promptly.

You can read more about this here.

Answer 2

The model is a large version of YOLOv4 (~250MB in SavedModel format). I've built a few postprocessing algorithms in TensorFlow that operate on the output of the model.

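What postprocessing modifications did you make to YOLOv4? Could the slowdown come from those operations? One test you can run locally to validate this hypothesis is to benchmark an unmodified version of YOLOv4 against the benchmarks you have already made for the modified version.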
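That comparison could be sketched as a small timing harness like the one below. It is only an illustration: `baseline` and `modified` are hypothetical zero-argument callables that each run one inference (unmodified YOLOv4 and your version with postprocessing, respectively), and the warm-up runs are there on the assumption that the first calls include one-time setup cost.

```python
import statistics
import time
from typing import Callable


def median_latency(
    run_once: Callable[[], object], warmup: int = 2, repeats: int = 10
) -> float:
    """Median wall-clock time (seconds) of `run_once`, after warm-up calls."""
    for _ in range(warmup):
        run_once()  # Discarded: first calls may include one-time setup.
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_once()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def postprocessing_overhead(
    baseline: Callable[[], object],
    modified: Callable[[], object],
    warmup: int = 2,
    repeats: int = 10,
) -> float:
    """Fraction of the modified model's latency not explained by the baseline."""
    base = median_latency(baseline, warmup, repeats)
    full = median_latency(modified, warmup, repeats)
    return (full - base) / full


if __name__ == "__main__":
    # Hypothetical stand-ins that only sleep; replace with real inference calls.
    share = postprocessing_overhead(
        lambda: time.sleep(0.01), lambda: time.sleep(0.03), warmup=1, repeats=5
    )
    print(f"postprocessing share of latency: {share:.0%}")
```

A share close to zero would suggest the postprocessing is not the culprit; a large share would point back at those added operations.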

Last but not least, I also tried debugging with TensorBoard, and it turns out that the YOLOv4 part of the TensorFlow Graph is taking ~90% of the processing time. I expected this particular part of the model to be highly parallel.

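It would be interesting to see the "debug output" you mention here. If you use https://www.tensorflow.org/guide/profiler#install_the_profiler_and_gpu_prerequisites, what is the breakdown of the most expensive operations? I have some experience digging into TF ops; in some cases I have found odd bottlenecks caused by CPU <-> GPU data transfers. If you send me a DM, I'd be happy to hop on a call and take a look with you.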

Answer 3

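For anyone running into the same problem as me with AI Platform:

As stated in a comment from the Google Cloud team, AI Platform does not execute a batch of instances at once. They do plan to add the feature, though.

We have since moved from AI Platform to a custom deployment of NVIDIA Triton Inference Server hosted on Google Cloud Compute Engine. We are getting much better performance than we expected, and we can still apply more of the optimizations Triton provides to our model.

Thanks to everyone who tried to help by replying to this answer.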

google-cloud-platform

tensorflow

tensorflow-serving

google-cloud-ml

tensorflow2.x