TensorFlow Serving connection aborts without response

I have a basic TensorFlow Serving Docker container exposing a model from a Kubernetes pod.

FROM tensorflow/serving:2.6.0

RUN mkdir /serving_model
WORKDIR /serving_model
COPY src/serving_model /serving_model

# 5225 is the port all the pods talk to each other on
EXPOSE 5225

ENTRYPOINT tensorflow_model_server --rest_api_port=5225 --model_name=MyModel --model_base_path=/serving_model/

It is called by a Python service running on another pod.

    # Method on the calling service; imports shown here for completeness.
    import json

    import requests
    from requests import Response

    def call_tensorflow_serving(self, docker_pod_url: str, input: dict) -> Response:
        # POST the JSON-encoded payload to TF Serving's REST :predict endpoint
        response = requests.post(
            f"{docker_pod_url}/v1/models/MyModel:predict",
            data=json.dumps(input),
        )
        return response
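
For context, the input dict is assumed to already be in the shape TF Serving's REST predict API expects, i.e. row format under an "instances" key (a columnar "inputs" form also exists). A made-up illustration:

    # Hypothetical payload in TF Serving's row format; the values are made up.
    example_input = {"instances": [[1.0, 2.0, 5.0]]}
    # response = call_tensorflow_serving("http://tensorflow-server:5225", example_input)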

Occasionally this actually succeeds, but most of the time the Python service fails to get a response back from TensorFlow Serving, failing with the following error:

Traceback (most recent call last):
  File "/python-dependencies/lib/python3.8/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/python-dependencies/lib/python3.8/site-packages/urllib3/connectionpool.py", line 449, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/python-dependencies/lib/python3.8/site-packages/urllib3/connectionpool.py", line 444, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/local/lib/python3.8/http/client.py", line 1348, in getresponse
    response.begin()
  File "/usr/local/lib/python3.8/http/client.py", line 316, in begin
    version, status, reason = self._read_status()
  File "/usr/local/lib/python3.8/http/client.py", line 285, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

It looks like the TensorFlow Serving container is timing out and closing the connection. Is there a way to extend the time allowed to complete the request?
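
For reference, tensorflow_model_server has a --rest_api_timeout_in_ms flag (default 30000) that sets the server-side timeout for REST calls, which would be the first thing to raise if the model is genuinely slow. On the client side, here is a minimal defensive sketch with an explicit timeout plus connection retries; the function name, the retry/timeout values, and the assumption that :predict is safe to retry are all mine, not part of the original service:

    import json

    import requests
    from requests.adapters import HTTPAdapter
    from urllib3.util.retry import Retry

    def call_with_retries(docker_pod_url: str, payload: dict) -> requests.Response:
        # Retry dropped connections up to 3 times with backoff. allowed_methods=None
        # lets urllib3 retry POSTs too, which is only safe because :predict is
        # assumed idempotent here.
        session = requests.Session()
        retries = Retry(total=3, backoff_factor=0.5, allowed_methods=None)
        session.mount("http://", HTTPAdapter(max_retries=retries))
        return session.post(
            f"{docker_pod_url}/v1/models/MyModel:predict",
            data=json.dumps(payload),
            timeout=30,  # seconds; fail fast instead of hanging on a dead pod
        )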

Also, once startup is complete, the TensorFlow Serving pod's logs show nothing further:

2022-03-01 16:03:00.345546: I tensorflow_serving/model_server/server.cc:89] Building single TensorFlow model file config:  model_name: MyModel model_base_path: /serving_model/
2022-03-01 16:03:00.348593: I tensorflow_serving/model_server/server_core.cc:465] Adding/updating models.
2022-03-01 16:03:00.348622: I tensorflow_serving/model_server/server_core.cc:591]  (Re-)adding model: MyModel
2022-03-01 16:03:00.449013: I tensorflow_serving/core/basic_manager.cc:740] Successfully reserved resources to load servable {name: MyModel version: 6}
2022-03-01 16:03:00.449051: I tensorflow_serving/core/loader_harness.cc:66] Approving load for servable version {name: MyModel version: 6}
2022-03-01 16:03:00.449064: I tensorflow_serving/core/loader_harness.cc:74] Loading servable version {name: MyModel version: 6}
2022-03-01 16:03:00.449114: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:38] Reading SavedModel from: /serving_model/6
2022-03-01 16:03:01.418230: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:90] Reading meta graph with tags { serve }
2022-03-01 16:03:01.418305: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:132] Reading SavedModel debug info (if present) from: /serving_model/6
2022-03-01 16:03:01.418961: I external/org_tensorflow/tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-03-01 16:03:04.924449: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:211] Restoring SavedModel bundle.
2022-03-01 16:03:08.716223: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:195] Running initialization op on SavedModel bundle at path: /serving_model/6
2022-03-01 16:03:11.024820: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:283] SavedModel load for tags { serve }; Status: success: OK. Took 10573787 microseconds.
2022-03-01 16:03:11.321916: I tensorflow_serving/servables/tensorflow/saved_model_warmup_util.cc:59] No warmup data file found at /serving_model/6/assets.extra/tf_serving_warmup_requests
2022-03-01 16:03:11.816916: I tensorflow_serving/core/loader_harness.cc:87] Successfully loaded servable version {name: MyModel version: 6}
2022-03-01 16:03:11.820605: I tensorflow_serving/model_servers/server_core.cc:486] Finished adding/updating models
2022-03-01 16:03:11.824554: I tensorflow_serving/model_server/server.cc:133] Using InsecureServerCredentials
2022-03-01 16:03:11.824604: I tensorflow_serving/model_server/server.cc:383] Profiler service is enabled
2022-03-01 16:03:11.840760: I tensorflow_serving/model_server/server.cc:409] Running gRPC ModelServer at 0.0.0.0:8500 ...
[warn] getaddrinfo: address family for nodename not supported
2022-03-01 16:03:11.856959: I tensorflow_serving/model_server/server.cc:430] Exporting HTTP/REST API at:localhost:5225 ...
[evhttp_server.cc : 245] NET_LOG: Entering the event loop ...

Is there a way to configure it to log more information?
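
As far as I can tell the request path stays silent at the default log level, but TensorFlow's native C++ logging can be turned up through environment variables that TF Serving inherits. Treat this as an assumption to verify rather than documented Serving behaviour, and note the VLOG variable name has differed between TensorFlow versions:

    # Assumption: raise TensorFlow's native log verbosity for debugging.
    # TF_CPP_MIN_LOG_LEVEL=0 keeps INFO messages; TF_CPP_MAX_VLOG_LEVEL is the
    # VLOG control in recent TF 2.x releases (older releases spelled it differently).
    ENV TF_CPP_MIN_LOG_LEVEL=0
    ENV TF_CPP_MAX_VLOG_LEVEL=3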

================================================================
Additional information

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        app: tensorflow-server
      name: tensorflow-server
      namespace: app-namespace
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: tensorflow-server
      strategy:
        rollingUpdate:
          maxSurge: 50%
          maxUnavailable: 50%
        type: RollingUpdate
      template:
        metadata:
          labels:
            app: tensorflow-server
          name: tensorflow-server
          annotations:
            prometheus.io/scrape: "true"
            prometheus.io/port: "5225"
        spec:
          containers:
            - image: ...
              imagePullPolicy: ...
              name: tensorflow-server
              resources:
                limits:
                  cpu: "100m"
                  memory: "256Mi"
                requests:
                  cpu: "100m"
                  memory: "256Mi"

I finally caught the pod in the act. For a brief moment tensorflow-predictor reported itself as "Killed", then quietly restarted. It turned out the pod did not have enough memory, so as soon as a real query hit it, the container was killed and took tensorflow-predictor down with it.
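
So the problem was resources, not timeouts. The kill is visible after the fact with kubectl describe pod, which reports the previous container state as Terminated with Reason: OOMKilled. The deployment's resources block then needs enough headroom for the loaded SavedModel plus inference; the numbers below are illustrative, not tuned values from my final config:

        resources:
          limits:
            cpu: "500m"        # illustrative; profile the model to size these
            memory: "2Gi"      # must cover the loaded SavedModel plus inference
          requests:
            cpu: "500m"
            memory: "2Gi"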