Tensorflow Serving connection aborts without response
I have a basic TensorFlow Serving Docker container exposing a model on a Kubernetes pod.
FROM tensorflow/serving:2.6.0
RUN mkdir /serving_model
WORKDIR /serving_model
COPY src/serving_model /serving_model
# 5225 is the port all the pods talk to each other on
EXPOSE 5225
ENTRYPOINT tensorflow_model_server --rest_api_port=5225 --model_name=MyModel --model_base_path=/serving_model/
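One incidental note on the Dockerfile: the shell form of ENTRYPOINT wraps the server in /bin/sh, so tensorflow_model_server may never receive the SIGTERM that Kubernetes sends on pod shutdown. The exec form avoids that; a sketch with the same flags:

ENTRYPOINT ["tensorflow_model_server", "--rest_api_port=5225", "--model_name=MyModel", "--model_base_path=/serving_model/"]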
It is called by a Python service running on another pod.
import json

import requests
from requests import Response

def call_tensorflow_serving(self, docker_pod_url: str, input: dict) -> Response:
    response = requests.post(
        f"{docker_pod_url}/v1/models/MyModel:predict",
        data=json.dumps(input),
    )
    return response
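One client-side hardening that makes failures like this easier to see, regardless of the root cause: give requests an explicit timeout and a small retry budget, so a dropped connection fails fast instead of hanging. A sketch, not the original code; the timeout and retry values are illustrative, and allowed_methods requires urllib3 >= 1.26 (older versions call it method_whitelist):

import json

import requests
from requests import Response
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def call_tensorflow_serving_with_retries(docker_pod_url: str, payload: dict) -> Response:
    # Retry transient connection errors a few times with exponential backoff.
    retry = Retry(total=3, backoff_factor=0.5, allowed_methods=["POST"])
    session = requests.Session()
    session.mount("http://", HTTPAdapter(max_retries=retry))
    return session.post(
        f"{docker_pod_url}/v1/models/MyModel:predict",
        data=json.dumps(payload),
        timeout=(3.05, 30),  # (connect, read) timeouts in seconds
    )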
Occasionally this actually succeeds, but most of the time the Python service fails to get a response back from TensorFlow Serving and raises the following error:
Traceback (most recent call last):
  File "/python-dependencies/lib/python3.8/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/python-dependencies/lib/python3.8/site-packages/urllib3/connectionpool.py", line 449, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/python-dependencies/lib/python3.8/site-packages/urllib3/connectionpool.py", line 444, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/local/lib/python3.8/http/client.py", line 1348, in getresponse
    response.begin()
  File "/usr/local/lib/python3.8/http/client.py", line 316, in begin
    version, status, reason = self._read_status()
  File "/usr/local/lib/python3.8/http/client.py", line 285, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response
It looks as if the TensorFlow Serving container times out and closes the connection. Is there any way to extend the time it is allowed to take to complete a request?
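For the timeout question specifically: tensorflow_model_server exposes a --rest_api_timeout_in_ms flag (default 30000) that bounds how long a REST call may take, so the ENTRYPOINT could be extended as below. The 120000 value is purely illustrative:

ENTRYPOINT tensorflow_model_server --rest_api_port=5225 --rest_api_timeout_in_ms=120000 --model_name=MyModel --model_base_path=/serving_model/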
Also, once startup completes, the TensorFlow Serving pod's logs show nothing further:
2022-03-01 16:03:00.345546: I tensorflow_serving/model_servers/server.cc:89] Building single TensorFlow model file config: model_name: MyModel model_base_path: /serving_model/
2022-03-01 16:03:00.348593: I tensorflow_serving/model_servers/server_core.cc:465] Adding/updating models.
2022-03-01 16:03:00.348622: I tensorflow_serving/model_servers/server_core.cc:591] (Re-)adding model: MyModel
2022-03-01 16:03:00.449013: I tensorflow_serving/core/basic_manager.cc:740] Successfully reserved resources to load servable {name: MyModel version: 6}
2022-03-01 16:03:00.449051: I tensorflow_serving/core/loader_harness.cc:66] Approving load for servable version {name: MyModel version: 6}
2022-03-01 16:03:00.449064: I tensorflow_serving/core/loader_harness.cc:74] Loading servable version {name: MyModel version: 6}
2022-03-01 16:03:00.449114: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:38] Reading SavedModel from: /serving_model/6
2022-03-01 16:03:01.418230: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:90] Reading meta graph with tags { serve }
2022-03-01 16:03:01.418305: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:132] Reading SavedModel debug info (if present) from: /serving_model/6
2022-03-01 16:03:01.418961: I external/org_tensorflow/tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-03-01 16:03:04.924449: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:211] Restoring SavedModel bundle.
2022-03-01 16:03:08.716223: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:195] Running initialization op on SavedModel bundle at path: /serving_model/6
2022-03-01 16:03:11.024820: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:283] SavedModel load for tags { serve }; Status: success: OK. Took 10573787 microseconds.
2022-03-01 16:03:11.321916: I tensorflow_serving/servables/tensorflow/saved_model_warmup_util.cc:59] No warmup data file found at /serving_model/6/assets.extra/tf_serving_warmup_requests
2022-03-01 16:03:11.816916: I tensorflow_serving/core/loader_harness.cc:87] Successfully loaded servable version {name: MyModel version: 6}
2022-03-01 16:03:11.820605: I tensorflow_serving/model_servers/server_core.cc:486] Finished adding/updating models
2022-03-01 16:03:11.824554: I tensorflow_serving/model_servers/server.cc:133] Using InsecureServerCredentials
2022-03-01 16:03:11.824604: I tensorflow_serving/model_servers/server.cc:383] Profiler service is enabled
2022-03-01 16:03:11.840760: I tensorflow_serving/model_servers/server.cc:409] Running gRPC ModelServer at 0.0.0.0:8500 ...
[warn] getaddrinfo: address family for nodename not supported
2022-03-01 16:03:11.856959: I tensorflow_serving/model_servers/server.cc:430] Exporting HTTP/REST API at:localhost:5225 ...
[evhttp_server.cc : 245] NET_LOG: Entering the event loop ...
Can it be configured to log more information?
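I'm not aware of a dedicated TF Serving verbosity flag, but the serving binary uses TensorFlow's C++ logger, so the standard TensorFlow logging environment variables should apply (an assumption worth verifying against your build). A sketch in the Dockerfile:

# 0 logs everything from INFO up; mostly useful if the level was raised elsewhere
ENV TF_CPP_MIN_LOG_LEVEL=0
# Per-source-file VLOG levels; "http_server=1" is a guess at a useful module,
# check the TF Serving sources for files that actually contain VLOG statements
ENV TF_CPP_VMODULE=http_server=1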
=================================================================
Additional information
- Cluster type: k3s
- Kubernetes version: 1.20
- TensorFlow yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: tensorflow-server
  name: tensorflow-server
  namespace: app-namespace
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tensorflow-server
  strategy:
    rollingUpdate:
      maxSurge: 50%
      maxUnavailable: 50%
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: tensorflow-server
        name: tensorflow-server
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "5225"
    spec:
      containers:
      - image: ...
        imagePullPolicy: ...
        name: tensorflow-server
        resources:
          limits:
            cpu: "100m"
            memory: "256Mi"
          requests:
            cpu: "100m"
            memory: "256Mi"
I finally caught the pod in the act. For a brief moment, tensorflow-predictor reported itself as "Killed", then silently respawned. It turned out the pod did not have enough memory, so as soon as a real query hit it, tensorflow-predictor inside the container was killed.
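For anyone else hitting this symptom: kubectl describe pod on the affected pod should show the container's Last State as Terminated with Reason: OOMKilled, which is the quick way to confirm it. The fix is to raise the memory request/limit in the Deployment; a minimal sketch, assuming the model fits comfortably in 1Gi (the right numbers depend on your SavedModel):

        resources:
          requests:
            cpu: "500m"
            memory: "1Gi"   # needs headroom for the SavedModel plus per-request tensors
          limits:
            cpu: "1"
            memory: "1Gi"   # exceeding this limit is what triggers the OOM kill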