Invoke endpoint after model deployment: [Err 104] Connection reset by peer

Question

I am new to SageMaker. I deployed a trained TensorFlow model from a JSON file and a weights file. Strangely, my notebook never showed a message saying "Endpoint successfully built". It only showed the following:

--------------------------------------------------------------------------------!

Instead, I looked up the endpoint name in the console.

import sagemaker
from sagemaker.tensorflow.model import TensorFlowModel

# endpoint_name is the name found in the console
predictor = sagemaker.tensorflow.model.TensorFlowPredictor(endpoint_name, sagemaker_session)
data = test_out2
predictor.predict(data)

Then I tried to invoke the endpoint with a 2D array. (1) If my 2D array has shape (5000, 170), I get this error:

ConnectionResetError: [Errno 104] Connection reset by peer

(2) If I shrink the array to (10, 170), the error is:

ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (500) from model with message "". See https://us-east-2.console.aws.amazon.com/cloudwatch/home?region=us-east-2#logEventViewer:group=/aws/sagemaker/Endpoints/sagemaker-tensorflow-2019-04-28-XXXXXXXXX in account 15XXXXXXXX for more information.

Any suggestions? I found a similar case on GitHub: https://github.com/awslabs/amazon-sagemaker-examples/issues/589

Is this the same kind of problem?

Thank you very much!

Answer 1

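The first error, with data of shape (5000, 170), is probably a capacity issue. SageMaker endpoint predictions have a size limit of 5 MB. So if your data is larger than 5 MB, you need to split it into chunks and call predict multiple times.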
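A minimal sketch of that chunking, assuming the request body is JSON and the input is a 2D NumPy array. The helper names `chunk_rows` and `predict_in_chunks` are hypothetical, and `MAX_PAYLOAD_BYTES` simply restates the 5 MB figure from this answer rather than a value checked against current service limits.

```python
import json

import numpy as np

# Limit quoted in this answer; treat it as an assumption and adjust if needed.
MAX_PAYLOAD_BYTES = 5 * 1024 * 1024


def chunk_rows(data, max_bytes=MAX_PAYLOAD_BYTES):
    """Yield consecutive row slices whose JSON encoding fits in max_bytes.

    A single row larger than max_bytes is still yielded on its own.
    """
    data = np.asarray(data)
    start = 0
    size = 2  # the outer "[" and "]"
    for i, row in enumerate(data):
        # Length of the encoded row plus the ", " separator that follows it.
        row_size = len(json.dumps(row.tolist())) + 2
        if i > start and size + row_size > max_bytes:
            yield data[start:i]
            start, size = i, 2
        size += row_size
    if start < len(data):
        yield data[start:]


def predict_in_chunks(predict, data, max_bytes=MAX_PAYLOAD_BYTES):
    """Call predict once per chunk and return the list of responses."""
    return [predict(chunk) for chunk in chunk_rows(data, max_bytes)]
```

With the predictor from the question this could be called as `predict_in_chunks(predictor.predict, test_out2)`. How to merge the per-chunk responses depends on the response format of the deployed model, so the sketch returns them as a plain list.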

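For the second error, with data of shape (10, 170), the error message asks you to look at the logs. Did you find anything interesting in the CloudWatch logs? Is there anything about this issue you can share?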
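If opening the CloudWatch console is inconvenient, the same logs can be pulled from a notebook. The sketch below is one possible way, assuming a boto3 CloudWatch Logs client is passed in and that the log group follows the `/aws/sagemaker/Endpoints/<endpoint name>` pattern visible in the error message above. The function name `endpoint_log_messages` and the 15-minute default window are illustrative choices.

```python
import time


def endpoint_log_messages(logs_client, endpoint_name, minutes=15):
    """Return log messages written by the endpoint in the last `minutes`."""
    kwargs = {
        "logGroupName": f"/aws/sagemaker/Endpoints/{endpoint_name}",
        # CloudWatch Logs expects timestamps in epoch milliseconds.
        "startTime": int((time.time() - minutes * 60) * 1000),
    }
    messages = []
    while True:
        page = logs_client.filter_log_events(**kwargs)
        messages.extend(event["message"] for event in page.get("events", []))
        token = page.get("nextToken")
        if not token:
            return messages
        kwargs["nextToken"] = token  # fetch the next page of results


# Usage (requires AWS credentials and the right region):
#   import boto3
#   for line in endpoint_log_messages(boto3.client("logs"), endpoint_name):
#       print(line)
```

A Python traceback from the serving container in these messages usually says more about a 500 response than the empty message returned to the client.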

Answer 2

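I ran into this problem, and this post helped me solve it. There does seem to be a limit on the size of the dataset the predictor will accept. I am not sure what it is, but in any case I now split my training/test data differently.

I assume there is a limit and that it is based on the raw amount of data. Roughly speaking, that translates to the number of cells in my dataframe, since each cell is probably an integer or a float.

If I can get a 70%/30% split, I use it. But if 30% of the data for testing would exceed the maximum number of cells, I split the data so that the test set has the largest number of rows that still fits under the maximum.

The splitting code is as follows: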

# Check that the test data isn't too big for the predictor.
# model_data is assumed to be a pandas DataFrame loaded earlier.
max_test_cells = 200000
model_rows, model_cols = model_data.shape
print('model_data.shape=', model_data.shape)
max_test_rows = max_test_cells // model_cols
print('max_test_rows=', max_test_rows)
test_rows = min(int(0.3 * model_rows), max_test_rows)
print('actual_test_rows=', test_rows)
training_rows = model_rows - test_rows
print('training_rows=', training_rows)

# Shuffle the rows, then split to get the largest test set possible.
# Plain iloc slicing gives the same split as np.split without needing NumPy.
shuffled = model_data.sample(frac=1, random_state=1729)
train_data = shuffled.iloc[:training_rows]
test_data = shuffled.iloc[training_rows:]
print(train_data.shape, test_data.shape)
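To see what the cap does in practice, the row arithmetic can be isolated in a small function and tried on the 170-column shape from the question. This is only a worked illustration: `split_sizes` is a hypothetical name, and the 200000-cell cap is this answer's own empirical value, not a documented limit.

```python
def split_sizes(n_rows, n_cols, max_test_cells=200000, test_frac=0.3):
    """Return (training_rows, test_rows) using the capped-split rule above."""
    max_test_rows = max_test_cells // n_cols
    test_rows = min(int(test_frac * n_rows), max_test_rows)
    return n_rows - test_rows, test_rows


# 5000 x 170: 30% would be 1500 rows, but the cap allows only 200000 // 170 = 1176.
print(split_sizes(5000, 170))  # (3824, 1176)

# 1000 x 170: 30% is 300 rows, which is under the cap, so the 70/30 split is kept.
print(split_sizes(1000, 170))  # (700, 300)
```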


tensorflow

amazon-sagemaker