Training with TensorFlow on SageMaker: No module named 'tf_container'
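I am trying to train a TensorFlow model on AWS SageMaker.
To do this, I built a container with external libraries, following "Use Your Own Algorithms or Models with Amazon SageMaker".
We launch the training job with the TensorFlow estimator API: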
from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(
    entry_point="entry.py",  # entry script
    role=role,
    framework_version="1.13.0",
    py_version="py3",
    hyperparameters=hyperparameters,
    train_instance_count=1,  # number of training instances to use
    train_instance_type=train_instance_type,
    image_name=my_image,  # custom container image
)
estimator.fit({"train": train_s3, "eval": eval_s3})
The job fails with the following error:
2019-07-23 09:06:45,463 INFO - root - running container entrypoint
2019-07-23 09:06:45,463 INFO - root - starting train task
2019-07-23 09:06:45,476 INFO - container_support.training - Training starting
2019-07-23 09:06:45,479 ERROR - container_support.training - uncaught exception during training: No module named 'tf_container'
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/container_support/environment.py", line 136, in load_framework
    return importlib.import_module('mxnet_container')
  File "/usr/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 994, in _gcd_i
ModuleNotFoundError: No module named 'mxnet_container'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/container_support/training.py", line 35, in start
    fw = TrainingEnvironment.load_framework()
  File "/usr/local/lib/python3.6/dist-packages/container_support/environment.py", line 138, in load_framework
    return importlib.import_module('tf_container')
  File "/usr/lib/python3.6/importlib/__init__.py", line 126,
ModuleNotFoundError: No module named 'tf_container'
What can I do to fix this? How can I debug this situation?
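My guess is that you are using your own TF container rather than the SageMaker one from https://github.com/aws/sagemaker-tensorflow-container
If that is the case, your container is missing the support code (the 'tf_container' package) that the TensorFlow estimator requires.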
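One way to confirm this is to check, inside the container image, whether the modules named in the traceback can be imported at all. The following is a minimal sketch that uses only the Python standard library; the helper name `missing_modules` is hypothetical, and the module names are simply the ones that appear in the log from the question.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    missing = []
    for name in names:
        # find_spec returns None when a top-level module is not on the import path.
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Module names taken from the traceback in the question.
    wanted = ["container_support", "tf_container"]
    print("missing:", missing_modules(wanted))
```

Run it with the container's own Python interpreter. If 'tf_container' shows up in the list, the image does not ship the support code that the estimator expects.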
The solution is to start from the SageMaker container, customize it, push it back to ECR, and pass the image name to the SageMaker estimator through the 'image_name' parameter.
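For the last step, the value passed as 'image_name' is the full ECR image URI. Below is a small sketch of how that string is usually put together, assuming the standard AWS partition (regions whose endpoints end in amazonaws.com). The helper name `ecr_image_uri` is hypothetical, and the account ID, region, repository, and tag are placeholders, not values from the question.

```python
def ecr_image_uri(account_id, region, repository, tag="latest"):
    """Build an ECR image URI (standard AWS partition assumed)."""
    return "{}.dkr.ecr.{}.amazonaws.com/{}:{}".format(
        account_id, region, repository, tag
    )


if __name__ == "__main__":
    # Placeholder values for illustration only.
    my_image = ecr_image_uri(
        "123456789012", "us-east-1", "my-sagemaker-tf", "1.13-custom"
    )
    print(my_image)
```

The resulting string is what the estimator in the question receives as `image_name=my_image`.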