Training a Keras model in AWS SageMaker
I have a Keras training script on my machine, and I am experimenting with running it in an AWS SageMaker container. For this, I used the following code:
from sagemaker.tensorflow import TensorFlow
est = TensorFlow(
    entry_point="caller.py",
    source_dir="./",
    role='role_arn',
    framework_version="2.3.1",
    py_version="py37",
    instance_type='ml.m5.large',
    instance_count=1,
    hyperparameters={'batch': 8, 'epochs': 10},
)
est.fit()
Here, caller.py is my entry point. After executing the code above, I get an error saying keras is not installed. Here is the stack trace:
Traceback (most recent call last):
File "executor.py", line 14, in <module>
est.fit()
File "/home/thasin/Documents/python/venv/lib/python3.8/site-packages/sagemaker/estimator.py", line 682, in fit
self.latest_training_job.wait(logs=logs)
File "/home/thasin/Documents/python/venv/lib/python3.8/site-packages/sagemaker/estimator.py", line 1625, in wait
self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
File "/home/thasin/Documents/python/venv/lib/python3.8/site-packages/sagemaker/session.py", line 3681, in logs_for_job
self._check_job_status(job_name, description, "TrainingJobStatus")
File "/home/thasin/Documents/python/venv/lib/python3.8/site-packages/sagemaker/session.py", line 3240, in _check_job_status
raise exceptions.UnexpectedStatusException(
sagemaker.exceptions.UnexpectedStatusException: Error for Training job tensorflow-training-2021-06-09-07-14-01-778: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
Command "/usr/local/bin/python3.7 caller.py --batch 4 --epochs 10
ModuleNotFoundError: No module named 'keras'
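A minimal sketch of what the entry point could look like, assuming caller.py reads its hyperparameters with argparse. The flag names --batch and --epochs come from the command shown in the log; the helper names parse_args and get_keras are hypothetical. Instead of the standalone keras package that the log reports as missing, it imports Keras through TensorFlow (tf.keras), which ships as part of TensorFlow itself.

```python
import argparse


def parse_args(argv=None):
    """Parse the hyperparameters passed to the entry point (hypothetical helper).

    The training log shows the container running
    `caller.py --batch 4 --epochs 10`, i.e. each key of the estimator's
    `hyperparameters` dict arrives as a `--key value` command-line flag.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("--batch", type=int, default=8)
    parser.add_argument("--epochs", type=int, default=10)
    # parse_known_args tolerates extra flags the container may add
    # (assumption: e.g. the TensorFlow estimator may also pass --model_dir)
    args, _unknown = parser.parse_known_args(argv)
    return args


def get_keras():
    """Return the Keras API bundled with TensorFlow (hypothetical helper).

    `import keras` fails in the container because the standalone package
    is not installed; `tensorflow.keras` is part of TensorFlow itself.
    """
    # Imported lazily so that argument parsing works even without TensorFlow
    from tensorflow import keras
    return keras
```

In caller.py, calling args = parse_args() and keras = get_keras() at the top of the training code would then replace a plain import keras, and args.batch and args.epochs would feed into model.fit.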
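- Which instance comes with Keras pre-installed?
- Is there any way to install Python packages into the AWS container, or is there any workaround for this problem?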
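Note: I have already tried building my own container, uploading it to ECR, and running my code there successfully. I am looking for a way to do this with AWS's existing containers.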
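If the standalone keras package is really needed, one possible workaround that stays within the AWS-provided container is a requirements.txt file placed in source_dir. The SageMaker TensorFlow framework containers (script mode) are documented as pip-installing such a file before launching the entry point, but treat that as an assumption to verify against the SageMaker Python SDK documentation for framework_version 2.3.1. The helper below (write_requirements, hypothetical) only writes that file next to caller.py.

```python
from pathlib import Path


def write_requirements(source_dir, packages):
    """Write requirements.txt into the estimator's source_dir (hypothetical helper).

    Assumption: the SageMaker TensorFlow container pip-installs a
    requirements.txt found in source_dir before running the entry point.
    """
    path = Path(source_dir) / "requirements.txt"
    # One package specifier per line, as pip expects
    path.write_text("\n".join(packages) + "\n")
    return path
```

For example, calling write_requirements("./", ["keras"]) before est.fit() would place the file in the same ./ directory passed as source_dir. Pinning a keras version known to work with TensorFlow 2.3.1 is likely safer than leaving it unpinned.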