AWS Sagemaker KeyError: 'SM_CHANNEL_TRAINING' when tuning hyperparameters
I get this error when I try to run hyperparameter tuning on SageMaker:
UnexpectedStatusException: Error for HyperParameterTuning job imageclassif-job-10-21-47-43: Failed. Reason: No training job succeeded after 5 attempts. Please take a look at the training job failures to get more details.
When I look up the logs in CloudWatch, all 5 failed training jobs end with the same error:
Traceback (most recent call last):
File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/opt/ml/code/train.py", line 117, in <module>
parser.add_argument('--data-dir', type=str, default=os.environ['SM_CHANNEL_TRAINING'])
File "/usr/lib/python3.5/os.py", line 725, in __getitem__
raise KeyError(key) from None
and
KeyError: 'SM_CHANNEL_TRAINING'
The problem occurs at step 4 of the project: https://github.com/petrooha/Deploying-LSTM/blob/main/SageMaker%20Project.ipynb
Any hints on where to look next would be much appreciated.
In your train.py file, changing the environment variable from
parser.add_argument('--data-dir', type=str, default=os.environ['SM_CHANNEL_TRAINING'])
to
parser.add_argument('--data-dir', type=str, default=os.environ['SM_CHANNEL_TRAIN'])
should fix the problem.
This was the case with Torch framework_version 1.3.1, but other versions may be affected as well. See the link for reference.
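As far as I know, SageMaker names each SM_CHANNEL_* variable after the input channel, upper-cased, so a job launched with a channel called 'train' gets SM_CHANNEL_TRAIN and one launched with 'training' gets SM_CHANNEL_TRAINING. If you are not sure which name your job will receive, here is a minimal sketch of a more defensive train.py argument setup. The helper names resolve_data_dir and build_parser are hypothetical (not from the linked notebook), and the two channel names are simply the ones mentioned in this thread:

```python
import argparse
import os


def resolve_data_dir(environ=os.environ):
    """Return the first training-channel directory that is set, or None."""
    # Assumption: the channel is named either 'training' or 'train',
    # the two names that appear in this thread.
    for name in ("SM_CHANNEL_TRAINING", "SM_CHANNEL_TRAIN"):
        value = environ.get(name)
        if value:
            return value
    return None


def build_parser(environ=os.environ):
    """Build the argument parser without raising KeyError at import time."""
    parser = argparse.ArgumentParser()
    default_dir = resolve_data_dir(environ)
    # If neither variable is set, require --data-dir explicitly instead of
    # crashing with KeyError while the parser is being built.
    parser.add_argument(
        "--data-dir",
        type=str,
        default=default_dir,
        required=default_dir is None,
    )
    return parser
```

In train.py you would then call build_parser().parse_args() in place of the original parser setup. Printing every key in os.environ that starts with SM_CHANNEL_ at the top of the script is also a quick way to see in CloudWatch which channel names the job actually received.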