AWS Sagemaker KeyError: 'SM_CHANNEL_TRAINING' when tuning hyperparameters

I get this error when I try to use hyperparameter tuning on SageMaker:

UnexpectedStatusException: Error for HyperParameterTuning job imageclassif-job-10-21-47-43: Failed. Reason: No training job succeeded after 5 attempts. Please take a look at the training job failures to get more details.

When I look up the logs in CloudWatch, all 5 failed training jobs end with the same error:

Traceback (most recent call last):
  File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/ml/code/train.py", line 117, in <module>
    parser.add_argument('--data-dir', type=str, default=os.environ['SM_CHANNEL_TRAINING'])
  File "/usr/lib/python3.5/os.py", line 725, in __getitem__
    raise KeyError(key) from None

KeyError: 'SM_CHANNEL_TRAINING'

The problem occurs at step 4 of the project: https://github.com/petrooha/Deploying-LSTM/blob/main/SageMaker%20Project.ipynb

Any hints on where to go next would be much appreciated.

In your train.py file, changing the environment variable from

parser.add_argument('--data-dir', type=str, default=os.environ['SM_CHANNEL_TRAINING'])

to

parser.add_argument('--data-dir', type=str, default=os.environ['SM_CHANNEL_TRAIN'])

should fix the problem.

This is the case for Torch framework_version 1.3.1, but other versions may also be affected. Here is a link for your reference.
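
If you would rather not hard-code a single variable name, here is a minimal sketch of a more defensive lookup. It assumes the container exposes one SM_CHANNEL_<NAME> variable per input channel, named after the channel passed to fit(), so a channel called 'train' shows up as SM_CHANNEL_TRAIN. The helper name resolve_data_dir and the candidate list are hypothetical and not part of the SageMaker SDK; adjust them to the channel names you actually use. It avoids f-strings because the traceback shows Python 3.5.

```python
import argparse
import os


def resolve_data_dir(environ=os.environ,
                     candidates=("SM_CHANNEL_TRAINING", "SM_CHANNEL_TRAIN")):
    """Return the first candidate channel directory found in `environ`.

    Hypothetical helper: the candidate names are assumptions based on the
    two variable names discussed in this thread.
    """
    for name in candidates:
        if name in environ:
            return environ[name]
    # Nothing matched: list the channel variables that do exist so the
    # CloudWatch log shows what the job was actually given.
    available = sorted(k for k in environ if k.startswith("SM_CHANNEL_"))
    raise KeyError(
        "None of {} is set; available channel variables: {}".format(
            list(candidates), available
        )
    )


def build_parser(environ=os.environ):
    """Build the argument parser with a data-dir default taken from the environment."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--data-dir", type=str,
                        default=resolve_data_dir(environ))
    return parser
```

In train.py you would then call build_parser().parse_args() in place of the original parser setup. If neither variable is set, the KeyError message lists whichever SM_CHANNEL_* variables are present, which should make a channel-name mismatch easier to spot in the logs.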