Terraform Error: error waiting for sagemaker notebook instance to create: unexpected state 'Failed', wanted target 'InService'. last error: %!s(<nil>)

Below is the complete error message I get after running terraform apply in the terraform folder of this source code in my GitHub repo (inspired by this tutorial and its related GitHub repo):

aws_sagemaker_notebook_instance.notebook_instance: Creating...
aws_sagemaker_notebook_instance.notebook_instance: Still creating... [10s elapsed]
aws_sagemaker_notebook_instance.notebook_instance: Still creating... [20s elapsed]
...
aws_sagemaker_notebook_instance.notebook_instance: Still creating... [15m21s elapsed]
aws_sagemaker_notebook_instance.notebook_instance: Still creating... [15m31s elapsed]
╷
│ Error: error waiting for sagemaker notebook instance (aws-sm-notebook-instance) to create: unexpected state 'Failed', wanted target 'InService'. last error: %!s(<nil>)
│
│   with aws_sagemaker_notebook_instance.notebook_instance,
│   on notebook_instance.tf line 2, in resource "aws_sagemaker_notebook_instance" "notebook_instance":
│    2: resource "aws_sagemaker_notebook_instance" "notebook_instance" {
│

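Some internet research seemed to turn up a solution in this article, which suggests increasing the IDLE_TIME allowed in the on-start.sh script to IDLE_TIME=1800 (in seconds, i.e. 30 minutes). That should be enough for a deployment time of roughly 15 minutes; however, the same error was thrown again.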

Next, I found this post on Stack Overflow, which suggests to

run terraform refresh, which will cause Terraform to refresh its state file against what actually exists with the cloud provider.

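Unfortunately, running terraform apply after the refresh did not solve the problem either. I wonder why the IDLE_TIME=1800 setting mentioned above has no effect, since it should be more than enough for a 15-minute apply.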


EDIT: Adding code details for better understanding

1. Creating the SageMaker notebook instance

resource "aws_sagemaker_notebook_instance" "notebook_instance" {
  name                    = "aws-sm-notebook-instance"
  role_arn                = aws_iam_role.notebook_iam_role.arn
  instance_type           = "ml.t2.medium"
  lifecycle_config_name   = aws_sagemaker_notebook_instance_lifecycle_configuration.notebook_config.name
  default_code_repository = aws_sagemaker_code_repository.git_repo.code_repository_name
}

2. Defining the SageMaker notebook lifecycle configuration

resource "aws_sagemaker_notebook_instance_lifecycle_configuration" "notebook_config" {
  name      = "dev-platform-al-sm-lifecycle-config"
  on_create = filebase64("../scripts/on-create.sh")
  on_start  = filebase64("../scripts/on-start.sh")
}

3. Defining the Git repository to instantiate on the SageMaker notebook instance

resource "aws_sagemaker_code_repository" "git_repo" {
  code_repository_name = "aws-sm-notebook-instance-repo"

  git_config {
    repository_url = "https://github.com/AndreasLuckert/aws-sm-notebook-instance.git"
  }
}

Contents of on-start.sh (including the IDLE_TIME parameter). Note that this script fetches the scripts/autostop.py script and schedules it via cron; you can find it here in the associated public repo containing the source code.

#!/bin/bash

set -e

## IDLE AUTOSTOP STEPS
## ----------------------------------------------------------------

## Setting the timeout (in seconds) for how long the SageMaker notebook can run idly before being auto-stopped
# -> e.g. 1800 s = 30 min since first deployment can take between 15 and 20 minutes which could then fail like so:
# "Error: error waiting for sagemaker notebook instance (aws-sm-notebook-instance) to create: unexpected state 'Failed', wanted target 'InService'. last error: %!s(<nil>)"
# Hint for solution under following link: https://yuyasugano.medium.com/machine-learning-infrastructure-terraforming-sagemaker-part-2-f2460a9a4663
IDLE_TIME=1800

# Getting the autostop.py script from GitHub
echo "Fetching the autostop script..."
wget https://raw.githubusercontent.com/andreasluckert/aws-sm-notebook-instance/main/scripts/autostop.py

# Using crontab to autostop the notebook when idle time is breached
echo "Starting the SageMaker autostop script in cron."
(crontab -l 2>/dev/null; echo "*/5 * * * * /usr/bin/python $PWD/autostop.py --time $IDLE_TIME --ignore-connections") | crontab -



## CUSTOM CONDA KERNEL USAGE STEPS
## ----------------------------------------------------------------

# Setting the proper user credentials
sudo -u ec2-user -i <<'EOF'
unset SUDO_UID

# Setting the source for the custom conda kernel
WORKING_DIR=/home/ec2-user/SageMaker/custom-miniconda
source "$WORKING_DIR/miniconda/bin/activate"

# Loading all the custom kernels
for env in $WORKING_DIR/miniconda/envs/*; do
    BASENAME=$(basename "$env")
    source activate "$BASENAME"
    python -m ipykernel install --user --name "$BASENAME" --display-name "Custom ($BASENAME)"
done
EOF

The solution to the problem was to check the CloudWatch log events under CloudWatch -> Log groups -> /aws/sagemaker/NotebookInstances -> aws-sm-notebook-instance/LifecycleConfigOnCreate, where I found the following error message:

/bin/bash: /tmp/OnCreate_2021-09-08-12-24rw5al34g: /bin/bash^M: bad interpreter: No such file or directory
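
The same log events can also be pulled from a terminal instead of the console. The following is only a minimal sketch, assuming the AWS CLI is installed and configured for the account and region of the notebook instance. The helper names log_stream_name and show_lifecycle_logs are made up for illustration; the log group and stream names are the ones quoted above.

```shell
#!/bin/sh
# Hypothetical helpers (illustrative names, not part of any AWS tooling).

# Log group that SageMaker notebook instances write to (taken from the console path above).
LOG_GROUP="/aws/sagemaker/NotebookInstances"

# Build the log stream name from an instance name ($1) and a lifecycle hook ($2),
# e.g. "aws-sm-notebook-instance/LifecycleConfigOnCreate".
log_stream_name() {
    printf '%s/%s\n' "$1" "$2"
}

# Print the messages of one lifecycle log stream.
# $1: notebook instance name, $2: hook name (defaults to LifecycleConfigOnCreate).
# Requires valid AWS credentials; nothing is called until this function is invoked.
show_lifecycle_logs() {
    aws logs get-log-events \
        --log-group-name "$LOG_GROUP" \
        --log-stream-name "$(log_stream_name "$1" "${2:-LifecycleConfigOnCreate}")" \
        --query 'events[].message' \
        --output text
}
```

With these defined, running show_lifecycle_logs aws-sm-notebook-instance should print the on-create log lines, including a bad interpreter message like the one above if the script has the same problem.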

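Some internet research led me to this solution related to newline characters in shell scripts, which differ depending on whether you are on Windows or a UNIX system. Since I work on Windows, the shell scripts I created in VS Code contained DOS-style CRLF (carriage return + line feed) line endings. This can be fixed in VS Code with the button in the lower-right corner, which switches the line endings from CRLF to the LF (line feed) used by UNIX.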

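Since the compute instance used by AWS SageMaker is a Linux system, it cannot handle DOS-style CRLF line endings in shell scripts: the carriage return is read as part of the shebang line and shows up as ^M appended to /bin/bash, which obviously causes an error because no such interpreter exists.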
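
The stray carriage return can also be detected and removed from the command line, which is handy for checking the lifecycle scripts before running terraform apply. This is a minimal sketch using only standard tools (printf, grep, tr, od); the function names are hypothetical and the file paths in the trailing comment are the ones passed to filebase64 in the lifecycle configuration.

```shell
#!/bin/sh
# Hypothetical helpers for spotting and removing CRLF line endings.

# Dump the first line byte by byte; a CRLF file shows "\r" right before "\n".
show_shebang() {
    head -n 1 "$1" | od -c
}

# Succeeds (exit status 0) if the file contains at least one carriage return.
has_crlf() {
    cr=$(printf '\r')
    grep -q "$cr" "$1"
}

# Rewrite the file in place with every carriage return removed.
# Writing back with cat keeps the original file's permissions.
strip_crlf() {
    tmp=$(mktemp) || return 1
    if tr -d '\r' < "$1" > "$tmp"; then
        cat "$tmp" > "$1"
        status=$?
    else
        status=1
    fi
    rm -f "$tmp"
    return "$status"
}

# Possible use from the terraform folder:
# for f in ../scripts/on-create.sh ../scripts/on-start.sh; do
#     if has_crlf "$f"; then strip_crlf "$f"; fi
# done
```

Removing every carriage return is fine for shell scripts like these, but it would be too blunt for files that legitimately contain that character.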

So, in the end, terraform apply ran successfully:

$ terraform apply
...
...
aws_sagemaker_notebook_instance.notebook_instance: Still creating... [7m30s elapsed]
aws_sagemaker_notebook_instance.notebook_instance: Still creating... [7m40s elapsed]
aws_sagemaker_notebook_instance.notebook_instance: Creation complete after 7m43s [id=aws-sm-notebook-instance]

Apply complete! Resources: 1 added, 1 changed, 1 destroyed.
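
To keep CRLF line endings from creeping back in on a Windows checkout, one option is to pin shell scripts to LF in the repository itself. The sketch below assumes the project is tracked with Git; pin_lf_for_shell_scripts is a hypothetical helper name, while the *.sh text eol=lf rule and git add --renormalize are standard Git features.

```shell
#!/bin/sh
# Hypothetical helper: pin shell scripts to LF line endings in a Git repository.

# Append the rule "*.sh text eol=lf" to <repo>/.gitattributes ($1 is the repo root),
# unless exactly that line is already present.
pin_lf_for_shell_scripts() {
    attrs="$1/.gitattributes"
    rule='*.sh text eol=lf'
    if [ -f "$attrs" ] && grep -qxF "$rule" "$attrs"; then
        return 0
    fi
    printf '%s\n' "$rule" >> "$attrs"
}

# Possible use from the repository root, followed by re-normalizing tracked files:
# pin_lf_for_shell_scripts .
# git add --renormalize .
```

With this rule in place, Git checks out *.sh files with LF endings regardless of the platform's autocrlf setting, so the files read by filebase64 should stay usable on the Linux notebook instance.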