Terraform Error: error waiting for sagemaker notebook instance to create: unexpected state 'Failed', wanted target 'InService'. last error: %!s(<nil>)
The full error message after running terraform apply in the terraform folder of this source code in my GitHub repo (inspired by this tutorial and its related GitHub repo):
aws_sagemaker_notebook_instance.notebook_instance: Creating...
aws_sagemaker_notebook_instance.notebook_instance: Still creating... [10s elapsed]
aws_sagemaker_notebook_instance.notebook_instance: Still creating... [20s elapsed]
...
aws_sagemaker_notebook_instance.notebook_instance: Still creating... [15m21s elapsed]
aws_sagemaker_notebook_instance.notebook_instance: Still creating... [15m31s elapsed]
╷
│ Error: error waiting for sagemaker notebook instance (aws-sm-notebook-instance) to create: unexpected state 'Failed', wanted target 'InService'. last error: %!s(<nil>)
│
│ with aws_sagemaker_notebook_instance.notebook_instance,
│ on notebook_instance.tf line 2, in resource "aws_sagemaker_notebook_instance" "notebook_instance":
│ 2: resource "aws_sagemaker_notebook_instance" "notebook_instance" {
│
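Some internet research turned up a possible solution in this article, which suggests increasing the IDLE_TIME allowed in the on-start.sh script to IDLE_TIME=1800 (in seconds, i.e. 30 minutes). That should be enough for a deployment that takes about 15 minutes; however, the same error was thrown again.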
Next, I found this post on Stack Overflow, which suggests to
run terraform refresh, which will cause Terraform to refresh its state file against what actually exists with the cloud provider.
Unfortunately, running terraform apply after the refresh did not solve the problem either.
I wonder why the IDLE_TIME=1800 setting mentioned above has no effect. It should be more than enough for an apply that takes 15 minutes.
Edit: adding code details for better understanding
1. Create the SageMaker notebook instance
resource "aws_sagemaker_notebook_instance" "notebook_instance" {
  name                    = "aws-sm-notebook-instance"
  role_arn                = aws_iam_role.notebook_iam_role.arn
  instance_type           = "ml.t2.medium"
  lifecycle_config_name   = aws_sagemaker_notebook_instance_lifecycle_configuration.notebook_config.name
  default_code_repository = aws_sagemaker_code_repository.git_repo.code_repository_name
}
2. Define the SageMaker notebook lifecycle configuration
resource "aws_sagemaker_notebook_instance_lifecycle_configuration" "notebook_config" {
  name      = "dev-platform-al-sm-lifecycle-config"
  on_create = filebase64("../scripts/on-create.sh")
  on_start  = filebase64("../scripts/on-start.sh")
}
3. Define the Git repository to instantiate on the SageMaker notebook instance
resource "aws_sagemaker_code_repository" "git_repo" {
  code_repository_name = "aws-sm-notebook-instance-repo"
  git_config {
    repository_url = "https://github.com/AndreasLuckert/aws-sm-notebook-instance.git"
  }
}
Contents of on-start.sh (including the IDLE_TIME parameter)
Note that this script downloads the scripts/autostop.py script and schedules it via cron; you can find it here in the associated public repo containing the source code.
#!/bin/bash
set -e
## IDLE AUTOSTOP STEPS
## ----------------------------------------------------------------
## Setting the timeout (in seconds) for how long the SageMaker notebook can run idly before being auto-stopped
# -> e.g. 1800 s = 30 min since first deployment can take between 15 and 20 minutes which could then fail like so:
# "Error: error waiting for sagemaker notebook instance (aws-sm-notebook-instance) to create: unexpected state 'Failed', wanted target 'InService'. last error: %!s(<nil>)"
# Hint for solution under following link: https://yuyasugano.medium.com/machine-learning-infrastructure-terraforming-sagemaker-part-2-f2460a9a4663
IDLE_TIME=1800
# Getting the autostop.py script from GitHub
echo "Fetching the autostop script..."
wget https://raw.githubusercontent.com/andreasluckert/aws-sm-notebook-instance/main/scripts/autostop.py
# Using crontab to autostop the notebook when idle time is breached
echo "Starting the SageMaker autostop script in cron."
(crontab -l 2>/dev/null; echo "*/5 * * * * /usr/bin/python $PWD/autostop.py --time $IDLE_TIME --ignore-connections") | crontab -
## CUSTOM CONDA KERNEL USAGE STEPS
## ----------------------------------------------------------------
# Setting the proper user credentials
sudo -u ec2-user -i <<'EOF'
unset SUDO_UID
# Setting the source for the custom conda kernel
WORKING_DIR=/home/ec2-user/SageMaker/custom-miniconda
source "$WORKING_DIR/miniconda/bin/activate"
# Loading all the custom kernels
for env in "$WORKING_DIR"/miniconda/envs/*; do
    BASENAME=$(basename "$env")
    source activate "$BASENAME"
    python -m ipykernel install --user --name "$BASENAME" --display-name "Custom ($BASENAME)"
done
EOF
The solution was to check the CloudWatch log events under CloudWatch -> Log groups -> /aws/sagemaker/NotebookInstances -> aws-sm-notebook-instance/LifecycleConfigOnCreate, where I found the following error message:
/bin/bash: /tmp/OnCreate_2021-09-08-12-24rw5al34g: /bin/bash^M: bad interpreter: No such file or directory
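The same log events can also be pulled from a terminal instead of the console. The following is only a minimal sketch, assuming the AWS CLI is installed and configured with credentials that may read CloudWatch Logs and SageMaker. The function names are hypothetical helpers; the log group and stream naming follow the path quoted above, and the LifecycleConfigOnStart stream name is assumed by analogy for the on-start script.

```shell
#!/bin/bash
# Hypothetical helpers; only the log group/stream naming is taken from the path above.

LOG_GROUP="/aws/sagemaker/NotebookInstances"

# Build the log stream name for a notebook instance and lifecycle hook
# (hook: "LifecycleConfigOnCreate", or by assumption "LifecycleConfigOnStart").
lifecycle_log_stream() {
  printf '%s/%s\n' "$1" "$2"
}

# Print the messages written by the lifecycle script of the given instance.
show_lifecycle_logs() {
  local instance_name="$1"
  local hook="${2:-LifecycleConfigOnCreate}"
  aws logs get-log-events \
    --log-group-name "$LOG_GROUP" \
    --log-stream-name "$(lifecycle_log_stream "$instance_name" "$hook")" \
    --query 'events[].message' \
    --output text
}

# Print the failure reason SageMaker recorded for the instance, if any.
show_failure_reason() {
  aws sagemaker describe-notebook-instance \
    --notebook-instance-name "$1" \
    --query 'FailureReason' \
    --output text
}
```

For the instance in this question, that would be called as show_lifecycle_logs aws-sm-notebook-instance. Since Terraform only reports last error: %!s(<nil>), looking at the lifecycle logs or the recorded failure reason is usually the quicker way to see why the instance ended up in the Failed state.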
Some internet research led me to this solution related to newline characters in shell-scripts, which differ depending on whether you are on a Windows or a UNIX system.
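Since I work on Windows, the shell scripts I created in VS Code used DOS-style CRLF line endings. This can be fixed with the button in the lower right corner of VS Code, which switches the line endings from CRLF (carriage return + line feed) to the LF (line feed) used on UNIX.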
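Because the compute instance behind AWS SageMaker is a Linux system, it cannot handle the DOS-style CRLF line endings in the shell script: the carriage return shows up as a ^M appended to /bin/bash, which obviously causes an error because no such interpreter exists.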
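For anyone who prefers to check this from the command line rather than in the editor, here is a small sketch that detects and strips the carriage returns before running terraform apply. It assumes a shell with grep and GNU sed (for example Git Bash or WSL on Windows); has_crlf, strip_crlf and fix_scripts are hypothetical helper names, not part of the repo.

```shell
#!/bin/bash
# Hypothetical helpers for catching CRLF line endings before `terraform apply`.
# Assumes grep and GNU sed are available (e.g. Git Bash, WSL, or any Linux shell).

# Succeed (exit 0) if the file contains at least one carriage return.
has_crlf() {
  grep -q "$(printf '\r')" "$1"
}

# Strip a trailing carriage return from every line, in place.
strip_crlf() {
  sed -i 's/\r$//' "$1"
}

# Convert every given script that still has CRLF endings and report it.
fix_scripts() {
  local f
  for f in "$@"; do
    if has_crlf "$f"; then
      strip_crlf "$f"
      echo "converted: $f"
    fi
  done
}
```

From the terraform folder, this could be run as fix_scripts ../scripts/on-create.sh ../scripts/on-start.sh. A .gitattributes entry such as *.sh text eol=lf is another common way to keep Git from checking the scripts out with CRLF endings on Windows.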
So, finally, terraform apply succeeded:
$ terraform apply
...
...
aws_sagemaker_notebook_instance.notebook_instance: Still creating... [7m30s elapsed]
aws_sagemaker_notebook_instance.notebook_instance: Still creating... [7m40s elapsed]
aws_sagemaker_notebook_instance.notebook_instance: Creation complete after 7m43s [id=aws-sm-notebook-instance]
Apply complete! Resources: 1 added, 1 changed, 1 destroyed.