Ray cluster launch on AWS with YAML fails due to root permission issue
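I am trying to launch a Ray cluster using the YAML file below, but I get this error message:
bash: /root/ray_bootstrap_config.yaml: Permission denied
I think this may be because accessing the local root folder from where I launch the cluster requires permission. If I go to this folder locally, clicking on root asks for credentials, as shown in the screenshot.
Someone online said that I need to mount the file, but so far I have not been able to do that.
Resource: https://github.com/ray-project/ray/issues/9326
The cluster starts up initially, but this error occurs while the YAML file is being run. It successfully connects to AWS, launching the head and worker nodes, and first installs some dependencies such as boto etc., as listed in initialization_commands, but then gets stuck on the error shown.
Here is my YAML: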
# A unique identifier for the head node and workers of this cluster.
cluster_name: ray-pipeline-test #ray_example_aws
# The maximum number of worker nodes to launch in addition to the head
# node. This takes precedence over min_workers. min_workers defaults to 0.
max_workers: 1
docker:
    image: "xxxxxxxx1546.dkr.ecr.eu-west-2.amazonaws.com/xxxxx/pipeline:ray-aws"
    container_name: "ray_xxxxxxx_pipeline_aws" #"ray_nvidia_docker" # e.g. ray_docker
    pull_before_run: True
idle_timeout_minutes: 5
# Cloud-provider specific configuration.
provider:
    type: aws
    region: eu-west-2
    availability_zone: eu-west-2a
initialization_commands:
    #- conda install python==3.6
    # - wget https://repo.continuum.io/archive/Anaconda3-5.0.1-Linux-x86_64.sh || true
    # - bash Anaconda3-5.0.1-Linux-x86_64.sh -b -p $HOME/anaconda3 || true
    # - echo 'export PATH="$HOME/anaconda3/bin:$PATH"' >> ~/.bashrc
    # - conda create -n py36 python=3.6 anaconda
    #- wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
    # - sh Miniconda3-latest-Linux-x86_64.sh
    - source .bashrc
    - conda update conda -n base
    - conda create -n py36 python=3.6
    - conda activate py36
    - curl -fsSL https://get.docker.com -o get-docker.sh
    - sudo sh get-docker.sh
    - sudo usermod -aG docker $USER
    - sudo systemctl restart docker -f
    - sudo apt-get update
    - sudo apt-get upgrade
    - sudo apt-get install -y python-setuptools
    - sudo apt-get install -y build-essential curl unzip psmisc
    - pip install boto boto3
    - conda install boto boto3
    - pip install awscli
    - sudo pip install --default-timeout=100 future
    - pip install ray==1.0.1.post1
    - aws configure set aws_access_key_id xxxxxxxxxxx
    - aws configure set aws_secret_access_key xxxxxxxxxxxxxxxxxxxxx
    - eval $(aws ecr get-login --no-include-email --region eu-west-2)
auth:
    ssh_user: ubuntu
    ssh_private_key: /home/user/.ssh/aws_ubuntu_test.pem
head_node:
    InstanceType: c5.2xlarge
    ImageId: ami-xxxxxxxb31fd2c
    KeyName: aws_ubuntu_test
    BlockDeviceMappings:
        - DeviceName: /dev/sda1
          Ebs:
              VolumeSize: 200
worker_nodes:
    InstanceType: c5.2xlarge
    ImageId: ami-xxxxxxxxx31fd2c
    KeyName: aws_ubuntu_test
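When using a custom Docker image with Ray, you should make sure it is based on the rayproject/ray image, because Ray's autoscaler has many expectations about what is in the container, which user it will run as, and which settings/optimizations it can change.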
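As a rough way to test this, the docker section of the cluster config could be pointed temporarily at the stock image instead of the custom ECR image. This is only a sketch: the `latest` tag and the container name below are assumptions for illustration, and in practice the tag should be one that matches the Ray version installed on the nodes (ray==1.0.1.post1 in the question).

```yaml
# Sketch only: swap in the stock Ray image to check whether the custom image
# is what triggers the /root/ray_bootstrap_config.yaml permission error.
docker:
    # Assumption: stock image from Docker Hub; choose a tag matching your Ray version.
    image: "rayproject/ray:latest"
    # Hypothetical container name, used here for illustration only.
    container_name: "ray_container"
    pull_before_run: True
```

If the cluster comes up with the stock image, the custom pipeline image is the likely culprit, and rebuilding it on top of rayproject/ray (rather than an unrelated base image) would be the next thing to try.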