Ray cluster launch on AWS with YAML fails due to root permission issue

I am trying to launch a Ray cluster using the YAML file below, but I get this error message:

bash: /root/ray_bootstrap_config.yaml: Permission denied

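I think this may be because accessing the local root folder, from the machine where I launch the cluster, requires permissions. If I navigate to this folder locally, clicking on root asks for credentials, as shown in the linked image.

Someone online said I need to mount the file, but so far I have not been able to do that.

Resource: https://github.com/ray-project/ray/issues/9326

The cluster launches initially, but this error occurs while the YAML file is being run. It connects to AWS successfully, launching the head and worker nodes, and first installs some dependencies such as boto, as listed under initialization_commands, but then gets stuck on the error shown above.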

Here is my YAML:

# A unique identifier for the head node and workers of this cluster.
cluster_name: ray-pipeline-test #ray_example_aws

# The maximum number of worker nodes to launch in addition to the head
# node. This takes precedence over min_workers. min_workers defaults to 0.
max_workers: 1

docker:
    image: "xxxxxxxx1546.dkr.ecr.eu-west-2.amazonaws.com/xxxxx/pipeline:ray-aws" 
 
    container_name: "ray_xxxxxxx_pipeline_aws"      #"ray_nvidia_docker" # e.g. ray_docker
    pull_before_run: True

idle_timeout_minutes: 5



# Cloud-provider specific configuration.
provider:
    type: aws
    region: eu-west-2
    availability_zone: eu-west-2a

initialization_commands:

      # - conda install python==3.6
      # - wget https://repo.continuum.io/archive/Anaconda3-5.0.1-Linux-x86_64.sh || true
      # - bash Anaconda3-5.0.1-Linux-x86_64.sh -b -p $HOME/anaconda3 || true
      # - echo 'export PATH="$HOME/anaconda3/bin:$PATH"' >> ~/.bashrc
      # - conda create -n py36 python=3.6 anaconda

      # - wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
      # - sh Miniconda3-latest-Linux-x86_64.sh
      - source .bashrc
      - conda update conda -n base
      - conda create -n py36 python=3.6
      - conda activate py36


      - curl -fsSL https://get.docker.com -o get-docker.sh
      - sudo sh get-docker.sh
      - sudo usermod -aG docker $USER
      - sudo systemctl restart docker -f



      - sudo apt-get update
      - sudo apt-get upgrade
      - sudo apt-get install -y python-setuptools
      - sudo apt-get install -y build-essential curl unzip psmisc
      - pip install boto boto3
      - conda install boto boto3
      - pip install awscli
      - sudo pip install --default-timeout=100 future
      - pip install ray==1.0.1.post1
      - aws configure set aws_access_key_id xxxxxxxxxxx
      - aws configure set aws_secret_access_key xxxxxxxxxxxxxxxxxxxxx
      - eval $(aws ecr get-login --no-include-email --region eu-west-2)

auth:
    ssh_user:  ubuntu
    ssh_private_key: /home/user/.ssh/aws_ubuntu_test.pem

head_node:
    InstanceType: c5.2xlarge
    ImageId: ami-xxxxxxxb31fd2c
    KeyName: aws_ubuntu_test

    BlockDeviceMappings:
      - DeviceName: /dev/sda1
        Ebs:
          VolumeSize: 200


worker_nodes:
   InstanceType: c5.2xlarge
   ImageId: ami-xxxxxxxxx31fd2c
   KeyName: aws_ubuntu_test

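When using a custom Docker image with Ray, you should make sure it is based on the rayproject/ray image, because Ray's autoscaler has many expectations about what is in the container, which user it will run as, and which settings/optimizations it can change.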
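
One way to check whether the custom image is the cause might be to point the docker section at an official Ray image temporarily and relaunch. The fragment below is only a sketch under assumptions: the `latest` tag and the container name are placeholders that do not come from the question, and the tag should be chosen to match the Ray version installed on the nodes.

```yaml
# Sketch only: the docker section from the question, switched to an
# official Ray image to rule out the custom image as the cause.
docker:
    # Assumption: "latest" is a placeholder tag; pick one that matches
    # the Ray version installed on the nodes.
    image: "rayproject/ray:latest"
    # Hypothetical container name, not taken from the question.
    container_name: "ray_container"
    pull_before_run: True
```

If the cluster comes up with this image, the custom pipeline image could then be rebuilt on top of rayproject/ray and swapped back in.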