Ray Cluster Connection Closed, example script displays 127.0.0.1 instead of head IP address

I am trying to start a Ray cluster on Linux.

I am using Python 3.9.5.

The commands below are run from my head node (192.168.26.47).

I am starting the cluster with ray up config.yaml, where config.yaml is:

# A unique identifier for the head node and workers of this cluster.
cluster_name: default

# Running Ray in Docker images is optional (this docker section can be commented out).
# This executes all commands on all nodes in the docker container,
# and opens all the necessary ports to support the Ray cluster.
# Empty string means disabled. Assumes Docker is installed.
docker:
    # image: "rayproject/ray-ml:latest-gpu" # You can change this to latest-cpu if you don't need GPU support and want a faster startup
    # image: rayproject/ray:latest-gpu   # use this one if you don't need ML dependencies, it's faster to pull
    image: rayproject/ray:1.12.0-py39-cpu
    container_name: "ray_container"
    # If true, pulls latest version of image. Otherwise, `docker run` will only pull the image
    # if no cached version is present.
    pull_before_run: True
    run_options:   # Extra options to pass into "docker run"
        - --ulimit nofile=65536:65536

provider:
    type: local
    head_ip: 192.168.26.47
    # You may need to supply a public ip for the head node if you need
    # to run `ray up` from outside of the Ray cluster's network
    # (e.g. the cluster is in an AWS VPC and you're starting ray from your laptop)
    # This is useful when debugging the local node provider with cloud VMs.
    # external_head_ip: YOUR_HEAD_PUBLIC_IP
    worker_ips: [192.168.26.43]
    #,192.168.26.50]
    # Optional when running automatic cluster management on prem. If you use a coordinator server,
    # then you can launch multiple autoscaling clusters on the same set of machines, and the coordinator
    # will assign individual nodes to clusters as needed.
    #    coordinator_address: "<host>:<port>"

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: user.name
    # You can comment out `ssh_private_key` if the following machines don't need a private key for SSH access to the Ray
    # cluster:
    #   (1) The machine on which `ray up` is executed.
    #   (2) The head node of the Ray cluster.
    #
    # The machine that runs ray up executes SSH commands to set up the Ray head node. The Ray head node subsequently
    # executes SSH commands to set up the Ray worker nodes. When you run ray up, ssh credentials sitting on the ray up
    # machine are copied to the head node -- internally, the ssh key is added to the list of file mounts to rsync to head node.
    ssh_private_key: ~/.ssh/id_rsa

# The minimum number of workers nodes to launch in addition to the head
# node. This number should be >= 0.
# Typically, min_workers == max_workers == len(worker_ips).
min_workers: 0

# The maximum number of workers nodes to launch in addition to the head node.
# This takes precedence over min_workers.
# Typically, min_workers == max_workers == len(worker_ips).
max_workers: 0
# The default behavior for manually managed clusters is
# min_workers == max_workers == len(worker_ips),
# meaning that Ray is started on all available nodes of the cluster.
# For automatically managed clusters, max_workers is required and min_workers defaults to 0.

# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if the task requires adding more nodes then autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
# This number should be > 0.
upscaling_speed: 1.0

idle_timeout_minutes: 5

# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH. E.g. you could save your conda env to an environment.yaml file, mount
# that directory to all nodes and call `conda -n my_env -f /path1/on/remote/machine/environment.yaml`. In this
# example paths on all nodes must be the same (so that conda can be called always with the same argument)
file_mounts: {
#    "/path1/on/remote/machine": "/path1/on/local/machine",
#    "/path2/on/remote/machine": "/path2/on/local/machine",
}

# Files or directories to copy from the head node to the worker nodes. The format is a
# list of paths. The same path on the head node will be copied to the worker node.
# This behavior is a subset of the file_mounts behavior. In the vast majority of cases
# you should just use file_mounts. Only use this if you know what you're doing!
cluster_synced_files: []

# Whether changes to directories in file_mounts or cluster_synced_files in the head node
# should sync to the worker node continuously
file_mounts_sync_continuously: False

# Patterns for files to exclude when running rsync up or rsync down
rsync_exclude:
    - "**/.git"
    - "**/.git/**"

# Pattern files to use for filtering out files when running rsync up or rsync down. The file is searched for
# in the source directory and recursively through all subdirectories. For example, if .gitignore is provided
# as a value, the behavior will match git's behavior for finding and using .gitignore files.
rsync_filter:
    - ".gitignore"

# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is setup.
initialization_commands: []

# List of shell commands to run to set up each nodes.
setup_commands: []
    # If we have e.g. conda dependencies stored in "/path1/on/local/machine/environment.yaml", we can prepare the
    # work environment on each worker by:
    #   1. making sure each worker has access to this file i.e. see the `file_mounts` section
    #   2. adding a command here that creates a new conda environment on each node or if the environment already exists,
    #     it updates it:
    #      conda env create -q -n my_venv -f /path1/on/local/machine/environment.yaml || conda env update -q -n my_venv -f /path1/on/local/machine/environment.yaml
    #
    # Ray developers:
    # you probably want to create a Docker image that
    # has your Ray repo pre-cloned. Then, you can replace the pip installs
    # below with a git checkout <your_sha> (and possibly a recompile).
    # To run the nightly version of ray (as opposed to the latest), either use a rayproject docker image
    # that has the "nightly" (e.g. "rayproject/ray-ml:nightly-gpu") or uncomment the following line:
    # - pip install -U "ray[default] @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-2.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl"

# Custom commands that will be run on the head node after common setup.
head_setup_commands: []

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
  # If we have e.g. conda dependencies, we could create on each node a conda environment (see `setup_commands` section).
  # In that case we'd have to activate that env on each node before running `ray`:
  # - conda activate my_venv && ray stop
  # - conda activate my_venv && ulimit -c unlimited && ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml
    - ray stop
    - ulimit -c unlimited && ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
  # If we have e.g. conda dependencies, we could create on each node a conda environment (see `setup_commands` section).
  # In that case we'd have to activate that env on each node before running `ray`:
  # - conda activate my_venv && ray stop
  # - ray start --address=$RAY_HEAD_IP:6379
    - ray stop
    - ray start --address=$RAY_HEAD_IP:6379

When Ray starts up, I see many lines like this:

Shared connection to 192.168.26.47 closed.

Is this expected?

I then run this:

ray submit config.yaml script3.py

script3.py is:

from collections import Counter
import socket
import time
import ray

ray.init(address="auto")

@ray.remote
def f():
    time.sleep(0.001)
    # Return IP address.
    return socket.gethostbyname(socket.gethostname())

object_ids = [f.remote() for _ in range(10000)]
ip_addresses = ray.get(object_ids)
print(Counter(ip_addresses))

The output is:

Terry.Cay@dev104:~/ray$ ray submit config.yaml script3.py 
Loaded cached provider configuration
If you experience issues with the cloud provider, try re-running the command with --no-config-cache.
2022-04-29 16:21:13,252 INFO node_provider.py:49 -- ClusterState: Loaded cluster state: ['192.168.26.43', '192.168.26.47']
Fetched IP: 192.168.26.47
Shared connection to 192.168.26.47 closed.
Shared connection to 192.168.26.47 closed.
Fetched IP: 192.168.26.47
Shared connection to 192.168.26.47 closed.
Counter({'127.0.0.1': 100})
Shared connection to 192.168.26.47 closed.

According to the documentation, I should see both IPs, 192.168.26.43 and 192.168.26.47, in the output. Also, should all of those "Shared connection ... closed" messages be there?

Thanks in advance.

Adding more details about the head and worker machines:

lsb_release -a returns

No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.4 LTS
Release:    20.04
Codename:   focal

Both machines are virtual machines.

Closing comments:

I accepted Dmitri's answer because the error is benign. My underlying problem was that I needed to set the min and max workers parameters so that the cluster would actually use the worker node. The unexpected IP address doesn't matter.

This looks like a bug, closely related to this one: https://github.com/ray-project/ray/issues/24130

I will link this discussion on that issue in the hope that someone can look into it.

Could you describe how the head and worker are set up? There may be some subtle networking issue.

The message "Shared connection to 192.168.26.47 closed." is benign.
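
A quick way to confirm which nodes have actually joined the Ray cluster is to run something like the following from the head node (a minimal sketch, not from the original thread; it assumes Ray is already started there and uses the public ray.nodes() API):

import ray

# Connect to the cluster started by "ray start --head" on this machine.
ray.init(address="auto")

# ray.nodes() lists every node known to the cluster, with its IP and
# liveness; with min_workers/max_workers set to 0, the worker at
# 192.168.26.43 is not expected to show up as alive here.
for node in ray.nodes():
    print(node["NodeManagerAddress"], "alive" if node["Alive"] else "dead")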

Recording what happened for posterity -- the problem was the min_workers and max_workers settings:

# The minimum number of workers nodes to launch in addition to the head
# node. This number should be >= 0.
# Typically, min_workers == max_workers == len(worker_ips).
min_workers: 0

# The maximum number of workers nodes to launch in addition to the head node.
# This takes precedence over min_workers.
# Typically, min_workers == max_workers == len(worker_ips).
max_workers: 0

These settings tell Ray's cluster manager (also known as the autoscaler) that no worker nodes may be used in the cluster.

The solution is to either

  • set these values to the number of workers you want to use (see the example snippet after this list), or
  • omit them, in which case Ray defaults them to the number of worker IPs provided.
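
For example, with the single worker IP listed in this question's config, the relevant lines of config.yaml would look roughly like this (a sketch, not taken verbatim from the original thread):

# One worker IP (192.168.26.43) is listed, so allow exactly one worker
# in addition to the head node.
min_workers: 1
max_workers: 1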

These are follow-up items for the Ray maintainers:

  • Remove the 0 values from the example config.
  • Log a warning if max_workers is smaller than the number of provided IPs.
  • Document the default behavior.