Disabling Hyperthreading on Nodes in an AWS EC2 Cluster through Ray Configuration

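I have a task running on an EC2 cluster that gets progressively slower as more virtual CPUs are used (regardless of EBS volume size). To avoid this, I would like to disable hyperthreading on all the nodes, and I am trying to implement the advice given here: https://aws.amazon.com/blogs/compute/disabling-intel-hyper-threading-technology-on-amazon-linux/.
I am using Ray to launch the cluster on Ubuntu 18.04 and assume the initialization_commands section of the config.yaml file is the appropriate place for the bash command (the bootcmd: heading is not understood there). I have tried several different formats, but none seem to work; for example:-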

# List of commands run before setup_commands.
initialization_commands:
    - for cpunum in $(cat /sys/devices/system/cpu/cpu*/topology/thread_siblings_list | cut -s -d, -f2- | tr ',' '\n' | sort -un); do echo 0 > /sys/devices/system/cpu/cpu$cpunum/online; done

This produces the following error:-

bash: syntax error near unexpected token `sudo'
2020-07-26 22:53:04,949 INFO log_timer.py:17 -- NodeUpdater: i-0eefc0511ce029fb3: Initialization commands completed [LogTimer=139ms]
2020-07-26 22:53:04,949 INFO log_timer.py:17 -- NodeUpdater: i-0eefc0511ce029fb3: Applied config 39910e8bc12541ca5e316063231a2493642efee4 [LogTimer=60603ms]
2020-07-26 22:53:04,950 ERROR updater.py:348 -- NodeUpdater: i-0eefc0511ce029fb3: Error updating (Exit Status 1) ssh -i /home/haines/.ssh/ray-key2_us-east-1.pem -o ConnectTimeout=120s -o StrictHostKeyChecking=no -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_98734ce2b6/5f5c61af53/%C -o ControlPersist=10s -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 ubuntu@3.93.77.73 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && for cpunum in $(cat /sys/devices/system/cpu/cpu*/topology/thread_siblings_list | cut -s -d, -f2- | tr '"'"','"'"' '"'"'\n'"'"' | sort -un); sudo echo 0 > /sys/devices/system/cpu/cpu$cpunum/online; done'
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/haines/Projects/VF83/Ray_Cloud/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 351, in run
    raise e
  File "/home/haines/Projects/VF83/Ray_Cloud/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 341, in run
    self.do_update()
  File "/home/haines/Projects/VF83/Ray_Cloud/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 426, in do_update
    self.cmd_runner.run(cmd)
  File "/home/haines/Projects/VF83/Ray_Cloud/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 263, in run
    self.process_runner.check_call(final_cmd)
  File "/usr/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['ssh', '-i', '/home/haines/.ssh/ray-key2_us-east-1.pem', '-o', 'ConnectTimeout=120s', '-o', 'StrictHostKeyChecking=no', '-o', 'ControlMaster=auto', '-o', 'ControlPath=/tmp/ray_ssh_98734ce2b6/5f5c61af53/%C', '-o', 'ControlPersist=10s', '-o', 'IdentitiesOnly=yes', '-o', 'ExitOnForwardFailure=yes', '-o', 'ServerAliveInterval=5', '-o', 'ServerAliveCountMax=3', 'ubuntu@3.93.77.73', 'bash', '--login', '-c', '-i', '\'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && for cpunum in $(cat /sys/devices/system/cpu/cpu*/topology/thread_siblings_list | cut -s -d, -f2- | tr \'"\'"\',\'"\'"\' \'"\'"\'\n\'"\'"\' | sort -un); sudo echo 0 > /sys/devices/system/cpu/cpu$cpunum/online; done\'']' returned non-zero exit status 1.

2020-07-26 22:53:05,018 INFO log_timer.py:17 -- AWSNodeProvider: Set tag ray-node-status=setting-up on ['i-0eefc0511ce029fb3'] [LogTimer=205ms]
2020-07-26 22:53:05,140 ERROR commands.py:285 -- get_or_create_head_node: Updating 3.93.77.73 failed
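
Comparing the logged command with the config entry suggests where the syntax error comes from: the loop that was actually sent over ssh reads `...; sudo echo 0 > ...; done`, with no `do` keyword between the `for` list and the loop body. The sketch below only parses two shortened stand-ins for the loop with `bash -n` (nothing is executed); the variable and function names are made up for illustration.

```shell
# Shortened stand-ins for the two loop variants (the real list comes from
# thread_siblings_list). They are kept as strings so bash can parse them
# without running anything.
broken='for n in 1 2 3; sudo echo 0 > /dev/null; done'
fixed='for n in 1 2 3; do echo 0 > /dev/null; done'

# Hypothetical helper: bash -n reads and parses the command string but does
# not execute it, so only the syntax is checked.
check_syntax() {
    if bash -n -c "$1" 2>/dev/null; then
        echo "ok"
    else
        echo "syntax error"
    fi
}

check_syntax "$broken"
check_syntax "$fixed"
```

The first variant is rejected because bash expects `do` after the `for` list and finds `sudo` instead, which matches the "unexpected token" in the log.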

I have tried splitting the command over separate lines and putting it in the setup_commands section, but none of these work. Is there an easier way?

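UPDATE: I guess the syntax error could be caused by some stray spaces or characters (although I have tried numerous variations), but even without the loop, i.e. with just the sudo echo command writing to a single CPU, I get a permission error:-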

bash: /sys/devices/system/cpu/cpu50/online: Permission denied
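
A likely explanation for this error: in `sudo echo 0 > file`, the redirection is opened by the calling (non-root) shell before sudo starts, so only `echo` runs with elevated rights. A minimal sketch of a workaround is to pipe the value into `sudo tee`, which opens the file itself as root. The function name `write_as_root` and the `SUDO` variable are hypothetical, introduced only for this example.

```shell
# SUDO is a hypothetical knob for this sketch: it defaults to "sudo" and can
# be set to an empty string when the caller is already root.
SUDO="${SUDO-sudo}"

# Write a value to a file through a process that itself has root rights.
# tee opens the target file, so the write no longer depends on the
# permissions of the calling shell.
write_as_root() {
    value="$1"
    path="$2"
    printf '%s\n' "$value" | $SUDO tee "$path" > /dev/null
}

# Example (cpu50 is the CPU from the error message above):
#   write_as_root 0 /sys/devices/system/cpu/cpu50/online
```

An equivalent form is `sudo sh -c 'echo 0 > /sys/devices/system/cpu/cpu50/online'`, where the whole command, including the redirection, runs in a root shell.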

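UPDATE 2: I found there is an easier way: "export OMP_NUM_THREADS=1", but this seems to have no effect when done through a bash command in the setup. I am using Ray 0.8.6, which I believe is supposed to set OMP_NUM_THREADS=1, but that does not appear to be the case on the head node once the cluster is up and running.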

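Well, setting OMP_NUM_THREADS doesn't seem to do much. The solution is the first one, as described by AWS, but it additionally requires adding write permission to all the CPU online flags, in the Ray config file:-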

setup_commands:
    - sudo chmod -R 777 /sys/devices/system/cpu/*
    - for cpunum in $(cat /sys/devices/system/cpu/cpu*/topology/thread_siblings_list | cut -s -d, -f2- | tr ',' '\n' | sort -un); do echo 0 > /sys/devices/system/cpu/cpu$cpunum/online; done

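This allows any number of tasks to run simultaneously across all the physical CPUs as fast as a single one. Of course, it also means I have to run twice as many workers.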
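
A possible variation, not tested on this cluster: rather than making everything under /sys/devices/system/cpu world-writable, the loop can write each online flag through `sudo tee`, so no permissions need to change. The sketch below wraps the AWS pipeline in two functions; `CPU_ROOT`, `SUDO` and both function names are hypothetical, introduced so the logic can be tried against a scratch directory first. It assumes thread_siblings_list uses the comma-separated form (e.g. `0,48`); a range form such as `0-1` would not be split by `cut -d,`.

```shell
# Hypothetical knobs for this sketch: where the per-CPU directories live, and
# the command used to gain root (empty if already root).
CPU_ROOT="${CPU_ROOT:-/sys/devices/system/cpu}"
SUDO="${SUDO-sudo}"

# Print the CPU numbers that are secondary hyperthread siblings: everything
# after the first entry of each comma-separated sibling list, de-duplicated.
list_sibling_cpus() {
    cat "$CPU_ROOT"/cpu*/topology/thread_siblings_list \
        | cut -s -d, -f2- | tr ',' '\n' | sort -un
}

# Take every secondary sibling offline by writing 0 to its online flag.
# The write goes through tee so that it happens with root rights.
disable_sibling_cpus() {
    for cpunum in $(list_sibling_cpus); do
        echo 0 | $SUDO tee "$CPU_ROOT/cpu$cpunum/online" > /dev/null
    done
}

# Usage on a node:
#   disable_sibling_cpus
```

Afterwards, `lscpu` should report one thread per core if the siblings were taken offline.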