How to prevent trial execution on the head node
I am using ray.tune on an AWS "Autoscaling GPU cluster". Currently both my head and my workers have a GPU, and both are used to execute trials. I am trying to move to a setup where the head has no GPU, along the lines of how Ray's documentation defines an "Autoscaling GPU cluster". However, I keep running into CUDA problems on the head, which makes sense since it participates in trial execution. The solution seems simple: I think I need to prevent trial execution on the head, but I cannot find a way to do it. I tried various resources_per_trial values, and likewise with ray.init(), but without success.
Additional details:
- I am using ray 0.8.6.
- I set resources_per_trial={'gpu': 1}
- I set torch.device("cuda:0") everywhere
- I use 1 head (CPU only) and 1 worker (GPU only); I need at least 1 worker.
So everything should run on the GPU only, which is why I am focusing on preventing execution on the head. The setup is roughly what the sketch below shows.
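For concreteness, here is a minimal sketch of the setup described above (the model inside the trainable is a placeholder; the relevant parts are the hard-coded torch.device("cuda:0") and resources_per_trial={'gpu': 1}, using the Ray 0.8.x Trainable API):

```python
import torch
import ray
from ray import tune


class TrainableAE(tune.Trainable):
    """Placeholder trainable; only the device handling mirrors my real code."""

    def _setup(self, config):
        # Hard-coded CUDA device, as described in the details above.
        self.device = torch.device("cuda:0")
        self.model = torch.nn.Linear(10, 10).to(self.device)

    def _train(self):
        x = torch.randn(32, 10, device=self.device)
        loss = self.model(x).sum()
        return {"loss": loss.item()}


# Connect to the existing autoscaling cluster.
ray.init(address="auto")

# Reserve one GPU per trial; I expected this alone to keep trials off the
# CPU-only head, but it did not.
tune.run(TrainableAE, resources_per_trial={'gpu': 1})
```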
Regarding errors and warnings, I get the following:
WARNING tune.py:318 -- Tune detects GPUs, but no trials are using GPUs. To enable trials to use GPUs, set tune.run(resources_per_trial={'gpu': 1}...) which allows Tune to expose 1 GPU to each trial. You can also override `Trainable.default_resource_request` if using the Trainable API.
WARNING ray_trial_executor.py:549 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
WARNING worker.py:1047 -- The actor or task with ID ffffffffffffffff128bce290200 is pending and cannot currently be scheduled. It requires {CPU: 1.000000}, {GPU: 1.000000} for execution and {CPU: 1.000000}, {GPU: 1.000000} for placement, but this node only has remaining {node:10.160.26.189: 1.000000}, {object_store_memory: 12.304688 GiB}, {CPU: 3.000000}, {memory: 41.650391 GiB}. In total there are 0 pending tasks and 1 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale.
Even if I wait until the GPU worker is up and running, I still get the messages above.
The final error is:
ERROR trial_runner.py:520 -- Trial TrainableAE_a441f_00000: Error processing event.
Traceback (most recent call last):
File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 468, in _process_trial
result = self.trial_executor.fetch_result(trial)
File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 430, in fetch_result
result = ray.get(trial_future[0], DEFAULT_GET_TIMEOUT)
File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/ray/worker.py", line 1467, in get
values = worker.get_objects(object_ids, timeout=timeout)
File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/ray/worker.py", line 306, in get_objects
return self.deserialize_objects(data_metadata_pairs, object_ids)
File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/ray/worker.py", line 281, in deserialize_objects
return context.deserialize_objects(data_metadata_pairs, object_ids)
File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/ray/serialization.py", line 312, in deserialize_objects
self._deserialize_object(data, metadata, object_id))
File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/ray/serialization.py", line 252, in _deserialize_object
return self._deserialize_msgpack_data(data, metadata)
File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/ray/serialization.py", line 233, in _deserialize_msgpack_data
python_objects = self._deserialize_pickle5_data(pickle5_data)
File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/ray/serialization.py", line 221, in _deserialize_pickle5_data
obj = pickle.loads(in_band)
File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/torch/storage.py", line 136, in _load_from_bytes
return torch.load(io.BytesIO(b))
File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/torch/serialization.py", line 593, in load
return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/torch/serialization.py", line 773, in _legacy_load
result = unpickler.load()
File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/torch/serialization.py", line 729, in persistent_load
deserialized_objects[root_key] = restore_location(obj, location)
File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/torch/serialization.py", line 178, in default_restore_location
result = fn(storage, location)
File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/torch/serialization.py", line 154, in _cuda_deserialize
device = validate_cuda_device(location)
File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/torch/serialization.py", line 138, in validate_cuda_device
raise RuntimeError('Attempting to deserialize object on a CUDA '
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
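For reference, the generic workaround this error message points at would look roughly like the sketch below (with a hypothetical checkpoint path); it remaps CUDA-saved storages onto the CPU at load time. It is not what I actually want, since I want the trial to run on the GPU worker, but it illustrates why deserialization fails on a CPU-only head:

```python
import torch

# Hypothetical checkpoint file; the point is the map_location argument,
# which lets torch.load succeed on a machine where
# torch.cuda.is_available() is False.
state = torch.load("checkpoint.pt", map_location=torch.device("cpu"))
```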
Thanks to richliaw's comment. Once I stopped trying to prevent trial execution on the head and instead focused on figuring out why it was happening in the first place, the solution became obvious. The AMI I used for the cluster head had the NVIDIA drivers and CUDA installed. After I removed those, Ray no longer tried to execute trials on the head. So I guess that is how Ray decided to send computation to the head when resources_per_trial={'gpu': 1} was set: it detected a GPU there.
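For completeness: if removing the drivers and CUDA from the head AMI is not an option, my understanding is that you can also start the head with ray start --num-gpus=0 so that it advertises zero GPUs to the scheduler. A hedged sketch of what that could look like in the autoscaler config (the field name head_start_ray_commands follows the Ray 0.8.x autoscaler YAML; the port and other flags are illustrative):

```yaml
# cluster.yaml (fragment, illustrative)
head_start_ray_commands:
    - ray stop
    # --num-gpus=0 makes the head report zero GPUs, so trials requesting
    # {'gpu': 1} cannot be scheduled there.
    - ray start --head --redis-port=6379 --num-gpus=0 --autoscaling-config=~/ray_bootstrap_config.yaml
```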