failed to alloc X bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory

I'm trying to run a TensorFlow project, but I'm running into memory issues on my university's HPC cluster. I have to run predictions on hundreds of inputs of varying lengths. Our GPU nodes have different amounts of vmem, so I'm trying to set up the script in a way that doesn't crash on any combination of GPU node and input length.

After searching the web for a solution, I played around with TF_FORCE_UNIFIED_MEMORY, XLA_PYTHON_CLIENT_MEM_FRACTION, XLA_PYTHON_CLIENT_PREALLOCATE and TF_FORCE_GPU_ALLOW_GROWTH, as well as TensorFlow's set_memory_growth. As I understood it, with unified memory I should be able to use more memory than the GPU itself has.

Here is my final solution (only the relevant parts):

os.environ['TF_FORCE_UNIFIED_MEMORY']='1'
os.environ['XLA_PYTHON_CLIENT_MEM_FRACTION']='2.0'
#os.environ['XLA_PYTHON_CLIENT_PREALLOCATE']='false'
os.environ['TF_FORCE_GPU_ALLOW_GROWTH']='true' # as I understood, this is redundant with the set_memory_growth part :)

import tensorflow as tf    
gpus = tf.config.list_physical_devices('GPU')
if gpus:
  try:
    # Currently, memory growth needs to be the same across GPUs
    for gpu in gpus:
      print(gpu)
      tf.config.experimental.set_memory_growth(gpu, True)
    logical_gpus = tf.config.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
  except RuntimeError as e:
    # Memory growth must be set before GPUs have been initialized
    print(e)

Then I submit it on the cluster with --mem=30G (Slurm job scheduler) and --gres=gpu:1.

Here is the error my code crashes with. As I understand it, it does try to use unified memory, but fails for some reason.

Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 5582 MB memory) -> physical GPU (device: 0, name: GeForce GTX TITAN Black, pci bus id: 0000:02:00.0, compute capability: 3.5)
2021-08-24 09:22:02.053935: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 12758286336 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-08-24 09:22:03.738635: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 11482457088 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-08-24 09:22:05.418059: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 10334211072 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-08-24 09:22:07.102411: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 9300789248 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-08-24 09:22:08.784349: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 8370710016 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-08-24 09:22:10.468644: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 7533638656 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-08-24 09:22:12.150588: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 6780274688 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-08-24 09:23:10.326528: W external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:272] Allocator (GPU_0_bfc) ran out of memory trying to allocate 4.33GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.


Traceback (most recent call last):
  File "scripts/script.py", line 654, in <module>
    prediction_result, (r, t) = cf.to(model_runner.predict(processed_feature_dict, random_seed=seed), "cpu")
  File "env/lib/python3.7/site-packages/alphafold/model/model.py", line 134, in predict
    result, recycles = self.apply(self.params, jax.random.PRNGKey(random_seed), feat)
  File "env/lib/python3.7/site-packages/jax/_src/traceback_util.py", line 183, in reraise_with_filtered_traceback
    return fun(*args, **kwargs)
  File "env/lib/python3.7/site-packages/jax/_src/api.py", line 402, in cache_miss
    donated_invars=donated_invars, inline=inline)
  File "env/lib/python3.7/site-packages/jax/core.py", line 1561, in bind
    return call_bind(self, fun, *args, **params)
  File "env/lib/python3.7/site-packages/jax/core.py", line 1552, in call_bind
    outs = primitive.process(top_trace, fun, tracers, params)
  File "env/lib/python3.7/site-packages/jax/core.py", line 1564, in process
    return trace.process_call(self, fun, tracers, params)
  File "env/lib/python3.7/site-packages/jax/core.py", line 607, in process_call
    return primitive.impl(f, *tracers, **params)
  File "env/lib/python3.7/site-packages/jax/interpreters/xla.py", line 608, in _xla_call_impl
    *unsafe_map(arg_spec, args))
  File "env/lib/python3.7/site-packages/jax/linear_util.py", line 262, in memoized_fun
    ans = call(fun, *args)
  File "env/lib/python3.7/site-packages/jax/interpreters/xla.py", line 758, in _xla_callable
    compiled = compile_or_get_cached(backend, built, options)
  File "env/lib/python3.7/site-packages/jax/interpreters/xla.py", line 76, in compile_or_get_cached
    return backend_compile(backend, computation, compile_options)
  File "env/lib/python3.7/site-packages/jax/interpreters/xla.py", line 373, in backend_compile
    return backend.compile(built_c, compile_options=options)
jax._src.traceback_util.UnfilteredStackTrace: RuntimeError: Resource exhausted: Out of memory while trying to allocate 4649385984 bytes.

The stack trace below excludes JAX-internal frames.
The preceding is the original exception that occurred, unmodified.

--------------------

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "scripts/script.py", line 654, in <module>
    prediction_result, (r, t) = cf.to(model_runner.predict(processed_feature_dict, random_seed=seed), "cpu")
  File "env/lib/python3.7/site-packages/alphafold/model/model.py", line 134, in predict
    result, recycles = self.apply(self.params, jax.random.PRNGKey(random_seed), feat)
  File "env/lib/python3.7/site-packages/jax/interpreters/xla.py", line 373, in backend_compile
    return backend.compile(built_c, compile_options=options)
RuntimeError: Resource exhausted: Out of memory while trying to allocate 4649385984 bytes.

I'd be glad for any ideas on how to get this working and make use of all the available memory.

Thanks!

This answer may be useful for you. The nvidia_smi Python module has some useful tools, such as checking the total GPU memory. Here I've reproduced the code from the answer I mentioned.

import nvidia_smi

nvidia_smi.nvmlInit()
handle = nvidia_smi.nvmlDeviceGetHandleByIndex(0)

info = nvidia_smi.nvmlDeviceGetMemoryInfo(handle)

print("Total memory:", info.total)

nvidia_smi.nvmlShutdown()

I think this should be your starting point. A simple solution is to set the batch size based on the GPU memory. If you only want to get predictions, usually nothing other than the batch_size takes up much memory. Also, if any preprocessing is done on the GPU, I suggest moving it to the CPU.
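For illustration, here is a minimal sketch of that idea, built on the same nvidia_smi calls as above. The pick_batch_size helper and the bytes-per-sample figure are assumptions of mine; you would need to measure the real per-input memory cost of your model.

import nvidia_smi

def pick_batch_size(bytes_per_sample, safety_fraction=0.8, max_batch_size=64):
    # Hypothetical helper: derive a batch size from the total memory of GPU 0.
    nvidia_smi.nvmlInit()
    handle = nvidia_smi.nvmlDeviceGetHandleByIndex(0)
    info = nvidia_smi.nvmlDeviceGetMemoryInfo(handle)
    nvidia_smi.nvmlShutdown()
    usable = int(info.total * safety_fraction)  # leave headroom for weights and activations
    return max(1, min(max_batch_size, usable // bytes_per_sample))

# Example: assuming one input needs roughly 50 MB of GPU memory
batch_size = pick_batch_size(bytes_per_sample=50 * 1024**2)
print("Using batch size:", batch_size)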

Your GPU doesn't seem to fully support unified memory. The support is limited: in effect, the GPU keeps all the data in its own memory.

See this post for a description: https://developer.nvidia.com/blog/unified-memory-cuda-beginners/

In particular:

On systems with pre-Pascal GPUs like the Tesla K80, calling cudaMallocManaged() allocates size bytes of managed memory on the GPU device that is active when the call is made. Internally, the driver also sets up page table entries for all pages covered by the allocation, so that the system knows that the pages are resident on that GPU.

And:

Since these older GPUs can’t page fault, all data must be resident on the GPU just in case the kernel accesses it (even if it won’t).

According to TechPowerUp, your GPU is Kepler-based: https://www.techpowerup.com/gpu-specs/geforce-gtx-titan-black.c2549

As far as I know, TensorFlow should also warn about this, with something like:

Unified memory on GPUs with compute capability lower than 6.0 (pre-Pascal class GPUs) does not support oversubscription.
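If you want to check this from Python before relying on unified memory, something like the sketch below should work. It uses tf.config.experimental.get_device_details (available in TF 2.4+); the 6.0 threshold is the one from the warning above.

import tensorflow as tf

# Unified-memory oversubscription needs compute capability 6.0 (Pascal) or newer.
for gpu in tf.config.list_physical_devices('GPU'):
    details = tf.config.experimental.get_device_details(gpu)
    major, minor = details.get('compute_capability', (0, 0))
    name = details.get('device_name', gpu.name)
    if (major, minor) < (6, 0):
        print(name, "has compute capability %d.%d: unified memory cannot oversubscribe here" % (major, minor))
    else:
        print(name, "has compute capability %d.%d: oversubscription should work" % (major, minor))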