CUDA OOM - But the numbers don't add up?

Question

I am trying to train a model using PyTorch. When the training starts, I get the following error message:

RuntimeError: CUDA out of memory. Tried to allocate 5.37 GiB (GPU 0; 7.79 GiB total capacity; 742.54 MiB already allocated; 5.13 GiB free; 792.00 MiB reserved in total by PyTorch)

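I am wondering why this error occurs. As I see it, I have 7.79 GiB of total capacity, and the numbers it reports (742 MiB + 5.13 GiB + 792 MiB) do not add up to more than 7.79 GiB. When I check nvidia-smi, I see these processes running: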

|    0   N/A  N/A      1047      G   /usr/lib/xorg/Xorg                168MiB |
|    0   N/A  N/A      5521      G   /usr/lib/xorg/Xorg                363MiB |
|    0   N/A  N/A      5637      G   /usr/bin/gnome-shell              161MiB |

I realize that adding up all of these numbers might cut it close (168 + 363 + 161 + 742 + 792 + 5130 = 7356 MiB), but that is still less than the stated capacity of my GPU.

Answer 1

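This is more of a comment, but it is worth pointing out.

The underlying reason is indeed what talonmies commented, but you are summing up the numbers incorrectly. Let's see what happens when tensors are moved to the GPU (I tried this on my PC with an RTX 2060 that has 5.8 GiB of usable GPU memory in total).

Let's run the following Python commands interactively: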

Python 3.8.10 (default, Sep 28 2021, 16:10:42) 
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> a = torch.zeros(1).cuda()
>>> b = torch.zeros(500000000).cuda()
>>> c = torch.zeros(500000000).cuda()
>>> d = torch.zeros(500000000).cuda()

The following are the outputs of watch -n.1 nvidia-smi:

Right after importing torch:

|    0   N/A  N/A      1121      G   /usr/lib/xorg/Xorg                  4MiB |

After creating a:

|    0   N/A  N/A      1121      G   /usr/lib/xorg/Xorg                  4MiB |
|    0   N/A  N/A     14701      C   python                           1251MiB |

As you can see, PyTorch needs 1251 MiB just to start using CUDA, even if all you need is a single float.

After creating b:

|    0   N/A  N/A      1121      G   /usr/lib/xorg/Xorg                  4MiB |
|    0   N/A  N/A     14701      C   python                           3159MiB |

b needs 500000000 * 4 bytes, which is about 1907 MiB. That matches the increase in memory used by the python process (from 1251 MiB to 3159 MiB).
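To double-check that arithmetic, here is a minimal sketch in plain Python. It assumes the default dtype of torch.zeros is float32 (4 bytes per element); the variable names are only for illustration.

```python
# Size of a float32 tensor with 500,000,000 elements, in binary units.
numel = 500_000_000
bytes_per_element = 4  # assumption: default dtype float32

size_bytes = numel * bytes_per_element
size_mib = size_bytes / 1024**2
size_gib = size_bytes / 1024**3

# Growth of the python process in nvidia-smi between "after a" and "after b".
observed_delta_mib = 3159 - 1251

print(f"tensor size: {size_mib:.1f} MiB ({size_gib:.2f} GiB)")
print(f"observed growth in nvidia-smi: {observed_delta_mib} MiB")
```

The GiB figure is the same quantity that the OOM message reports as the attempted allocation when a tensor of this size no longer fits.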

After creating c:

|    0   N/A  N/A      1121      G   /usr/lib/xorg/Xorg                  4MiB |
|    0   N/A  N/A     14701      C   python                           5067MiB |

No surprise there.

After creating d:

|    0   N/A  N/A      1121      G   /usr/lib/xorg/Xorg                  4MiB |
|    0   N/A  N/A     14701      C   python                           5067MiB |

No further memory is allocated, and the OOM error is thrown:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: CUDA out of memory. Tried to allocate 1.86 GiB (GPU 0; 5.80 GiB total capacity; 3.73 GiB already allocated; 858.81 MiB free; 3.73 GiB reserved in total by PyTorch)

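Obviously:

The "already allocated" part is included in the "reserved in total by PyTorch" part. You can't add them together, otherwise the sum exceeds the total available memory.
The minimum memory required to get PyTorch running on the GPU (1251 MiB) is not included in the "reserved in total" part.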
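These two observations can also be checked from inside the process rather than by reading nvidia-smi. The following is only a sketch: the helper name cuda_memory_report is made up for illustration, and it assumes a PyTorch version recent enough to provide torch.cuda.mem_get_info. On a machine without PyTorch or without a CUDA device it simply reports nothing.

```python
# Sketch: compare PyTorch's own counters with what the CUDA driver reports.
try:
    import torch
except ImportError:  # lets the snippet run on a machine without PyTorch
    torch = None

MIB = 1024**2


def cuda_memory_report(device=0):
    """Hypothetical helper: memory figures in MiB, or None without CUDA."""
    if torch is None or not torch.cuda.is_available():
        return None
    # Touch the GPU once so the CUDA context (the fixed overhead) exists.
    torch.zeros(1, device=f"cuda:{device}")
    allocated = torch.cuda.memory_allocated(device)  # bytes used by live tensors
    reserved = torch.cuda.memory_reserved(device)    # bytes held by the caching allocator
    free, total = torch.cuda.mem_get_info(device)    # driver-level view of the whole GPU
    return {
        "allocated": allocated / MIB,
        "reserved": reserved / MIB,
        "free": free / MIB,
        "total": total / MIB,
        # Everything that is neither free nor reserved by this process's
        # allocator: this process's CUDA context plus other processes.
        "outside_allocator": (total - free - reserved) / MIB,
    }


report = cuda_memory_report()
if report is None:
    print("CUDA not available; nothing to report")
else:
    for name, mib in report.items():
        print(f"{name:>18}: {mib:8.1f} MiB")
```

If the reasoning above holds, "allocated" should never exceed "reserved", and "outside_allocator" should be roughly the context overhead plus whatever other processes are using.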

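Therefore, in your case, the sum should consist of:

792 MiB (reserved in total)
1251 MiB (minimum to get PyTorch running on the GPU, assuming this is the same for both of us)
5.13 GiB (free)
168 + 363 + 161 = 692 MiB (other processes)

They add up to approximately 7988 MiB, or about 7.80 GiB, which matches the 7.79 GiB total capacity reported for your GPU almost exactly.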
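For completeness, the sum can be reproduced with a few lines of Python. This is just the arithmetic from the list above; the 1251 MiB context overhead is the value measured on my machine and is assumed, not known, to be the same on yours.

```python
MIB_PER_GIB = 1024

# All values in MiB, taken from the error message and the nvidia-smi output.
parts_mib = {
    "reserved in total by PyTorch": 792,
    "CUDA context overhead (assumed)": 1251,
    "free": 5.13 * MIB_PER_GIB,
    "other processes": 168 + 363 + 161,
}

total_mib = sum(parts_mib.values())
total_gib = total_mib / MIB_PER_GIB

for name, mib in parts_mib.items():
    print(f"{name:>32}: {mib:8.1f} MiB")
print(f"{'sum':>32}: {total_mib:8.1f} MiB = {total_gib:.2f} GiB")
```

Note that "already allocated" (742.54 MiB) is deliberately left out, since it is already part of the reserved 792 MiB.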

