CUDA: RuntimeError: CUDA out of memory - BERT sagemaker

I have been trying to train a BertSequenceForClassification model using AWS SageMaker with the Hugging Face estimator, but I keep getting the error: RuntimeError: CUDA out of memory. Tried to allocate 192.00 MiB (GPU 0; 11.17 GiB total capacity; 10.73 GiB already allocated; 87.88 MiB free; 10.77 GiB reserved in total by PyTorch). The same code runs fine on my laptop.

  1. How can I see what is occupying those 10 GB of memory? My dataset is very small (68 kB), and so are my batch size (8) and number of epochs (1). When I run nvidia-smi, I only see "No processes running" and the GPU memory usage is zero. When I run print(torch.cuda.memory_summary(device=None, abbreviated=False)) from my training script (just before it throws the error), it prints:
|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |       0 B  |       0 B  |       0 B  |       0 B  |
|       from large pool |       0 B  |       0 B  |       0 B  |       0 B  |
|       from small pool |       0 B  |       0 B  |       0 B  |       0 B  |
|---------------------------------------------------------------------------|
| Active memory         |       0 B  |       0 B  |       0 B  |       0 B  |
|       from large pool |       0 B  |       0 B  |       0 B  |       0 B  |
|       from small pool |       0 B  |       0 B  |       0 B  |       0 B  |
|---------------------------------------------------------------------------|
| GPU reserved memory   |       0 B  |       0 B  |       0 B  |       0 B  |
|       from large pool |       0 B  |       0 B  |       0 B  |       0 B  |
|       from small pool |       0 B  |       0 B  |       0 B  |       0 B  |
|---------------------------------------------------------------------------|
| Non-releasable memory |       0 B  |       0 B  |       0 B  |       0 B  |
|       from large pool |       0 B  |       0 B  |       0 B  |       0 B  |
|       from small pool |       0 B  |       0 B  |       0 B  |       0 B  |
|---------------------------------------------------------------------------|
| Allocations           |       0    |       0    |       0    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| Active allocs         |       0    |       0    |       0    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| GPU reserved segments |       0    |       0    |       0    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |       0    |       0    |       0    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       0    |       0    |       0    |       0    |
|===========================================================================|

but I don't know what this means or how to interpret it.

  2. When I run !df -h I can see:
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs         30G   72K   30G   1% /dev
tmpfs            30G     0   30G   0% /dev/shm
/dev/xvda1      109G   93G   16G  86% /
/dev/xvdf       196G   61M  186G   1% /home/ec2-user/SageMaker

How is this memory different from the GPU memory? If /dev/xvdf has 200 GB, can I just use that instead? In my test script I tried
model = BertForSequenceClassification.from_pretrained(args.model_name, num_labels=args.num_labels).to("cpu") but that just gives the same error.

The CUDA out of memory error means that your GPU RAM (the memory on the graphics card) is full. This is different from the storage space on your device, which is what the df -h command reports.

This memory is occupied by the model you load into GPU memory, and it is unrelated to your dataset size. The GPU memory needed to train a model is at least twice the model's actual size, and more likely close to 4x (initial weights, checkpoints, gradients, optimizer states, etc.).
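As a rough, illustrative back-of-envelope estimate (assuming BERT-base with roughly 110M fp32 parameters and the Adam optimizer; the exact numbers vary by model and setup), you can see how training multiplies the raw model footprint:

```python
# Rough estimate of training memory for BERT-base with Adam in fp32.
# All figures are approximations for illustration, not exact measurements.
PARAMS = 110_000_000          # ~110M parameters in BERT-base
BYTES_PER_PARAM = 4           # fp32 = 4 bytes per parameter

weights = PARAMS * BYTES_PER_PARAM         # the model weights themselves
gradients = weights                        # one gradient per weight
adam_states = 2 * weights                  # Adam keeps two moment buffers per weight

total = weights + gradients + adam_states  # already ~4x the raw model size
print(f"weights:  {weights / 1e9:.2f} GB")   # -> weights:  0.44 GB
print(f"training: {total / 1e9:.2f} GB (before activations)")  # -> 1.76 GB
```

On top of this, the activations stored for backpropagation grow with batch size and sequence length, which is why reducing the batch size frees GPU memory even though the model itself is unchanged.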

Things you can try:

  • Use an instance with more GPU memory
  • Reduce the batch size
  • Use a different (smaller) model
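If you want to shrink the per-step batch without changing the effective batch size, one common option is gradient accumulation. A minimal sketch, assuming you control the training hyperparameters (the names below, per_device_train_batch_size and gradient_accumulation_steps, are the standard Hugging Face Trainer/TrainingArguments names; how you pass them to your SageMaker estimator depends on your setup):

```python
# Sketch: trade per-step batch size for gradient accumulation steps so that
# each forward/backward pass uses less GPU memory while the optimizer still
# effectively sees the original batch size of 8.
hyperparameters = {
    "per_device_train_batch_size": 2,   # smaller per-step batch -> less memory
    "gradient_accumulation_steps": 4,   # accumulate 4 steps before updating
    "num_train_epochs": 1,
}

effective_batch = (hyperparameters["per_device_train_batch_size"]
                   * hyperparameters["gradient_accumulation_steps"])
print(effective_batch)  # -> 8, same effective batch size as before
```

Each accumulation step only holds activations for 2 samples instead of 8, which is where the memory saving comes from.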