CUDA: RuntimeError: CUDA out of memory - BERT sagemaker

I have been trying to train a BertSequenceForClassification model using AWS SageMaker with the Hugging Face estimator, but I keep getting the error: RuntimeError: CUDA out of memory. Tried to allocate 192.00 MiB (GPU 0; 11.17 GiB total capacity; 10.73 GiB already allocated; 87.88 MiB free; 10.77 GiB reserved in total by PyTorch). The same code runs fine on my laptop.

  1. How can I see what is occupying those 10 GB of memory? My dataset is very small (68 kB), and so are my batch size (8) and number of epochs (1). When I run nvidia-smi, I only see "No processes running" and the GPU memory usage is zero. When I run print(torch.cuda.memory_summary(device=None, abbreviated=False)) from my training script (just before it throws the error), it prints:
|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |       0 B  |       0 B  |       0 B  |       0 B  |
|       from large pool |       0 B  |       0 B  |       0 B  |       0 B  |
|       from small pool |       0 B  |       0 B  |       0 B  |       0 B  |
|---------------------------------------------------------------------------|
| Active memory         |       0 B  |       0 B  |       0 B  |       0 B  |
|       from large pool |       0 B  |       0 B  |       0 B  |       0 B  |
|       from small pool |       0 B  |       0 B  |       0 B  |       0 B  |
|---------------------------------------------------------------------------|
| GPU reserved memory   |       0 B  |       0 B  |       0 B  |       0 B  |
|       from large pool |       0 B  |       0 B  |       0 B  |       0 B  |
|       from small pool |       0 B  |       0 B  |       0 B  |       0 B  |
|---------------------------------------------------------------------------|
| Non-releasable memory |       0 B  |       0 B  |       0 B  |       0 B  |
|       from large pool |       0 B  |       0 B  |       0 B  |       0 B  |
|       from small pool |       0 B  |       0 B  |       0 B  |       0 B  |
|---------------------------------------------------------------------------|
| Allocations           |       0    |       0    |       0    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| Active allocs         |       0    |       0    |       0    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| GPU reserved segments |       0    |       0    |       0    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |       0    |       0    |       0    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       0    |       0    |       0    |       0    |
|===========================================================================|

but I don't know what this means or how to interpret it.

  2. When I run !df -h I can see:
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs         30G   72K   30G   1% /dev
tmpfs            30G     0   30G   0% /dev/shm
/dev/xvda1      109G   93G   16G  86% /
/dev/xvdf       196G   61M  186G   1% /home/ec2-user/SageMaker

How is this memory different from the GPU memory? If /dev/xvdf has 200 GB, can I just use that instead? In my test script I tried
model = BertForSequenceClassification.from_pretrained(args.model_name, num_labels=args.num_labels).to("cpu") but that just gives the same error.

The CUDA out of memory error means that your GPU RAM (the memory on the graphics card) is full. This is different from the storage space on your device, which is what the df -h command reports.

This memory is occupied by the model you load into GPU memory, and it is unrelated to your dataset size. The GPU memory needed to train a model is at least twice the model's actual size, and more likely close to 4x (initial weights, checkpoints, gradients, optimizer states, etc.).
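As a rough, illustrative back-of-envelope estimate (assuming BERT-base with roughly 110M fp32 parameters and the Adam optimizer; the exact numbers vary by model and setup), you can see how training multiplies the raw model footprint:

```python
# Rough estimate of training memory for BERT-base with Adam in fp32.
# All figures are approximations for illustration, not exact measurements.
PARAMS = 110_000_000          # ~110M parameters in BERT-base
BYTES_PER_PARAM = 4           # fp32 = 4 bytes per parameter

weights = PARAMS * BYTES_PER_PARAM         # the model weights themselves
gradients = weights                        # one gradient per weight
adam_states = 2 * weights                  # Adam keeps two moment buffers per weight

total = weights + gradients + adam_states  # already ~4x the raw model size
print(f"weights:  {weights / 1e9:.2f} GB")   # -> weights:  0.44 GB
print(f"training: {total / 1e9:.2f} GB (before activations)")  # -> 1.76 GB
```

On top of this, the activations stored for backpropagation grow with batch size and sequence length, which is why reducing the batch size frees GPU memory even though the model itself is unchanged.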

Things you can try:

  • Use an instance with more GPU memory
  • Reduce the batch size
  • Use a different (smaller) model
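If you want to shrink the per-step batch without changing the effective batch size, one common option is gradient accumulation. A minimal sketch, assuming you control the training hyperparameters (the names below, per_device_train_batch_size and gradient_accumulation_steps, are the standard Hugging Face Trainer/TrainingArguments names; how you pass them to your SageMaker estimator depends on your setup):

```python
# Sketch: trade per-step batch size for gradient accumulation steps so that
# each forward/backward pass uses less GPU memory while the optimizer still
# effectively sees the original batch size of 8.
hyperparameters = {
    "per_device_train_batch_size": 2,   # smaller per-step batch -> less memory
    "gradient_accumulation_steps": 4,   # accumulate 4 steps before updating
    "num_train_epochs": 1,
}

effective_batch = (hyperparameters["per_device_train_batch_size"]
                   * hyperparameters["gradient_accumulation_steps"])
print(effective_batch)  # -> 8, same effective batch size as before
```

Each accumulation step only holds activations for 2 samples instead of 8, which is where the memory saving comes from.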