CUDA: RuntimeError: CUDA out of memory - BERT sagemaker
I have been trying to train a BertForSequenceClassification model on AWS SageMaker, using the Hugging Face estimator, but I keep getting the error: RuntimeError: CUDA out of memory. Tried to allocate 192.00 MiB (GPU 0; 11.17 GiB total capacity; 10.73 GiB already allocated; 87.88 MiB free; 10.77 GiB reserved in total by PyTorch)
The same code runs fine on my laptop.
- How can I see what is occupying those 10 GB of memory? My dataset is very small (68 KB), and so are my batch size (8) and number of epochs (1). When I run nvidia-smi, all I see is "No processes running" and zero GPU memory usage. When I run
print(torch.cuda.memory_summary(device=None, abbreviated=False))
from my training script (right before it throws the error), it prints
|===========================================================================|
| PyTorch CUDA memory summary, device ID 0 |
|---------------------------------------------------------------------------|
| CUDA OOMs: 0 | cudaMalloc retries: 0 |
|===========================================================================|
| Metric | Cur Usage | Peak Usage | Tot Alloc | Tot Freed |
|---------------------------------------------------------------------------|
| Allocated memory | 0 B | 0 B | 0 B | 0 B |
| from large pool | 0 B | 0 B | 0 B | 0 B |
| from small pool | 0 B | 0 B | 0 B | 0 B |
|---------------------------------------------------------------------------|
| Active memory | 0 B | 0 B | 0 B | 0 B |
| from large pool | 0 B | 0 B | 0 B | 0 B |
| from small pool | 0 B | 0 B | 0 B | 0 B |
|---------------------------------------------------------------------------|
| GPU reserved memory | 0 B | 0 B | 0 B | 0 B |
| from large pool | 0 B | 0 B | 0 B | 0 B |
| from small pool | 0 B | 0 B | 0 B | 0 B |
|---------------------------------------------------------------------------|
| Non-releasable memory | 0 B | 0 B | 0 B | 0 B |
| from large pool | 0 B | 0 B | 0 B | 0 B |
| from small pool | 0 B | 0 B | 0 B | 0 B |
|---------------------------------------------------------------------------|
| Allocations | 0 | 0 | 0 | 0 |
| from large pool | 0 | 0 | 0 | 0 |
| from small pool | 0 | 0 | 0 | 0 |
|---------------------------------------------------------------------------|
| Active allocs | 0 | 0 | 0 | 0 |
| from large pool | 0 | 0 | 0 | 0 |
| from small pool | 0 | 0 | 0 | 0 |
|---------------------------------------------------------------------------|
| GPU reserved segments | 0 | 0 | 0 | 0 |
| from large pool | 0 | 0 | 0 | 0 |
| from small pool | 0 | 0 | 0 | 0 |
|---------------------------------------------------------------------------|
| Non-releasable allocs | 0 | 0 | 0 | 0 |
| from large pool | 0 | 0 | 0 | 0 |
| from small pool | 0 | 0 | 0 | 0 |
|===========================================================================|
but I don't know what it means or how to interpret it.
- When I run
!df -h
I can see:
Filesystem Size Used Avail Use% Mounted on
devtmpfs 30G 72K 30G 1% /dev
tmpfs 30G 0 30G 0% /dev/shm
/dev/xvda1 109G 93G 16G 86% /
/dev/xvdf 196G 61M 186G 1% /home/ec2-user/SageMaker
How is this memory different from GPU memory? If there are 200 GB on /dev/xvdf, can I just use that? In my test script I tried
model = BertForSequenceClassification.from_pretrained(args.model_name,num_labels=args.num_labels).to("cpu")
but that just gives the same error.
The CUDA out of memory
error means that your GPU's RAM is full. This is different from the storage space on your instance (which is what the df -h
command reports).
That memory is occupied by the model you load onto the GPU and is unrelated to your dataset size. The GPU memory needed to train a model is at least twice the model's actual size, and more likely closer to 4x (initial weights, checkpoints, gradients, optimizer states, etc.).
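The 4x figure above can be sketched with some back-of-the-envelope arithmetic. This is a rough estimate only, assuming fp32 weights (4 bytes per parameter), an Adam-style optimizer (two extra states per parameter), and an approximate parameter count of 110M for BERT-base; it deliberately ignores activations, which grow with batch size and sequence length and are often what actually fills the remaining gigabytes:

```python
def training_memory_gb(num_params, bytes_per_param=4):
    """Rough GPU memory needed to train a model with an Adam-style optimizer."""
    weights = num_params * bytes_per_param        # model weights
    gradients = num_params * bytes_per_param      # one gradient per weight
    optimizer = 2 * num_params * bytes_per_param  # Adam: momentum + variance
    total = weights + gradients + optimizer       # ~4x the raw model size
    return total / 1024**3

# BERT-base has roughly 110M parameters
print(f"{training_memory_gb(110_000_000):.2f} GB")  # → 1.64 GB
```

So the model and optimizer alone are a small fraction of the 11 GiB card; the rest of the 10.73 GiB "already allocated" is typically activation memory, which is why reducing the batch size or sequence length helps even when the model itself seems to fit.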
Things you can try:
- Use an instance with more GPU memory
- Reduce the batch size
- Use a different (smaller) model
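The last two suggestions can be combined in the hyperparameters you pass to the estimator. This is only a sketch: the hyperparameter names below (per_device_train_batch_size, gradient_accumulation_steps, etc.) assume the training script forwards them to the transformers Trainer; adjust them to whatever arguments your script actually parses:

```python
# Hypothetical hyperparameters for the SageMaker Hugging Face estimator.
# Halving the batch size halves activation memory; gradient accumulation
# keeps the effective batch size at 8 so training dynamics stay comparable.
hyperparameters = {
    "model_name": "distilbert-base-uncased",  # smaller model than bert-base
    "per_device_train_batch_size": 4,         # halve the per-step batch
    "gradient_accumulation_steps": 2,         # effective batch stays 4 * 2 = 8
    "epochs": 1,
}
```

These get passed as the `hyperparameters` argument when you construct the estimator; they reach your training script as command-line arguments.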