CUDA: Out of memory error on 128 images dataset

I am trying to train YOLOR on the coco128 dataset on Google Colab. The training set contains 112 images, the validation set contains 8 images, and the test set contains 8 images.

However, training raises a CUDA out-of-memory error. How can that be? The whole dataset is only 128 images.

Using torch 1.7.0 CUDA:0 (Tesla T4, 15109MB)
Namespace(adam=False, batch_size=8, bucket='', cache_images=False, cfg='cfg/yolor_p6.cfg', data='data/coco128.yaml', device='0', epochs=300, evolve=False, exist_ok=False, global_rank=-1, hyp='./data/hyp.scratch.1280.yaml', image_weights=False, img_size=[1280, 1280], local_rank=-1, log_imgs=16, multi_scale=False, name='yolor_p6', noautoanchor=False, nosave=False, notest=False, project='runs/train', rect=False, resume=False, save_dir='runs/train/yolor_p613', single_cls=False, sync_bn=False, total_batch_size=8, weights='', workers=8, world_size=1)
Start Tensorboard with "tensorboard --logdir runs/train", view at http://localhost:6006/
2021-07-29 13:35:48.259076: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
Hyperparameters {'lr0': 0.01, 'lrf': 0.2, 'momentum': 0.937, 'weight_decay': 0.0005, 'warmup_epochs': 3.0, 'warmup_momentum': 0.8, 'warmup_bias_lr': 0.1, 'box': 0.05, 'cls': 0.5, 'cls_pw': 1.0, 'obj': 1.0, 'obj_pw': 1.0, 'iou_t': 0.2, 'anchor_t': 4.0, 'fl_gamma': 0.0, 'hsv_h': 0.015, 'hsv_s': 0.7, 'hsv_v': 0.4, 'degrees': 0.0, 'translate': 0.5, 'scale': 0.5, 'shear': 0.0, 'perspective': 0.0, 'flipud': 0.0, 'fliplr': 0.5, 'mosaic': 1.0, 'mixup': 0.0}
Model Summary: 665 layers, 37265016 parameters, 37265016 gradients, 81.564040600 GFLOPS
Optimizer groups: 145 .bias, 145 conv.weight, 149 other
Scanning labels ../coco128/train2017.cache3 (110 found, 0 missing, 2 empty, 0 duplicate, for 112 images): 112it [00:00, 11214.18it/s]
Scanning labels ../coco128/val2017.cache3 (8 found, 0 missing, 0 empty, 0 duplicate, for 8 images): 8it [00:00, 4100.00it/s]
NumExpr defaulting to 2 threads.
Image sizes 1280 train, 1280 test
Using 2 dataloader workers
Logging results to runs/train/yolor_p613
Starting training for 300 epochs...

     Epoch   gpu_mem       box       obj       cls     total   targets  img_size
  0% 0/14 [00:00<?, ?it/s]Traceback (most recent call last):
  File "train.py", line 539, in <module>
    train(hyp, opt, device, tb_writer, wandb)
  File "train.py", line 289, in train
    pred = model(imgs)  # forward
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/content/drive/MyDrive/YOLOR/yolor/models/models.py", line 543, in forward
    return self.forward_once(x)
  File "/content/drive/MyDrive/YOLOR/yolor/models/models.py", line 604, in forward_once
    x = module(x)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/container.py", line 117, in forward
    input = module(input)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/activation.py", line 394, in forward
    return F.silu(input, inplace=self.inplace)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py", line 1741, in silu
    return torch._C._nn.silu(input)
RuntimeError: CUDA out of memory. Tried to allocate 100.00 MiB (GPU 0; 14.76 GiB total capacity; 13.70 GiB already allocated; 67.75 MiB free; 13.76 GiB reserved in total by PyTorch)
  0% 0/14 [00:03<?, ?it/s]

VRAM usage has nothing to do with how many train/val examples you have; it depends on the model, the image size, and the batch size. 1280x1280 is a huge image size: on a 16 GB GPU you can probably only train at a batch size of 1 or 2. Your traceback confirms this: the forward pass alone has already allocated 13.70 GiB before the OOM.
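If you want to see this for yourself, you can print PyTorch's allocator counters right before the failing forward pass. This is a generic sketch using standard torch.cuda calls, not anything YOLOR-specific:

import torch

device = torch.device('cuda:0')
# Memory actually held by tensors (weights, activations, gradients).
print(f"allocated: {torch.cuda.memory_allocated(device) / 2**20:.0f} MiB")
# Memory reserved by PyTorch's caching allocator (what nvidia-smi sees).
print(f"reserved:  {torch.cuda.memory_reserved(device) / 2**20:.0f} MiB")

Run it once before training starts and again after the first batch: the jump comes from activations, whose size scales with batch size times image height times width, not from the dataset size.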

Either use a lower resolution or a smaller model, get a GPU with more VRAM, or reduce the batch size, for example as shown below.
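A sketch of such an invocation, assuming the flag names implied by the Namespace printout above (--batch-size and --img-size; paths unchanged from your run):

python train.py --batch-size 2 --img-size 640 640 --cfg cfg/yolor_p6.cfg --data data/coco128.yaml --device 0

Halving the image side length (1280 to 640) cuts per-image activation memory roughly 4x, so this combination should fit comfortably on a 15 GB T4.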

Also try NVIDIA AMP (automatic mixed precision), which runs the forward pass in FP16 and roughly halves activation memory.
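A minimal sketch of what that looks like with PyTorch's native AMP API (torch.cuda.amp, available since PyTorch 1.6, so it works on your 1.7 install). Here model, optimizer, dataloader, and compute_loss are placeholders for YOLOR's own objects:

import torch
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()  # scales the loss so FP16 gradients don't underflow
for imgs, targets in dataloader:
    imgs = imgs.to('cuda', non_blocking=True)
    optimizer.zero_grad()
    with autocast():               # forward pass runs in mixed precision
        pred = model(imgs)
        loss = compute_loss(pred, targets)
    scaler.scale(loss).backward()  # backward on the scaled loss
    scaler.step(optimizer)         # unscales gradients, then steps
    scaler.update()                # adjusts the scale factor for the next step

Besides the memory savings, mixed precision is also faster on the T4, which has tensor cores.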