Understanding why memory allocation occurs during inference, backpropagation, and model update

While tracking down a GPU OOM error, I placed the following checkpoints in my PyTorch code (running on a Google Colab P100):

learning_rate = 0.001
num_epochs = 50

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

print('check 1')
!nvidia-smi | grep MiB | awk '{print $9 $10 $11}'

model = MyModel()

print('check 2')
!nvidia-smi | grep MiB | awk '{print $9 $10 $11}'

model = model.to(device)

print('check 3')
!nvidia-smi | grep MiB | awk '{print $9 $10 $11}'

optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

print('check 4')
!nvidia-smi | grep MiB | awk '{print $9 $10 $11}'

for epoch in range(num_epochs):
    train_running_loss = 0.0
    train_accuracy = 0.0

    model = model.train()

    print('check 5')
    !nvidia-smi | grep MiB | awk '{print $9 $10 $11}'

    ## training step
    for i, (name, output_array, input) in enumerate(trainloader):
        
        output_array = output_array.to(device)
        input = input.to(device)
        comb = torch.zeros(1,1,100,1632).to(device)

        print('check 6')
        !nvidia-smi | grep MiB | awk '{print $9 $10 $11}'

        ## forward + backprop + loss
        output = model(input, comb)

        print('check 7')
        !nvidia-smi | grep MiB | awk '{print $9 $10 $11}'

        loss = my_loss(output, output_array)

        print('check 8')
        !nvidia-smi | grep MiB | awk '{print $9 $10 $11}'

        optimizer.zero_grad()

        print('check 9')
        !nvidia-smi | grep MiB | awk '{print $9 $10 $11}'

        loss.backward()

        print('check 10')
        !nvidia-smi | grep MiB | awk '{print $9 $10 $11}'

        ## update model params
        optimizer.step()

        print('check 11')
        !nvidia-smi | grep MiB | awk '{print $9 $10 $11}'

        train_running_loss += loss.detach().item()

        print('check 12')
        !nvidia-smi | grep MiB | awk '{print $9 $10 $11}'

        temp = get_accuracy(output, output_array)

        print('check 13')
        !nvidia-smi | grep MiB | awk '{print $9 $10 $11}'

        train_accuracy += temp     

with the following output:

check 1
2MiB/16160MiB
check 2
2MiB/16160MiB
check 3
3769MiB/16160MiB
check 4
3769MiB/16160MiB
check 5
3769MiB/16160MiB
check 6
3847MiB/16160MiB
check 7
6725MiB/16160MiB
check 8
6725MiB/16160MiB
check 9
6725MiB/16160MiB
check 10
9761MiB/16160MiB
check 11
16053MiB/16160MiB
check 12
16053MiB/16160MiB
check 13
16053MiB/16160MiB
check 6
16053MiB/16160MiB
check 7
16071MiB/16160MiB
check 8
16071MiB/16160MiB
check 9
16071MiB/16160MiB
check 10
16071MiB/16160MiB
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-11-f566d09448f9> in <module>()
     65 
     66         ## update model params
---> 67         optimizer.step()
     68 
     69         print('check 11')

3 frames
/usr/local/lib/python3.7/dist-packages/torch/optim/optimizer.py in wrapper(*args, **kwargs)
     86                 profile_name = "Optimizer.step#{}.step".format(obj.__class__.__name__)
     87                 with torch.autograd.profiler.record_function(profile_name):
---> 88                     return func(*args, **kwargs)
     89             return wrapper
     90 

/usr/local/lib/python3.7/dist-packages/torch/autograd/grad_mode.py in decorate_context(*args, **kwargs)
     26         def decorate_context(*args, **kwargs):
     27             with self.__class__():
---> 28                 return func(*args, **kwargs)
     29         return cast(F, decorate_context)
     30 

/usr/local/lib/python3.7/dist-packages/torch/optim/adam.py in step(self, closure)
    116                    lr=group['lr'],
    117                    weight_decay=group['weight_decay'],
--> 118                    eps=group['eps'])
    119         return loss

/usr/local/lib/python3.7/dist-packages/torch/optim/_functional.py in adam(params, grads, exp_avgs, exp_avg_sqs, max_exp_avg_sqs, state_steps, amsgrad, beta1, beta2, lr, weight_decay, eps)
     92             denom = (max_exp_avg_sqs[i].sqrt() / math.sqrt(bias_correction2)).add_(eps)
     93         else:
---> 94             denom = (exp_avg_sq.sqrt() / math.sqrt(bias_correction2)).add_(eps)
     95 
     96         step_size = lr / bias_correction1

RuntimeError: CUDA out of memory. Tried to allocate 2.32 GiB (GPU 0; 15.78 GiB total capacity; 11.91 GiB already allocated; 182.75 MiB free; 14.26 GiB reserved in total by PyTorch)

It makes sense to me that model = model.to(device) allocates 3.7 GB of memory.

But why does running the model, output = model(input, comb), allocate another 3 GB of memory?
Why does loss.backward() then allocate another 3 GB?
And why does optimizer.step() allocate yet another 6.3 GB?

I would appreciate it if someone could explain how PyTorch's GPU memory allocation works in this example.

  • Inference

    By default, inference through the model allocates memory to store the activations of each layer (activation meaning the intermediate layer inputs). This is needed for backpropagation, where those tensors are used to compute the gradients. A simple but effective example is the function defined by f: x -> x². Here, df/dx = 2x, i.e. in order to compute df/dx you are required to keep x in memory (a short runnable illustration of this follows the list below).

    If you use the torch.no_grad() context manager, you allow PyTorch to not save those values, thus saving memory. This is particularly useful when evaluating or testing your model, i.e. when backpropagation is not performed. Of course, you can't use it during training!

  • Backpropagation

    The backward pass call will allocate additional memory on the device to store each parameter's gradient value. Only leaf tensor nodes (model parameters and inputs) get their gradients stored in the grad attribute. This is why the memory usage only increases between the inference and backward calls (the memory-tracking sketch after this list walks through each stage).

  • Model parameters update

    Since you are using a stateful optimizer (Adam), some additional memory is required to save some of its parameters (its running estimates). Read the related PyTorch forum post. If you were to use a stateless optimizer (for instance plain SGD), you should not have any memory overhead on the step call.
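
As a concrete toy illustration of the f: x -> x² example above, here is a standalone sketch (independent of the question's code, runnable on CPU): autograd keeps x around for the backward pass, and only the leaf tensor ends up with a populated grad attribute.

import torch

x = torch.tensor(3.0, requires_grad=True)  # leaf tensor
y = x ** 2                                 # non-leaf intermediate result, f(x) = x^2
y.backward()

print(x.grad)  # tensor(6.) -- df/dx = 2x evaluated at x = 3, so x had to stay in memory
print(y.grad)  # None (with a warning): non-leaf tensors do not retain .grad by default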
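
To see these three stages in actual numbers, here is a minimal, self-contained sketch. It uses a throwaway multi-layer perceptron rather than the asker's MyModel, and prints torch.cuda.memory_allocated() after the forward pass, the backward pass and the optimizer step; it also shows that a forward pass under torch.no_grad() does not keep intermediate activations.

import torch

def allocated_mib():
    # memory currently held by PyTorch tensors on GPU 0, in MiB
    return torch.cuda.memory_allocated(0) / 1024**2

device = torch.device("cuda:0")
model = torch.nn.Sequential(                 # throwaway model, just for illustration
    torch.nn.Linear(4096, 4096), torch.nn.ReLU(),
    torch.nn.Linear(4096, 4096), torch.nn.ReLU(),
    torch.nn.Linear(4096, 4096),
).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(512, 4096, device=device)
print(f"model + input        : {allocated_mib():8.1f} MiB")

output = model(x)                            # forward: intermediate activations are kept for backward
loss = output.pow(2).mean()
print(f"after forward        : {allocated_mib():8.1f} MiB")

loss.backward()                              # backward: a .grad buffer is allocated for every parameter
print(f"after backward       : {allocated_mib():8.1f} MiB")

optimizer.step()                             # Adam lazily allocates its exp_avg / exp_avg_sq state here
print(f"after optimizer.step : {allocated_mib():8.1f} MiB")

with torch.no_grad():                        # no graph is recorded, so activations are not kept
    _ = model(x)
print(f"forward under no_grad: {allocated_mib():8.1f} MiB")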


All three of these steps can have memory needs. In summary, the memory allocated on your device will effectively depend on three elements:

  1. The size of your neural network: the bigger the model, the more layer activations and gradients will be saved in memory.

  2. Whether you are inside a torch.no_grad context: in that case only your model's state needs to be in memory (no activations or gradients are required).

  3. The type of optimizer used: whether it is stateful (keeps some running estimates during the parameter update) or stateless (doesn't need to). A short comparison sketch follows this list.
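
For the last point, you can inspect the optimizer state directly. The following standalone sketch (again independent of the question's code, runnable on CPU) does one update with Adam and one with plain SGD and prints what each keeps per parameter:

import torch

model = torch.nn.Linear(10, 10)
x = torch.randn(8, 10)

for opt_cls in (torch.optim.Adam, torch.optim.SGD):
    opt = opt_cls(model.parameters(), lr=1e-3)
    model(x).sum().backward()
    opt.step()                    # stateful optimizers allocate their running estimates here
    opt.zero_grad()
    # Adam keeps exp_avg and exp_avg_sq for every parameter (roughly twice the parameter
    # memory); plain SGD without momentum keeps nothing, so step() adds no overhead.
    state_keys = sorted({k for s in opt.state.values() for k, v in s.items()
                         if torch.is_tensor(v) and v.dim() > 0})
    print(opt_cls.__name__, "per-parameter state tensors:", state_keys or "none")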
