Understanding why memory allocation occurs during inference, backpropagation, and model update
While tracking down a GPU OOM error, I put the following checkpoints in my PyTorch code (running on Google Colab, P100):
learning_rate = 0.001
num_epochs = 50

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

print('check 1')
!nvidia-smi | grep MiB | awk '{print $9 $10 $11}'

model = MyModel()
print('check 2')
!nvidia-smi | grep MiB | awk '{print $9 $10 $11}'

model = model.to(device)
print('check 3')
!nvidia-smi | grep MiB | awk '{print $9 $10 $11}'

optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
print('check 4')
!nvidia-smi | grep MiB | awk '{print $9 $10 $11}'

for epoch in range(num_epochs):
    train_running_loss = 0.0
    train_accuracy = 0.0
    model = model.train()
    print('check 5')
    !nvidia-smi | grep MiB | awk '{print $9 $10 $11}'

    ## training step
    for i, (name, output_array, input) in enumerate(trainloader):
        output_array = output_array.to(device)
        input = input.to(device)
        comb = torch.zeros(1, 1, 100, 1632).to(device)
        print('check 6')
        !nvidia-smi | grep MiB | awk '{print $9 $10 $11}'

        ## forward + backprop + loss
        output = model(input, comb)
        print('check 7')
        !nvidia-smi | grep MiB | awk '{print $9 $10 $11}'

        loss = my_loss(output, output_array)
        print('check 8')
        !nvidia-smi | grep MiB | awk '{print $9 $10 $11}'

        optimizer.zero_grad()
        print('check 9')
        !nvidia-smi | grep MiB | awk '{print $9 $10 $11}'

        loss.backward()
        print('check 10')
        !nvidia-smi | grep MiB | awk '{print $9 $10 $11}'

        ## update model params
        optimizer.step()
        print('check 11')
        !nvidia-smi | grep MiB | awk '{print $9 $10 $11}'

        train_running_loss += loss.detach().item()
        print('check 12')
        !nvidia-smi | grep MiB | awk '{print $9 $10 $11}'

        temp = get_accuracy(output, output_array)
        print('check 13')
        !nvidia-smi | grep MiB | awk '{print $9 $10 $11}'

        train_accuracy += temp
with the following output:
check 1
2MiB/16160MiB
check 2
2MiB/16160MiB
check 3
3769MiB/16160MiB
check 4
3769MiB/16160MiB
check 5
3769MiB/16160MiB
check 6
3847MiB/16160MiB
check 7
6725MiB/16160MiB
check 8
6725MiB/16160MiB
check 9
6725MiB/16160MiB
check 10
9761MiB/16160MiB
check 11
16053MiB/16160MiB
check 12
16053MiB/16160MiB
check 13
16053MiB/16160MiB
check 6
16053MiB/16160MiB
check 7
16071MiB/16160MiB
check 8
16071MiB/16160MiB
check 9
16071MiB/16160MiB
check 10
16071MiB/16160MiB
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-11-f566d09448f9> in <module>()
65
66 ## update model params
---> 67 optimizer.step()
68
69 print('check 11')
3 frames
/usr/local/lib/python3.7/dist-packages/torch/optim/optimizer.py in wrapper(*args, **kwargs)
86 profile_name = "Optimizer.step#{}.step".format(obj.__class__.__name__)
87 with torch.autograd.profiler.record_function(profile_name):
---> 88 return func(*args, **kwargs)
89 return wrapper
90
/usr/local/lib/python3.7/dist-packages/torch/autograd/grad_mode.py in decorate_context(*args, **kwargs)
26 def decorate_context(*args, **kwargs):
27 with self.__class__():
---> 28 return func(*args, **kwargs)
29 return cast(F, decorate_context)
30
/usr/local/lib/python3.7/dist-packages/torch/optim/adam.py in step(self, closure)
116 lr=group['lr'],
117 weight_decay=group['weight_decay'],
--> 118 eps=group['eps'])
119 return loss
/usr/local/lib/python3.7/dist-packages/torch/optim/_functional.py in adam(params, grads, exp_avgs, exp_avg_sqs, max_exp_avg_sqs, state_steps, amsgrad, beta1, beta2, lr, weight_decay, eps)
92 denom = (max_exp_avg_sqs[i].sqrt() / math.sqrt(bias_correction2)).add_(eps)
93 else:
---> 94 denom = (exp_avg_sq.sqrt() / math.sqrt(bias_correction2)).add_(eps)
95
96 step_size = lr / bias_correction1
RuntimeError: CUDA out of memory. Tried to allocate 2.32 GiB (GPU 0; 15.78 GiB total capacity; 11.91 GiB already allocated; 182.75 MiB free; 14.26 GiB reserved in total by PyTorch)
It makes sense to me that model = model.to(device) allocates 3.7 GB of memory.
But why does running the model, output = model(input, comb), allocate another 3 GB?
And why does loss.backward() then allocate another 3 GB?
And why does optimizer.step() then allocate another 6.3 GB?
I would be grateful if someone could explain how PyTorch's GPU memory allocation model works in this example.
Inference

By default, inference through the model allocates memory to store the activations of each layer (activations as in intermediate layer inputs). These are needed for backpropagation, where those tensors are used to compute the gradients. A simple but effective example is the function defined by f: x -> x². Here, df/dx = 2x, i.e., in order to compute df/dx you are required to keep x in memory.
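A minimal sketch of this (my own illustration, not from the original answer): autograd keeps x alive during the forward pass precisely so it can evaluate 2x during the backward pass.

import torch

x = torch.tensor(3.0, requires_grad=True)
y = x ** 2        # forward pass: autograd saves x for the backward pass
y.backward()      # backward pass: computes df/dx = 2x using the saved x
print(x.grad)     # tensor(6.)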
If you use the torch.no_grad() context manager, you allow PyTorch to not save those values, thus saving memory. This is particularly useful when evaluating or testing your model, i.e., when backpropagation is not performed. Of course, you can't use it during training!
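For illustration, a typical evaluation loop looks like the sketch below (my own, not from the original post; val_loader stands in for your validation data):

model.eval()                          # put layers like dropout in eval mode
with torch.no_grad():                 # no activations are saved in this block
    for name, output_array, input in val_loader:
        input = input.to(device)
        comb = torch.zeros(1, 1, 100, 1632).to(device)
        output = model(input, comb)   # forward pass builds no autograd graph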
Backpropagation

The backward pass call will allocate additional memory on the device to store each parameter's gradient value. Only leaf tensor nodes (model parameters and inputs) get their gradient stored in the grad attribute. This is why the memory usage only increases between the inference and the backward call.
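A small sketch of the leaf/non-leaf distinction (my own illustration):

import torch

w = torch.randn(3, requires_grad=True)   # leaf tensor, like a model parameter
x = torch.randn(3)                       # plain input tensor
y = (w * x).sum()                        # non-leaf, intermediate result

y.backward()
print(w.grad)    # populated: gradient memory is allocated for leaf tensors
print(y.grad)    # None: non-leaf tensors do not retain their gradient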
Model parameter update

Since you are using a stateful optimizer (Adam), some additional memory is required to save its running estimates for every parameter. Read this related PyTorch forum post. If you try with a stateless optimizer (for instance, plain SGD without momentum), you shouldn't have any memory overhead on the step call.
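As a sketch (mine, not from the original answer), Adam's per-parameter state can be inspected after the first step call; it lazily allocates two extra tensors, exp_avg and exp_avg_sq, per parameter, on the same device as the parameters:

import torch

model = torch.nn.Linear(10, 10)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

model(torch.randn(1, 10)).sum().backward()
optimizer.step()                 # first step allocates the optimizer state

for p in model.parameters():
    state = optimizer.state[p]
    # two extra tensors per parameter: roughly 2x the model size in memory
    print(state['exp_avg'].shape, state['exp_avg_sq'].shape)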
All three of these steps have memory needs. In summary, the memory allocated on your device will effectively depend on three elements:

The size of your neural network: the bigger the model, the more layer activations and gradients will be saved in memory.

Whether you are inside the torch.no_grad context: in that case, only your model's state needs to be in memory (no activations or gradients necessary).

The type of optimizer used: whether it is stateful (saves running estimates during the parameter update) or stateless (doesn't need to).
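Finally, a sketch of a more precise way to take the checkpoints from the question (my own suggestion, reusing the question's names inside the training loop): torch.cuda.memory_allocated reports what PyTorch tensors actually occupy, while nvidia-smi also counts memory merely reserved by PyTorch's caching allocator.

import torch

def report(tag):
    alloc = torch.cuda.memory_allocated() / 2**20      # MiB used by tensors
    reserved = torch.cuda.memory_reserved() / 2**20    # MiB held by the allocator
    print(f'{tag}: {alloc:.0f} MiB allocated, {reserved:.0f} MiB reserved')

report('before forward')
output = model(input, comb)      # activations allocated here
report('after forward')
loss = my_loss(output, output_array)
loss.backward()                  # parameter gradients allocated here
report('after backward')
optimizer.step()                 # Adam state allocated on the first step
report('after step')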