Gradient accumulation in an RNN
When running a larger RNN network, I run into some memory issues (GPU), but I want to keep my batch size reasonable, so I wanted to try out gradient accumulation. In a network that predicts the output in one go that seems self-evident, but in an RNN you do multiple forward passes for each input step. Because of that, I fear that my implementation does not work as intended. I started from the excellent examples by user albanD here, but I think they should be modified when using an RNN. The reason, I think, is that you accumulate much more gradient because you do multiple forwards per sequence.
My current implementation looks like this, while also allowing for AMP in PyTorch 1.6, which seems to matter: everything needs to be called in the right place. Note that this is just an abstract version; it may look like a lot of code, but it is mostly comments.
def train(epochs):
    """Main training loop. Loops for `epochs` number of epochs. Calls `process`."""
    for epoch in range(1, epochs + 1):
        train_loss = process("train")
        valid_loss = process("valid")
        # ... check whether we improved over earlier epochs
        if lr_scheduler:
            lr_scheduler.step(valid_loss)
def process(do):
    """Do a single epoch run through the dataloader of the training or validation set.
    Also takes care of optimizing the model after every `gradient_accumulation_steps` steps.
    Calls `step` for each batch where it gets the loss from."""
    if do == "train":
        model.train()
        torch.set_grad_enabled(True)
    else:
        model.eval()
        torch.set_grad_enabled(False)

    loss = 0.
    for batch_idx, batch in enumerate(dataloaders[do]):
        step_loss, avg_step_loss = step(batch)
        loss += avg_step_loss

        if do == "train":
            if amp:
                scaler.scale(step_loss).backward()
                if (batch_idx + 1) % gradient_accumulation_steps == 0:
                    # Unscales the gradients of optimizer's assigned params in-place
                    scaler.unscale_(optimizer)
                    # clip in-place
                    clip_grad_norm_(model.parameters(), 2.0)
                    scaler.step(optimizer)
                    scaler.update()
                    model.zero_grad()
            else:
                step_loss.backward()
                if (batch_idx + 1) % gradient_accumulation_steps == 0:
                    clip_grad_norm_(model.parameters(), 2.0)
                    optimizer.step()
                    model.zero_grad()

    # return average loss
    return loss / len(dataloaders[do])
def step():
    """Processes one step (one batch) by forwarding multiple times to get a final prediction for a given sequence."""
    # do stuff... init hidden state and first input etc.
    loss = torch.tensor([0.]).to(device)

    for i in range(target_len):
        with torch.cuda.amp.autocast(enabled=amp):
            # overwrite previous decoder_hidden
            output, decoder_hidden = model(decoder_input, decoder_hidden)

            # compute loss between predicted classes (bs x classes) and correct classes for _this word_
            item_loss = criterion(output, target_tensor[i])

            # We calculate the gradients for the average step so that when
            # we do take an optimizer.step, it takes into account the mean step_loss
            # across batches. So basically (A+B+C)/3 = A/3 + B/3 + C/3
            loss += (item_loss / gradient_accumulation_steps)

        topv, topi = output.topk(1)
        decoder_input = topi.detach()

    return loss, loss.item() / target_len
The above does not seem to work the way I had hoped, i.e. it still runs into out-of-memory issues very quickly. I think the reason is that step already accumulates so much information, but I am not sure.
For simplicity, I am only going to deal with gradient accumulation with amp enabled; the idea without amp is the same. Your step as shown runs under amp, so let's stick with that.
step
There is an example of gradient accumulation in the PyTorch documentation about amp. You should do it inside step. Each time you run loss.backward() the gradients are accumulated inside the tensor leaves, which can then be optimized by optimizer. Hence, your step should look like this (see comments):
def step():
    """Processes one step (one batch) by forwarding multiple times to get a final prediction for a given sequence."""
    # You should not accumulate loss on `GPU`, RAM and CPU is better for that
    # Use GPU only for calculations, not for gathering metrics etc.
    loss = 0
    for i in range(target_len):
        with torch.cuda.amp.autocast(enabled=amp):
            # where is decoder_input from?
            # I assume there is one in the real code
            output, decoder_hidden = model(decoder_input, decoder_hidden)
            # Here you divide by accumulation steps (and sequence length)
            item_loss = criterion(output, target_tensor[i]) / (
                gradient_accumulation_steps * target_len
            )

        scaler.scale(item_loss).backward()
        loss += item_loss.detach().item()

        # Not sure what topv was for here
        _, topi = output.topk(1)
        decoder_input = topi.detach()

    # No need to return a loss with history now, as we did backward above
    return loss / target_len
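As a minimal illustration of the accumulation into leaf tensors mentioned above (a toy sketch with a made-up parameter w, not the author's model), every backward() call adds into the .grad buffer of the leaves rather than overwriting it, which is exactly what the per-timestep backward inside step relies on:

import torch

# Hypothetical toy example: a single leaf parameter instead of a real RNN
w = torch.tensor(2.0, requires_grad=True)

# Two "timestep" losses, each with its own backward call
(w * 3.0).backward()
(w * 5.0).backward()

print(w.grad)  # tensor(8.) -- the gradients were summed, not overwritten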
As you detach decoder_input anyway (so it is like a brand-new hidden input without history, and the parameters will be optimized based on that, not based on all runs), there is no need for backward in process. Also, you probably don't need decoder_hidden: if it is not passed to the network, a torch.tensor filled with zeros is passed implicitly. We should also divide by gradient_accumulation_steps * target_len, as that is how many backwards we will run before a single optimization step.
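A small numeric check of that scaling (a hedged sketch; n_backwards is a made-up stand-in for gradient_accumulation_steps * target_len): dividing each partial loss before its own backward() leaves the same gradient as one backward on the mean of all partial losses, which is the (A+B+C)/3 = A/3 + B/3 + C/3 identity from the question:

import torch

# Hypothetical sketch: three per-timestep losses on a single leaf parameter
w = torch.tensor(1.0, requires_grad=True)
n_backwards = 3  # stands in for gradient_accumulation_steps * target_len

# One backward per timestep, each partial loss pre-divided by n_backwards
for coeff in (2.0, 4.0, 6.0):
    ((w * coeff) / n_backwards).backward()

print(w.grad)  # tensor(4.) == d/dw of the mean loss (2w + 4w + 6w) / 3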
Because some of your variables are ill-defined, I assume you just made a sketch of what is going on.
Also, if you want to keep the history, you should not detach decoder_input, in which case it would look like this:
def step():
    """Processes one step (one batch) by forwarding multiple times to get a final prediction for a given sequence."""
    loss = 0
    for i in range(target_len):
        with torch.cuda.amp.autocast(enabled=amp):
            output, decoder_hidden = model(decoder_input, decoder_hidden)
            item_loss = criterion(output, target_tensor[i]) / (
                gradient_accumulation_steps * target_len
            )

        _, topi = output.topk(1)
        decoder_input = topi
        loss += item_loss

    scaler.scale(loss).backward()
    return loss.detach().cpu() / target_len
This effectively backpropagates through the RNN multiple timesteps at once and will probably raise OOM; I'm not sure what you are after here. If that's the case, there is not much you can do AFAIK, as the RNN computation is simply too long to fit into the GPU.
process
Only the relevant part of this code is shown, so it would be:
loss = 0.0
for batch_idx, batch in enumerate(dataloaders[do]):
    # Here everything is detached from the graph, so we're safe
    avg_step_loss = step(batch)
    loss += avg_step_loss

    if do == "train":
        if (batch_idx + 1) % gradient_accumulation_steps == 0:
            # You can use unscale as in the example in PyTorch's docs,
            # just like you did
            scaler.unscale_(optimizer)
            # clip in-place
            clip_grad_norm_(model.parameters(), 2.0)
            scaler.step(optimizer)
            scaler.update()
            # IMO optimizer.zero_grad is more readable in this case,
            # but that's nitpicking
            optimizer.zero_grad()

# return average loss
return loss / len(dataloaders[do])
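For completeness, process and step rely on a few objects being created once outside these functions. A hedged sketch of that assumed surrounding setup (the optimizer choice, learning rate and accumulation value below are illustrative, not taken from the question):

import torch
from torch.nn.utils import clip_grad_norm_

amp = True                       # toggle mixed precision
gradient_accumulation_steps = 4  # assumed value; pick whatever fits your memory budget
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# model, dataloaders and criterion are assumed to exist as in the snippets above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)             # assumed optimizer
lr_scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer)  # matches lr_scheduler.step(valid_loss) in train
scaler = torch.cuda.amp.GradScaler(enabled=amp)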
Question-like
[...] in an RNN you do multiple forward passes for each input step. Because of that, I fear that my implementation does not work as intended.
That's fine. For each forward you should usually do one backward (which seems to be the case here; see step above for the possible options). After that, we (usually) do not need the loss connected to the graph anymore, as we have already performed backpropagation, obtained the gradients, and are ready to optimize the parameters.
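In other words, once backward() has run, keep only a detached number for bookkeeping; a small hedged sketch with a made-up running_loss, mirroring the item_loss.detach().item() call in step above:

# After backward() the graph has served its purpose; keep only a plain float
running_loss += loss.item()  # .item() returns a Python number, so the graph can be freed

# whereas accumulating the tensor itself keeps every batch's graph alive:
# running_loss += loss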
That loss needs to have history, as it goes back to the process loop where backward will be called on it.
There is no need to call backward in process, as shown above.