Understanding accumulated gradients in PyTorch

Question

I am trying to understand the inner workings of gradient accumulation in PyTorch. My question is somewhat related to two earlier questions.

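A comment on the accepted answer to the second question suggests that accumulated gradients can be used when a minibatch is too large to perform a gradient update in a single forward pass and therefore has to be split into multiple sub-batches.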

Consider the following toy example:

import numpy as np
import torch


class ExampleLinear(torch.nn.Module):

    def __init__(self):
        super().__init__()
        # Initialize the weight at 1
        self.weight = torch.nn.Parameter(torch.Tensor([1]).float(),
                                         requires_grad=True)

    def forward(self, x):
        return self.weight * x


if __name__ == "__main__":
    # Example 1
    model = ExampleLinear()

    # Generate some data
    x = torch.from_numpy(np.array([4, 2])).float()
    y = 2 * x

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    y_hat = model(x)          # forward pass

    loss = (y - y_hat) ** 2
    loss = loss.mean()        # MSE loss

    loss.backward()           # backward pass

    optimizer.step()          # weight update

    print(model.weight.grad)  # tensor([-20.])
    print(model.weight)       # tensor([1.2000], requires_grad=True)

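This is exactly the result one would expect. Now suppose we want to process the dataset sample by sample, making use of gradient accumulation: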

    # Example 2: MSE sample-by-sample
    model2 = ExampleLinear()
    optimizer = torch.optim.SGD(model2.parameters(), lr=0.01)

    # Compute loss sample-by-sample, then average it over all samples
    loss = []
    for k in range(len(y)):
        y_hat = model2(x[k])
        loss.append((y[k] - y_hat) ** 2)
    loss = sum(loss) / len(y)

    loss.backward()     # backward pass
    optimizer.step()    # weight update

    print(model2.weight.grad)  # tensor([-20.])
    print(model2.weight)       # tensor([1.2000], requires_grad=True)

As expected, the gradient is computed when the .backward() method is called.

Finally to my question: what exactly happens 'under the hood'?

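My understanding is that the computational graph is dynamically updated, going from <PowBackward> to <AddBackward> to <DivBackward> as operations are applied to the loss variable, and that no information about the data used in each forward pass is retained anywhere except in the loss tensor, which can keep being updated until the backward pass.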

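Are there any caveats to the reasoning in the paragraph above? Lastly, are there any best practices to follow when using gradient accumulation (i.e. can the approach I use in Example 2 backfire somehow)?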

Answer 1

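You are not actually accumulating gradients. Simply leaving out optimizer.zero_grad() has no effect when there is only a single .backward() call, because the gradients start out at zero anyway (technically they are None, but they are treated as zero the first time a gradient is written).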
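To see this for yourself, here is a minimal sketch. The names `w`, `x` and `y` are only for illustration (they stand in for the model weight and data from the question). It shows that .grad starts out as None, and that gradients only accumulate once a second .backward() call happens without zeroing in between:

```python
import torch

w = torch.nn.Parameter(torch.tensor([1.0]))
x = torch.tensor([4.0, 2.0])
y = 2 * x

# Before any backward pass there is no gradient at all.
grad_before = w.grad  # None

loss = ((y - w * x) ** 2).mean()
loss.backward()
grad_first = w.grad.clone()  # tensor([-20.])

# A second backward pass on a fresh graph, without zeroing, adds to .grad.
loss = ((y - w * x) ** 2).mean()
loss.backward()
grad_second = w.grad.clone()  # tensor([-40.])
```

With a single .backward() call there is nothing to add to, which is why both of your examples end up with the same gradient.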

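The only difference between your two versions is how you calculate the final loss. The for loop of the second example does the same calculations as PyTorch does in the first example, but you do them individually, and PyTorch cannot optimise (parallelise and vectorise) your for loop, which makes an especially staggering difference on GPUs, provided the tensors are not tiny.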

Before getting to gradient accumulation, let's start with your question:

Finally to my question: what exactly happens 'under the hood'?

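Every operation on tensors is tracked in a computational graph if and only if one of the operands is already part of a computational graph. When you set requires_grad=True on a tensor, it creates a computational graph with a single vertex, the tensor itself, which will remain a leaf in the graph. Any operation with that tensor creates a new vertex, which is the result of the operation, so there is an edge from the operands to it, tracking the operation that was performed.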

a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(4.0)
c = a + b # => tensor(6., grad_fn=<AddBackward0>)

a.requires_grad # => True
a.is_leaf # => True

b.requires_grad # => False
b.is_leaf # => True

c.requires_grad # => True
c.is_leaf # => False

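Every intermediate tensor automatically requires gradients and has a grad_fn, which is the function that calculates the partial derivatives with respect to its inputs. Thanks to the chain rule, we can traverse the whole graph in reverse order to calculate the derivatives with respect to every single leaf, which are the parameters we want to optimise. That is the idea of backpropagation, also known as reverse mode differentiation. For more details I recommend reading Calculus on Computational Graphs: Backpropagation.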

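PyTorch uses that exact idea: when you call loss.backward(), it traverses the graph in reverse order, starting from loss, and calculates the derivatives for each vertex. Whenever a leaf is reached, the calculated derivative for that tensor is added to its .grad attribute (or stored there directly if .grad is still None).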

In your first example, that would lead to:

MeanBackward -> PowBackward -> SubBackward -> MulBackward
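If you want to inspect that chain yourself, the following sketch walks the graph through grad_fn and next_functions. The names `w`, `x`, `y`, `chain` and `node` are my own, and the loop simply follows the first tracked input of each node, which is enough for this toy model but not a general graph traversal:

```python
import torch

w = torch.nn.Parameter(torch.tensor([1.0]))
x = torch.tensor([4.0, 2.0])
y = 2 * x
loss = ((y - w * x) ** 2).mean()

# Follow the first tracked input of each node until there is nothing left.
chain = []
node = loss.grad_fn
while node is not None:
    chain.append(type(node).__name__)
    # next_functions holds (node, input_index) pairs; untracked inputs are None.
    tracked = [fn for fn, _ in node.next_functions if fn is not None]
    node = tracked[0] if tracked else None

print(" -> ".join(chain))
```

The printed names carry a numeric suffix (as in the grad_fn=<PowBackward0> output elsewhere in this answer), and the walk ends at an AccumulateGrad node, which is the piece that writes the result into the leaf's .grad.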

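The second example is almost identical, except that you calculate the mean manually, and instead of having a single path for the loss, you have one path for each element of the loss calculation. To clarify, the single path also calculates the derivatives of each element, but it does so internally, which again opens up possibilities for some optimisations.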

# Example 1
loss = (y - y_hat) ** 2
# => tensor([16.,  4.], grad_fn=<PowBackward0>)

# Example 2
loss = []
for k in range(len(y)):
    y_hat = model2(x[k])
    loss.append((y[k] - y_hat) ** 2)
loss
# => [tensor([16.], grad_fn=<PowBackward0>), tensor([4.], grad_fn=<PowBackward0>)]

In either case a single graph is created and it is backpropagated exactly once, which is why it is not considered gradient accumulation.

Gradient Accumulation

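Gradient accumulation refers to the situation where multiple backward passes are performed before updating the parameters. The goal is to keep the same model parameters for multiple inputs (batches) and then update the model's parameters based on all of these batches, instead of performing an update after every single batch.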

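Let's revisit your example. x has size [2], which is the size of our entire dataset. For some reason, we need to calculate the gradients based on the whole dataset. That is naturally the case when using a batch size of 2, since we would have the whole dataset at once. But what happens if we can only have batches of size 1? We could run them individually and update the model after each batch as usual, but then we would not be calculating the gradients over the whole dataset.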

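What we need to do is run each sample individually with the same model parameters and calculate the gradients without updating the model. Now you might be thinking, isn't that what you did in the second version? Almost, but not quite. There is a crucial problem in your version: you are using the same amount of memory as in the first version, because you perform the same calculations and therefore keep the same number of values in the computational graph.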

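How do we free memory? We need to get rid of the tensors of the previous batch and also the computational graph, because that uses a lot of memory to keep track of everything that is necessary for the backpropagation. The computational graph is automatically destroyed when .backward() is called (unless retain_graph=True is specified).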
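A small sketch of that behaviour, using illustrative names (`w`, `x`, `second_backward_failed`) that are not part of your code. For this particular graph, which saves tensors for the multiplication and the power, a second backward pass over the same graph fails unless the first one was asked to retain it:

```python
import torch

w = torch.nn.Parameter(torch.tensor([1.0]))
x = torch.tensor([4.0])
loss = ((2 * x - w * x) ** 2).mean()

loss.backward()  # frees the buffers saved for the backward pass

try:
    loss.backward()  # same graph again
    second_backward_failed = False
except RuntimeError:
    second_backward_failed = True

# With retain_graph=True the graph survives, at the cost of holding the memory.
w.grad = None
loss = ((2 * x - w * x) ** 2).mean()
loss.backward(retain_graph=True)
loss.backward()
grad_after_two_passes = w.grad.clone()  # tensor([-64.]), i.e. -32 twice
```

For gradient accumulation you want the default behaviour: each batch builds its own graph, and calling .backward() on it releases that memory before the next batch is processed.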

def calculate_loss(x: torch.Tensor) -> torch.Tensor:
    y = 2 * x
    y_hat = model(x)
    loss = (y - y_hat) ** 2
    return loss.mean()


# With multiple batches of size 1
batches = [torch.tensor([4.0]), torch.tensor([2.0])]

optimizer.zero_grad()
for i, batch in enumerate(batches):
    # The loss needs to be scaled, because the mean should be taken across the whole
    # dataset, which requires the loss to be divided by the number of batches.
    loss = calculate_loss(batch) / len(batches)
    loss.backward()
    print(f"Batch size 1 (batch {i}) - grad: {model.weight.grad}")
    print(f"Batch size 1 (batch {i}) - weight: {model.weight}")

# Updating the model only after all batches
optimizer.step()
print(f"Batch size 1 (final) - grad: {model.weight.grad}")
print(f"Batch size 1 (final) - weight: {model.weight}")

Output (I removed the "Parameter containing:" messages for readability):

Batch size 1 (batch 0) - grad: tensor([-16.])
Batch size 1 (batch 0) - weight: tensor([1.], requires_grad=True)
Batch size 1 (batch 1) - grad: tensor([-20.])
Batch size 1 (batch 1) - weight: tensor([1.], requires_grad=True)
Batch size 1 (final) - grad: tensor([-20.])
Batch size 1 (final) - weight: tensor([1.2000], requires_grad=True)

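As you can see, the model keeps the same parameter for all batches while the gradients are accumulated, and there is a single update at the end. Note that the loss needs to be scaled per batch, so that it carries the same weight over the whole dataset as it would if you used a single batch.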
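To make the scaling point concrete, here is a sketch comparing three variants on the same toy data. The names `w_full`, `w_acc` and `w_unscaled` are hypothetical and only used for this comparison:

```python
import torch

x = torch.tensor([4.0, 2.0])
y = 2 * x

# Full batch: one forward pass, one backward pass.
w_full = torch.nn.Parameter(torch.tensor([1.0]))
((y - w_full * x) ** 2).mean().backward()

# Two batches of size 1: each loss is divided by the number of batches.
w_acc = torch.nn.Parameter(torch.tensor([1.0]))
batches = [(x[:1], y[:1]), (x[1:], y[1:])]
for xb, yb in batches:
    loss = ((yb - w_acc * xb) ** 2).mean() / len(batches)
    loss.backward()

# Without the scaling the accumulated gradient is the sum, not the mean.
w_unscaled = torch.nn.Parameter(torch.tensor([1.0]))
for xb, yb in batches:
    ((yb - w_unscaled * xb) ** 2).mean().backward()

print(w_full.grad, w_acc.grad, w_unscaled.grad)
# tensor([-20.]) tensor([-20.]) tensor([-40.])
```

Dividing by the number of batches only reproduces the full-batch mean exactly when all batches have the same size. With a smaller last batch you would have to weight each loss by its share of the samples instead.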

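While in this example the whole dataset is used before performing the update, you can easily change that to update the parameters after a certain number of batches, but you have to remember to zero out the gradients after an optimiser step has been taken. The general recipe would be: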

accumulation_steps = 10
for i, batch in enumerate(batches):
    # Scale the loss to the mean of the accumulated batch size
    loss = calculate_loss(batch) / accumulation_steps
    loss.backward()
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        # Reset gradients, for the next accumulated batches
        optimizer.zero_grad()

You can find that recipe and more techniques for working with large batch sizes in HuggingFace - Training Neural Nets on Larger Batches: Practical Tips for 1-GPU, Multi-GPU & Distributed setups.
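As a sanity check of the recipe, here is a self-contained sketch. The `train` helper, the random data and the choice of 8 samples are my own assumptions for illustration. It compares one pass of plain SGD with batches of size 4 against batches of size 1 accumulated over 4 steps:

```python
import torch

torch.manual_seed(0)
x = torch.randn(8)
y = 2 * x


def train(batch_size: int, accumulation_steps: int) -> torch.Tensor:
    """Run one pass over the data and return the final weight."""
    w = torch.nn.Parameter(torch.tensor([1.0]))
    optimizer = torch.optim.SGD([w], lr=0.01)
    optimizer.zero_grad()
    batches = list(zip(x.split(batch_size), y.split(batch_size)))
    for i, (xb, yb) in enumerate(batches):
        # Scale so the accumulated gradient is the mean over the larger batch.
        loss = ((yb - w * xb) ** 2).mean() / accumulation_steps
        loss.backward()
        if (i + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
    return w.detach().clone()


w_large = train(batch_size=4, accumulation_steps=1)
w_accumulated = train(batch_size=1, accumulation_steps=4)
print(torch.allclose(w_large, w_accumulated, atol=1e-6))
```

Both runs perform two optimiser steps and should end up with the same weight up to floating-point error. Note that if the number of batches is not a multiple of accumulation_steps, the gradients of the leftover batches are never applied by this loop, so you would need to decide whether to step once more at the end or drop them.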

Answer 2

Thanks for the wonderful answer above.

Just to add a few things.

Computational graph

Derivatives
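For the toy model used throughout this thread, the derivative can be worked out by hand. The following is a sketch derived from the code in this thread, with the notation (N for the number of samples, w for the weight, η for the learning rate) being my own choice. It uses x = (4, 2), y = 2x = (8, 4), w = 1 and η = 0.01:

```latex
\begin{aligned}
L(w) &= \frac{1}{N}\sum_{i=1}^{N}\left(y_i - w x_i\right)^2 \\
\frac{\partial L}{\partial w} &= -\frac{2}{N}\sum_{i=1}^{N} x_i\left(y_i - w x_i\right) \\
\left.\frac{\partial L}{\partial w}\right|_{w=1} &= -\frac{2}{2}\bigl(4\,(8-4) + 2\,(4-2)\bigr) = -16 - 4 = -20 \\
w_{\text{new}} &= w - \eta\,\frac{\partial L}{\partial w} = 1 - 0.01\cdot(-20) = 1.2
\end{aligned}
```

The two terms -16 and -4 are the contributions of the two single-sample batches, which matches the printed gradients going from -16 after the first batch to -20 after the second.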

Code

import numpy as np
import torch


class ExampleLinear(torch.nn.Module):

    def __init__(self):
        super().__init__()
        # Initialize the weight at 1
        self.weight = torch.nn.Parameter(torch.Tensor([1]).float(),
                                         requires_grad=True)

    def forward(self, x):
        return self.weight * x


model = ExampleLinear()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)


def calculate_loss(x: torch.Tensor) -> torch.Tensor:
    y = 2 * x
    y_hat = model(x)
    temp1 = (y - y_hat)
    temp2 = temp1**2
    return temp2


# With multiple batches of size 1
batches = [torch.tensor([4.0]), torch.tensor([2.0])]

optimizer.zero_grad()
for i, batch in enumerate(batches):
    # The loss needs to be scaled, because the mean should be taken across the whole
    # dataset, which requires the loss to be divided by the number of batches.
    temp2 = calculate_loss(batch)
    loss = temp2 / len(batches)
    loss.backward()
    print(f"Batch size 1 (batch {i}) - grad: {model.weight.grad}")
    print(f"Batch size 1 (batch {i}) - weight: {model.weight}")
    print("="*50)

# Updating the model only after all batches
optimizer.step()
print(f"Batch size 1 (final) - grad: {model.weight.grad}")
print(f"Batch size 1 (final) - weight: {model.weight}")
