net.zero_grad() vs optim.zero_grad() pytorch

Here they mention the need to include optim.zero_grad() when training to zero the parameter gradients. My question is: could I call net.zero_grad() instead, and would that have the same effect? Or is it necessary to call optim.zero_grad()? Moreover, what happens if I do both? If I do none, then the gradients get accumulated, but what does that exactly mean? Do they get added? In other words, what is the difference between optim.zero_grad() and net.zero_grad()? I am asking because here, on line 115, they use net.zero_grad(), which is the first time I have seen it. That is an implementation of a reinforcement learning algorithm, where one has to be especially careful with the gradients because there are multiple networks and gradients, so I suppose there is a reason for them to call net.zero_grad() rather than optim.zero_grad().

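net.zero_grad() sets the gradients of all its parameters (including the parameters of submodules) to zero. Calling optim.zero_grad() does the same, but for all parameters that have been registered with the optimizer. If the optimizer was given only net.parameters(), e.g. optim = Adam(net.parameters(), lr=1e-3), then the two calls are equivalent, since they cover exactly the same parameters.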
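As a quick check of that equivalence, here is a minimal sketch. The toy network, the input shapes and the helper names (grads_cleared, populate_grads) are made up for illustration. Recent PyTorch releases reset .grad to None by default rather than to a zero tensor, so the helper treats both as "cleared".

```python
import torch
from torch import nn
from torch.optim import Adam

torch.manual_seed(0)

# Hypothetical toy model with submodules, for illustration only.
net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
optim = Adam(net.parameters(), lr=1e-3)


def grads_cleared(params):
    """True if every gradient is either None or all zeros."""
    return all(p.grad is None or not p.grad.any() for p in params)


def populate_grads():
    """Run one forward/backward pass so the parameters receive gradients."""
    loss = net(torch.randn(16, 4)).pow(2).mean()
    loss.backward()


populate_grads()
has_grads_before = not grads_cleared(net.parameters())

net.zero_grad()  # clear via the module
cleared_by_net = grads_cleared(net.parameters())

populate_grads()
optim.zero_grad()  # clear via the optimizer
cleared_by_optim = grads_cleared(net.parameters())

print(has_grads_before, cleared_by_net, cleared_by_optim)
```

Both calls leave every parameter of net without a usable gradient, because the optimizer here holds exactly net.parameters().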

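You might have other parameters that are optimized by the same optimizer but are not part of net. In that case you would either have to set their gradients to zero manually, and therefore keep track of all of them, or you can simply call optim.zero_grad() to ensure that every parameter being optimized has its gradient set to zero.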
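To illustrate the case where the two calls differ, the sketch below registers one extra parameter with the optimizer that does not belong to net. The names extra and backward_once and the loss itself are hypothetical, chosen so that the gradient of extra is exactly 3.

```python
import torch
from torch import nn
from torch.optim import Adam

net = nn.Linear(3, 1)
# Hypothetical extra parameter that lives outside `net`.
extra = nn.Parameter(torch.zeros(1))
# The optimizer tracks the parameters of `net` plus the extra one.
optim = Adam(list(net.parameters()) + [extra], lr=1e-3)


def backward_once():
    """Toy loss whose gradient with respect to `extra` is exactly 3."""
    x = torch.ones(5, 3)
    loss = net(x).sum() + 3.0 * extra.sum()
    loss.backward()


backward_once()

net.zero_grad()  # only touches the parameters of `net`
extra_grad_after_net_zero = extra.grad.item()  # still holds the old gradient

optim.zero_grad()  # touches everything the optimizer was given
extra_cleared_by_optim = extra.grad is None or not extra.grad.any()

print(extra_grad_after_net_zero, extra_cleared_by_optim)
```

If only net.zero_grad() were called in a training loop like this, the gradient of extra would keep growing from one iteration to the next.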

Moreover, what happens if I do both?

Nothing; the gradients are simply set to zero again, and since they were already zero, it makes no difference at all.

If I do none, then the gradients get accumulated, but what does that exactly mean? Do they get added?

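Yes, they are added to the existing gradients. In the backward pass, the gradient with respect to every parameter is computed and then added to that parameter's stored gradient (param.grad). This allows you to run multiple backward passes that affect the same parameters, which would not be possible if the gradients were overwritten instead of being added.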
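A tiny sketch of that accumulation, using a single made-up tensor w instead of a network: two backward passes without zeroing in between leave the sum of both gradients in w.grad.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

# First backward pass: d/dw of sum(3 * w) is 3 for each element.
(3.0 * w).sum().backward()
first = w.grad.clone()

# Second backward pass without zeroing: d/dw of sum(2 * w) is 2,
# which is added to the gradient already stored in w.grad.
(2.0 * w).sum().backward()
accumulated = w.grad.clone()

print(first)        # tensor([3., 3.])
print(accumulated)  # tensor([5., 5.])
```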

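For example, you can accumulate the gradients over multiple batches if you need larger batches for training stability but do not have enough memory to increase the batch size. This is trivial to achieve in PyTorch: you essentially leave out optim.zero_grad() and delay optim.step() until you have gathered enough steps, as shown in HuggingFace - Training Neural Nets on Larger Batches: Practical Tips for 1-GPU, Multi-GPU & Distributed setups.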
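A minimal sketch of such a gradient-accumulation loop follows. It is not taken from the linked article; the model, the synthetic data, the accum_steps value and the choice to divide the loss by accum_steps (so that the accumulated gradient is an average rather than a sum) are all assumptions made for illustration.

```python
import torch
from torch import nn
from torch.optim import SGD

torch.manual_seed(0)

net = nn.Linear(10, 1)
optim = SGD(net.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

# Hypothetical setting: 4 micro-batches of 8 samples emulate a batch of 32.
accum_steps = 4
# Synthetic data standing in for a real DataLoader.
batches = [(torch.randn(8, 10), torch.randn(8, 1)) for _ in range(8)]

num_updates = 0
optim.zero_grad()
for i, (x, y) in enumerate(batches, start=1):
    # Scale the loss so the accumulated gradient is the mean over
    # all micro-batches instead of their sum.
    loss = loss_fn(net(x), y) / accum_steps
    loss.backward()  # adds into param.grad, nothing is overwritten

    if i % accum_steps == 0:
        optim.step()       # one update per accum_steps micro-batches
        optim.zero_grad()  # start accumulating from scratch again
        num_updates += 1

print(num_updates)
```

With 8 micro-batches and accum_steps = 4, the optimizer takes 2 steps, and the gradients are cleared only right after each step.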

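This flexibility comes at the cost of having to set the gradients to zero manually. Frankly, one line is a very small price to pay, even though many users will never make use of gradient accumulation, and beginners in particular may find it confusing.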

reinforcement-learning

pytorch