Difference between autograd.grad and autograd.backward?

Suppose I have my custom loss function, and I want to fit the solution of some differential equation with the help of my neural network. So, in each forward pass, I compute the output of my neural network and then calculate the loss by taking the MSE between the output and the expected equation to which I want to fit my perceptron.

Now my doubt is: should I use grad(loss) or should I use loss.backward() for backpropagation to calculate and update my gradients?

I understand that while using loss.backward() I have to wrap my tensors with Variable and have to set requires_grad=True for the variables w.r.t. which I want the gradients of my loss.

So my question is: it would be great if you could explain the practical implications of both of these methods, because whenever I try to look this up online, I get bombarded with a lot of things that are not very relevant to my project.
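For concreteness, the setup described above reads roughly like the following sketch, assuming a toy ODE du/dx = -u; the network, the collocation points x, and the residual are hypothetical placeholders:

import torch
import torch.nn as nn

# Hypothetical network and collocation points for fitting u(x) with du/dx = -u.
net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))
x = torch.linspace(0, 1, 100).reshape(-1, 1).requires_grad_(True)

u = net(x)  # forward pass
# derivative of the network output w.r.t. its input, kept differentiable
du_dx, = torch.autograd.grad(u, x, torch.ones_like(u), create_graph=True)
loss = torch.mean((du_dx + u) ** 2)  # MSE of the ODE residual

Everything the answers below discuss concerns what happens from loss onwards.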

TL;DR: Both are two different interfaces for performing gradient computation: torch.autograd.grad is non-mutating while torch.autograd.backward is.


Description

The torch.autograd module is PyTorch's automatic differentiation package. As described in the documentation, adopting it requires only minimal changes to an existing codebase:

you only need to declare Tensors for which gradients should be computed with the requires_grad=True keyword.
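For instance, a minimal illustration of that single keyword:

>>> a = torch.tensor([2.0])                      # not tracked by autograd
>>> b = torch.tensor([2.0], requires_grad=True)  # operations on b will be recorded
>>> a.requires_grad, b.requires_grad
(False, True)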

The two main functions torch.autograd provides for gradient computation are torch.autograd.backward and torch.autograd.grad:

torch.autograd.backward (source)

Description: Computes the sum of gradients of given tensors with respect to graph leaves.

Header:

torch.autograd.backward(
    tensors,
    grad_tensors=None,
    retain_graph=None,
    create_graph=False,
    grad_variables=None,
    inputs=None)

Parameters:
- tensors – Tensors of which the derivative will be computed.
- grad_tensors – The "vector" in the Jacobian-vector product, usually gradients w.r.t. each element of corresponding tensors.
- retain_graph – If False, the graph used to compute the grad will be freed. [...]
- inputs – Inputs w.r.t. which the gradient will be accumulated into .grad. All other Tensors will be ignored. If not provided, the gradient is accumulated into all the leaf Tensors that were used [...].

torch.autograd.grad (source)

Description: Computes and returns the sum of gradients of outputs with respect to the inputs.

Header:

torch.autograd.grad(
    outputs,
    inputs,
    grad_outputs=None,
    retain_graph=None,
    create_graph=False,
    only_inputs=True,
    allow_unused=False)

Parameters:
- outputs – outputs of the differentiated function.
- inputs – Inputs w.r.t. which the gradient will be returned (and not accumulated into .grad).
- grad_outputs – The "vector" in the Jacobian-vector product, usually gradients w.r.t. each element of corresponding outputs.
- retain_graph – If False, the graph used to compute the grad will be freed. [...]

Example usage

In terms of high-level usage, you can look at torch.autograd.grad as a non-mutating function. As mentioned in the documentation excerpts above, it does not accumulate the gradients into the grad attribute but instead returns the computed partial derivatives. In contrast, torch.autograd.backward mutates the tensors involved by updating the grad attribute of the leaf nodes; the function itself returns nothing. In other words, the latter is preferable when computing gradients for a large number of parameters.

Below, we take two inputs (x1 and x2), compute a tensor y from them, and then ask for the partial derivatives of the result w.r.t. both inputs, i.e. dL/dx1 and dL/dx2:

>>> x1 = torch.rand(1, requires_grad=True)
>>> x2 = torch.rand(1, requires_grad=True)
>>> x1, x2
(tensor([0.3939], requires_grad=True),
 tensor([0.7965], requires_grad=True))

Inference:

>>> y = x1**2 + 5*x2
>>> y
tensor([4.1377], grad_fn=<AddBackward0>)

Since y was computed from tensors requiring gradients (requires_grad=True), outside of a torch.no_grad context, it gets a grad_fn function attached. This callback is what is used to backpropagate through the computation graph and compute the gradients of the preceding tensor nodes.
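You can inspect this callback directly; for instance (memory addresses elided), y's grad_fn is the addition node, and behind it sit the power and multiplication nodes of the expression above:

>>> y.grad_fn
<AddBackward0 object at 0x...>
>>> y.grad_fn.next_functions  # the nodes feeding into the addition
((<PowBackward0 object at 0x...>, 0), (<MulBackward0 object at 0x...>, 0))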

  • torch.autograd.grad:

    Here we provide torch.ones_like(y) as the grad_outputs argument:

    >>> torch.autograd.grad(y, (x1, x2), torch.ones_like(y))
    (tensor([0.7879]), tensor([5.]))
    

    The output above is a tuple containing the two partial derivatives w.r.t. the inputs provided, respectively in their order of appearance, i.e. dL/dx1 and dL/dx2.

    This corresponds to the following computation (a quick numerical check follows after this list):

    # dL/dx1 = dL/dy * dy/dx1 = grad_outputs @ 2*x1
    # dL/dx2 = dL/dy * dy/dx2 = grad_outputs @ 5
    
  • torch.autograd.backward: in contrast, it mutates the tensors involved, updating the grad attribute of the tensors that were used to compute the output tensor and that require gradients. It is equivalent to the torch.Tensor.backward API. Here we go through the same example, defining x1, x2, and y again, and call backward:

    >>> # equivalent to y.backward(torch.ones_like(y))
    >>> torch.autograd.backward(y, torch.ones_like(y))
    

    You can then retrieve the gradients from x1.grad and x2.grad:

    >>> x1.grad, x2.grad
    (tensor([0.7879]), tensor([5.]))
    
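As a quick sanity check, the values retrieved above match the analytic derivatives dy/dx1 = 2*x1 and dy/dx2 = 5 (assuming the same x1 and x2 as above):

>>> torch.allclose(x1.grad, 2 * x1)
True
>>> torch.allclose(x2.grad, 5 * torch.ones_like(x2))
True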

Conclusion: both perform the same operation. They are two different interfaces for interacting with the autograd library and performing gradient computations. The latter, torch.autograd.backward (equivalent to torch.Tensor.backward), is generally used in neural-network training loops to compute the partial derivatives of the loss w.r.t. each of the model's parameters.
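As a rough sketch of that typical usage (the model, data, and learning rate here are placeholder assumptions), both interfaces can drive the same parameter update:

import torch
import torch.nn as nn

model = nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, target = torch.randn(8, 4), torch.randn(8, 1)

# Training-loop style: accumulate into .grad, then let the optimizer step.
opt.zero_grad()
loss = nn.functional.mse_loss(model(x), target)
loss.backward()  # i.e. torch.autograd.backward(loss)
opt.step()

# Functional style: gradients are returned; .grad is left untouched.
loss = nn.functional.mse_loss(model(x), target)
grads = torch.autograd.grad(loss, list(model.parameters()))
with torch.no_grad():
    for p, g in zip(model.parameters(), grads):
        p -= 0.1 * g  # manual SGD update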

You can find out more about how torch.autograd.grad works by reading my other answer on [...].

In addition to Ivan's answer, the fact that torch.autograd.grad does not accumulate gradients into .grad can avoid race conditions in multi-threaded scenarios.

Quoting the PyTorch documentation (https://pytorch.org/docs/stable/notes/autograd.html#non-determinism):

If you are calling backward() on multiple thread concurrently but with shared inputs (i.e. Hogwild CPU training). Since parameters are automatically shared across threads, gradient accumulation might become non-deterministic on backward calls across threads, because two backward calls might access and try to accumulate the same .grad attribute. This is technically not safe, and it might result in racing condition and the result might be invalid to use.

But this is expected pattern if you are using the multithreading approach to drive the whole training process but using shared parameters, user who use multithreading should have the threading model in mind and should expect this to happen. User could use the functional API torch.autograd.grad() to calculate the gradients instead of backward() to avoid non-determinism.
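For illustration, here is a minimal sketch of that suggestion (the shared parameter and the per-thread losses are made up for the example): each thread computes its own gradient tuple with torch.autograd.grad, so no thread ever writes to the shared .grad attribute.

import threading
import torch

w = torch.ones(3, requires_grad=True)  # parameter shared across threads
results = [None] * 4

def worker(i):
    # Each thread builds its own graph; gradients are returned,
    # not accumulated into w.grad, so there is nothing to race on.
    loss = ((i + 1) * w).sum()
    results[i] = torch.autograd.grad(loss, w)[0]

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(w.grad)      # None: .grad was never written to
print(results[0])  # tensor([1., 1., 1.])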

Implementation details: https://github.com/pytorch/pytorch/blob/7e3a694b23b383e38f5e39ef960ba8f374d22404/torch/csrc/autograd/functions/accumulate_grad.h