Cannot find in-place operation causing "RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation:"

I'm fairly new to PyTorch, and I'm trying to reproduce an algorithm from an academic paper that approximates a term using the Hessian matrix. I've set up a toy problem so that I can compare the results of the full Hessian against the approximation. I found this gist and have been playing with it to compute the full-Hessian part of the algorithm.

I'm getting the error: "RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation."

I've searched through simple example code, the documentation, and many forum posts about this issue, but can't find any in-place operations. Any help would be appreciated!

Here's my code:

import torch
import torch.autograd as autograd
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np

torch.set_printoptions(precision=20, linewidth=180)

def jacobian(y, x, create_graph=False):
    jac = []
    flat_y = y.reshape(-1)
    grad_y = torch.zeros_like(flat_y)

    for i in range(len(flat_y)):
        grad_y[i] = 1.
        grad_x, = torch.autograd.grad(flat_y, x, grad_y, retain_graph=True, create_graph=create_graph)
        jac.append(grad_x.reshape(x.shape))
        grad_y[i] = 0.
    return torch.stack(jac).reshape(y.shape + x.shape)

def hessian(y, x):
    return jacobian(jacobian(y, x, create_graph=True), x)

def f(x):
    return x * x

np.random.seed(435537698)

num_dims = 2
num_samples = 3

X = [np.random.uniform(size=num_dims) for i in range(num_samples)]
print('X: \n{}\n\n'.format(X))

mean = torch.Tensor(np.mean(X, axis=0))
mean.requires_grad = True
print('mean: \n{}\n\n'.format(mean))

cov = torch.Tensor(np.cov(X, rowvar=False))
print('cov: \n{}\n\n'.format(cov))

with autograd.detect_anomaly():
    hessian_matrices = hessian(f(mean), mean)
    print('hessian: \n{}\n\n'.format(hessian_matrices))

Here's the output, including the stack trace:

X: 
[array([0.81700949, 0.17141617]), array([0.53579366, 0.31141496]), array([0.49756485, 0.97495776])]


mean: 
tensor([0.61678934097290039062, 0.48592963814735412598], requires_grad=True)


cov: 
tensor([[ 0.03043144382536411285, -0.05357056483626365662],
        [-0.05357056483626365662,  0.18426130712032318115]])


---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-3-5a1c492d2873> in <module>()
     42 
     43 with autograd.detect_anomaly():
---> 44     hessian_matrices = hessian(f(mean), mean)
     45     print('hessian: \n{}\n\n'.format(hessian_matrices))

2 frames
<ipython-input-3-5a1c492d2873> in hessian(y, x)
     21 
     22 def hessian(y, x):
---> 23     return jacobian(jacobian(y, x, create_graph=True), x)
     24 
     25 def f(x):

<ipython-input-3-5a1c492d2873> in jacobian(y, x, create_graph)
     15     for i in range(len(flat_y)):
     16         grad_y[i] = 1.
---> 17         grad_x, = torch.autograd.grad(flat_y, x, grad_y, retain_graph=True, create_graph=create_graph)
     18         jac.append(grad_x.reshape(x.shape))
     19         grad_y[i] = 0.

/usr/local/lib/python3.6/dist-packages/torch/autograd/__init__.py in grad(outputs, inputs, grad_outputs, retain_graph, create_graph, only_inputs, allow_unused)
    155     return Variable._execution_engine.run_backward(
    156         outputs, grad_outputs, retain_graph, create_graph,
--> 157         inputs, allow_unused)
    158 
    159 

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [2]] is at version 4; expected version 3 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

I sincerely thought this was a bug in PyTorch, but after filing a bug report I got a great answer from albanD: https://github.com/pytorch/pytorch/issues/36903#issuecomment-616671247. He also pointed out that https://discuss.pytorch.org/ is available for questions like this.

The problem arises because we traverse the computation graph multiple times. Exactly what is going on there is beyond me, though...
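
Some background on the version numbers in the error message: autograd stamps every tensor with an internal version counter that in-place operations increment, and the backward pass checks that each tensor saved for gradient computation still has the version it was saved with. A minimal sketch of my own (not from the thread; ._version is an internal attribute, used here only for illustration):

import torch

# In-place operations bump a tensor's internal version counter.
t = torch.zeros(2)
print(t._version)  # 0
t[0] = 1.          # an in-place edit
print(t._version)  # 1

# If a tensor was saved for the backward pass and is then edited in place,
# the version check fails with the same RuntimeError as above.
x = torch.ones(2, requires_grad=True)
w = torch.ones(2)
y = (x * w).sum()  # mul saves w, since dy/dx depends on it
w[0] = 5.          # in-place edit after w was saved
y.backward()       # RuntimeError: ... modified by an inplace operation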

The in-place edits your error message refers to are the obvious ones: grad_y[i] = 1. and grad_y[i] = 0. Reusing the same grad_y tensor over and over in the computation is what causes the trouble. Redefining jacobian(...) as follows worked for me:

def jacobian(y, x, create_graph=False):
    jac = []
    flat_y = y.reshape(-1)
    for i in range(len(flat_y)):
        # Allocate a fresh grad_y on every iteration instead of editing one
        # tensor in place: the graph built with create_graph=True keeps a
        # reference to the tensor passed as grad_outputs.
        grad_y = torch.zeros_like(flat_y)
        grad_y[i] = 1.
        grad_x, = torch.autograd.grad(flat_y, x, grad_y, retain_graph=True, create_graph=create_graph)
        jac.append(grad_x.reshape(x.shape))
    return torch.stack(jac).reshape(y.shape + x.shape)
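
As a quick sanity check of the fix (my own addition, not part of the original answer): for the element-wise f(x) = x * x the Hessian is known analytically, with d²y_i/(dx_j dx_k) = 2 when i = j = k and 0 otherwise, so the result can be verified directly:

x = torch.tensor([0.5, -1.0], requires_grad=True)
H = hessian(f(x), x)               # shape (2, 2, 2): d²y_i / (dx_j dx_k)
expected = torch.zeros(2, 2, 2)
expected[0, 0, 0] = expected[1, 1, 1] = 2.
assert torch.allclose(H, expected)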

Another approach also works, though it feels more like black magic to me: leave jacobian(...) as it is, and instead redefine f(x) as

def f(x):
    return x * x * 1

This works too. Presumably the extra multiplication by a constant makes the backward pass produce grad_y * 1 as a fresh intermediate tensor, and it is that copy, rather than grad_y itself, that the second-order graph holds on to, so the later in-place edits of grad_y no longer invalidate a saved tensor.

For future readers, the RuntimeError in the title may show up in more general settings than the OP's, e.g. when moving tensor slices around and/or manipulating tensors inside list comprehensions, since that's the context that led me here (this was the first link my search engine returned for the RuntimeError).

To prevent this RuntimeError and make sure gradients can flow smoothly, the basic principle that helped me most (mentioned in the link above, but missing from the accepted solution) is to move torch.Tensors (or parts of them) around using their .clone() method.

For example:

some_container[slice_indices] = original_tensor[slice_indices].clone()

where only original_tensor has requires_grad=True, and the subsequent (possibly batched) operations are performed on the tensor some_container.

Or:

some_container = [
    tensor.clone() 
    for tensor in some_tensor_list if some_condition_fn(tensor)
]
new_composed_tensor = torch.cat(some_container, dim=0)
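
To make the pattern concrete, here's a minimal runnable sketch (all names are illustrative, not from any real codebase) showing gradients flowing back through a container filled via sliced .clone() assignment:

import torch

original_tensor = torch.randn(4, 3, requires_grad=True)
some_container = torch.zeros(4, 3)

# Copy the first two rows via .clone(); autograd records the slice assignment,
# so gradients can flow through some_container back to original_tensor.
slice_indices = slice(0, 2)
some_container[slice_indices] = original_tensor[slice_indices].clone()

loss = some_container.sum()
loss.backward()
print(original_tensor.grad)  # ones in the first two rows, zeros elsewhere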