Cannot find in-place operation causing "RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation:"

I'm fairly new to PyTorch, and I'm trying to reproduce an algorithm from an academic paper that approximates a term using the Hessian matrix. I've set up a toy problem so that I can compare the results of the full Hessian against the approximation. I found this gist and have been playing with it to compute the full-Hessian part of the algorithm.

I'm getting the error: "RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation."

I've searched through simple example code, the documentation, and many forum posts about this issue, but can't find any in-place operations. Any help would be appreciated!

Here's my code:

import torch
import torch.autograd as autograd
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np

torch.set_printoptions(precision=20, linewidth=180)

def jacobian(y, x, create_graph=False):
    jac = []
    flat_y = y.reshape(-1)
    grad_y = torch.zeros_like(flat_y)

    for i in range(len(flat_y)):
        grad_y[i] = 1.
        grad_x, = torch.autograd.grad(flat_y, x, grad_y, retain_graph=True, create_graph=create_graph)
        jac.append(grad_x.reshape(x.shape))
        grad_y[i] = 0.
    return torch.stack(jac).reshape(y.shape + x.shape)

def hessian(y, x):
    return jacobian(jacobian(y, x, create_graph=True), x)

def f(x):
    return x * x

np.random.seed(435537698)

num_dims = 2
num_samples = 3

X = [np.random.uniform(size=num_dims) for i in range(num_samples)]
print('X: \n{}\n\n'.format(X))

mean = torch.Tensor(np.mean(X, axis=0))
mean.requires_grad = True
print('mean: \n{}\n\n'.format(mean))

cov = torch.Tensor(np.cov(X, rowvar=False))
print('cov: \n{}\n\n'.format(cov))

with autograd.detect_anomaly():
    hessian_matrices = hessian(f(mean), mean)
    print('hessian: \n{}\n\n'.format(hessian_matrices))

Here's the output, including the stack trace:

X: 
[array([0.81700949, 0.17141617]), array([0.53579366, 0.31141496]), array([0.49756485, 0.97495776])]


mean: 
tensor([0.61678934097290039062, 0.48592963814735412598], requires_grad=True)


cov: 
tensor([[ 0.03043144382536411285, -0.05357056483626365662],
        [-0.05357056483626365662,  0.18426130712032318115]])


---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-3-5a1c492d2873> in <module>()
     42 
     43 with autograd.detect_anomaly():
---> 44     hessian_matrices = hessian(f(mean), mean)
     45     print('hessian: \n{}\n\n'.format(hessian_matrices))

2 frames
<ipython-input-3-5a1c492d2873> in hessian(y, x)
     21 
     22 def hessian(y, x):
---> 23     return jacobian(jacobian(y, x, create_graph=True), x)
     24 
     25 def f(x):

<ipython-input-3-5a1c492d2873> in jacobian(y, x, create_graph)
     15     for i in range(len(flat_y)):
     16         grad_y[i] = 1.
---> 17         grad_x, = torch.autograd.grad(flat_y, x, grad_y, retain_graph=True, create_graph=create_graph)
     18         jac.append(grad_x.reshape(x.shape))
     19         grad_y[i] = 0.

/usr/local/lib/python3.6/dist-packages/torch/autograd/__init__.py in grad(outputs, inputs, grad_outputs, retain_graph, create_graph, only_inputs, allow_unused)
    155     return Variable._execution_engine.run_backward(
    156         outputs, grad_outputs, retain_graph, create_graph,
--> 157         inputs, allow_unused)
    158 
    159 

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [2]] is at version 4; expected version 3 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

I sincerely thought this was a bug in PyTorch, but after filing a bug report I got a great answer from albanD: https://github.com/pytorch/pytorch/issues/36903#issuecomment-616671247. He also pointed out that https://discuss.pytorch.org/ is available for questions like this.

The problem arises because we traverse the computation graph multiple times. Exactly what is going on there is beyond me, though...
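
Some background on the version numbers in the error message: autograd stamps every tensor with an internal version counter that in-place operations increment, and the backward pass checks that each tensor saved for gradient computation still has the version it was saved with. A minimal sketch of my own (not from the thread; ._version is an internal attribute, used here only for illustration):

import torch

# In-place operations bump a tensor's internal version counter.
t = torch.zeros(2)
print(t._version)  # 0
t[0] = 1.          # an in-place edit
print(t._version)  # 1

# If a tensor was saved for the backward pass and is then edited in place,
# the version check fails with the same RuntimeError as above.
x = torch.ones(2, requires_grad=True)
w = torch.ones(2)
y = (x * w).sum()  # mul saves w, since dy/dx depends on it
w[0] = 5.          # in-place edit after w was saved
y.backward()       # RuntimeError: ... modified by an inplace operation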

The in-place edits your error message refers to are the obvious ones: grad_y[i] = 1. and grad_y[i] = 0. Reusing the same grad_y tensor over and over in the computation is what causes the trouble. Redefining jacobian(...) as follows worked for me:

def jacobian(y, x, create_graph=False):
    jac = []
    flat_y = y.reshape(-1)
    for i in range(len(flat_y)):
        # Allocate a fresh grad_y on every iteration instead of editing one
        # tensor in place: the graph built with create_graph=True keeps a
        # reference to the tensor passed as grad_outputs.
        grad_y = torch.zeros_like(flat_y)
        grad_y[i] = 1.
        grad_x, = torch.autograd.grad(flat_y, x, grad_y, retain_graph=True, create_graph=create_graph)
        jac.append(grad_x.reshape(x.shape))
    return torch.stack(jac).reshape(y.shape + x.shape)
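
As a quick sanity check of the fix (my own addition, not part of the original answer): for the element-wise f(x) = x * x the Hessian is known analytically, with d²y_i/(dx_j dx_k) = 2 when i = j = k and 0 otherwise, so the result can be verified directly:

x = torch.tensor([0.5, -1.0], requires_grad=True)
H = hessian(f(x), x)               # shape (2, 2, 2): d²y_i / (dx_j dx_k)
expected = torch.zeros(2, 2, 2)
expected[0, 0, 0] = expected[1, 1, 1] = 2.
assert torch.allclose(H, expected)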

Another approach also works, though it feels more like black magic to me: leave jacobian(...) as it is, and instead redefine f(x) as

def f(x):
    return x * x * 1

This works too. Presumably the extra multiplication by a constant makes the backward pass produce grad_y * 1 as a fresh intermediate tensor, and it is that copy, rather than grad_y itself, that the second-order graph holds on to, so the later in-place edits of grad_y no longer invalidate a saved tensor.

For future readers, the RuntimeError in the title may show up in more general settings than the OP's, e.g. when moving tensor slices around and/or manipulating tensors inside list comprehensions, since that's the context that led me here (this was the first link my search engine returned for the RuntimeError).

To prevent this RuntimeError and make sure gradients can flow smoothly, the basic principle that helped me most (mentioned in the link above, but missing from the accepted solution) is to move torch.Tensors (or parts of them) around using their .clone() method.

For example:

some_container[slice_indices] = original_tensor[slice_indices].clone()

where only original_tensor has requires_grad=True, and the subsequent (possibly batched) operations are performed on the tensor some_container.

Or:

some_container = [
    tensor.clone() 
    for tensor in some_tensor_list if some_condition_fn(tensor)
]
new_composed_tensor = torch.cat(some_container, dim=0)
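
To make the pattern concrete, here's a minimal runnable sketch (all names are illustrative, not from any real codebase) showing gradients flowing back through a container filled via sliced .clone() assignment:

import torch

original_tensor = torch.randn(4, 3, requires_grad=True)
some_container = torch.zeros(4, 3)

# Copy the first two rows via .clone(); autograd records the slice assignment,
# so gradients can flow through some_container back to original_tensor.
slice_indices = slice(0, 2)
some_container[slice_indices] = original_tensor[slice_indices].clone()

loss = some_container.sum()
loss.backward()
print(original_tensor.grad)  # ones in the first two rows, zeros elsewhere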