Why do we call .detach() before calling .numpy() on a PyTorch Tensor?

Question

I'm trying to get a better understanding of why.

In the question just linked, Blupon states:

You need to convert your tensor to another tensor that isn't requiring a gradient in addition to its actual value definition.

In the first discussion he links to, albanD states:

This is expected behavior because moving to numpy will break the graph and so no gradient will be computed.

If you don’t actually need gradients, then you can explicitly .detach() the Tensor that requires grad to get a tensor with the same content that does not require grad. This other Tensor can then be converted to a numpy array.

In the second discussion he links to, apaszke writes:

Variable's can’t be transformed to numpy, because they’re wrappers around tensors that save the operation history, and numpy doesn’t have such objects. You can retrieve a tensor held by the Variable, using the .data attribute. Then, this should work: var.data.numpy().

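I have studied the internal workings of PyTorch's autodifferentiation library, and I'm still confused by these answers. Why does it break the graph to move to numpy? Is it because any operations on the numpy array will not be tracked in the autodiff graph?

What is a Variable? How does it relate to a tensor?

I feel that a thorough, high-quality Stack Overflow answer is called for here, one that explains the reason to new PyTorch users who don't yet understand autodifferentiation.

In particular, I think it would be helpful to illustrate the graph with a figure and show how the disconnection occurs in this example: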

import torch

tensor1 = torch.tensor([1.0,2.0],requires_grad=True)

print(tensor1)
print(type(tensor1))

tensor1 = tensor1.numpy()

print(tensor1)
print(type(tensor1))

Answer 1

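I asked, why does it break the graph to move to numpy? Is it because any operations on the numpy array will not be tracked in the autodiff graph?

Yes, the new tensor will not be connected to the old tensor through a grad_fn, so any operations on the new tensor will not carry gradients back to the old tensor.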

Writing my_tensor.detach().numpy() is simply saying, "I'm going to do some non-tracked computations based on the value of this tensor in a numpy array."
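As a minimal sketch of that statement (the names a, b, c and arr are made up for illustration), the detached tensor keeps the values but carries no grad_fn, while the original branch can still be differentiated:

```python
import torch

# A leaf tensor that autograd tracks.
a = torch.tensor([1.0, 2.0], requires_grad=True)
b = a * 2                      # tracked: b has a grad_fn

c = b.detach()                 # same values, cut off from the graph
arr = c.numpy()                # allowed, because c does not require grad

print(b.requires_grad, b.grad_fn is not None)   # True True
print(c.requires_grad, c.grad_fn)               # False None
print(arr)                                      # [2. 4.]

# Gradients still flow through the tracked branch only.
b.sum().backward()
print(a.grad)                                   # tensor([2., 2.])
```

Nothing computed from c or arr afterwards can contribute to a.grad, which is exactly the "non-tracked" part.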

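The Dive into Deep Learning (d2l) textbook has a nice section describing the detach() method, although it doesn't discuss why detaching makes sense before converting to a numpy array.

Thanks to jodag for helping to answer this question. As he said, Variables are obsolete, so we can ignore that comment.

I think the best answer I can find so far is in jodag's doc link: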

To stop a tensor from tracking history, you can call .detach() to detach it from the computation history, and to prevent future computation from being tracked.

and in albanD's remarks that I quoted in the question:

If you don’t actually need gradients, then you can explicitly .detach() the Tensor that requires grad to get a tensor with the same content that does not require grad. This other Tensor can then be converted to a numpy array.

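In other words, the detach method means "I don't want gradients," and it is impossible to track gradients through numpy operations (after all, that is what PyTorch tensors are for!)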

Answer 2

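I think the most crucial point to understand here is the difference between a torch.tensor and an np.ndarray:
While both objects are used to store n-dimensional matrices (aka "Tensors"), torch.tensors have an additional "layer" that stores the computational graph leading to the associated n-dimensional matrix.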

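So, if you are only interested in an efficient and easy way to perform mathematical operations on matrices, np.ndarray or torch.tensor can be used interchangeably.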

However, torch.tensors are designed to be used in the context of gradient descent optimization, and therefore they hold not only a tensor with numeric values, but (and more importantly) the computational graph leading to these values. This computational graph is then used (via the chain rule of derivatives) to compute the derivative of the loss function w.r.t. each of the independent variables used to compute the loss.

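As mentioned before, an np.ndarray object does not have this extra "computational graph" layer. Therefore, when converting a torch.tensor to an np.ndarray, you must explicitly drop the tensor's link to its computational graph using the detach() command.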
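A small sketch of what this requirement looks like in practice, assuming a reasonably recent PyTorch release (the exact error text varies between versions, so it is printed rather than quoted; the flag direct_call_failed is only there for illustration):

```python
import torch

t = torch.tensor([1.0, 2.0], requires_grad=True)

try:
    t.numpy()                  # refused: t is still tracked by autograd
    direct_call_failed = False
except RuntimeError as err:
    direct_call_failed = True
    print("RuntimeError:", err)

arr = t.detach().numpy()       # explicit opt-out of gradient tracking
print(type(arr).__name__, arr) # ndarray [1. 2.]
```

PyTorch raises instead of silently detaching, so dropping gradient tracking is always a visible decision in the code.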

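Computational Graph
From what you've written, this concept seems a bit vague. I'll try to illustrate it with a simple example.
Consider a simple function of two (vector) variables, x and w: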

x = torch.rand(4, requires_grad=True)
w = torch.rand(4, requires_grad=True)

y = x @ w  # inner-product of x and w
z = y ** 2  # square the inner product

If we are only interested in the value of z, we need not worry about any graphs: we simply move forward from the inputs, x and w, to compute y and then z.

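However, what if we do not care so much about the value of z, but rather want to ask "what is the w that minimizes z for a given x"?
To answer that question, we need to compute the derivative of z w.r.t. w.
How can we do that?
Using the chain rule we know that dz/dw = dz/dy * dy/dw. That is, to compute the gradient of z w.r.t. w we need to move backward from z to w, computing the gradient of the operation at each step as we trace back our steps from z to w. This "path" we trace back is the computational graph of z, and it tells us how to compute the derivative of z w.r.t. the inputs leading to z: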

z.backward()  # ask pytorch to trace back the computation of z

We can now inspect the gradient of z w.r.t. w:

w.grad  # the resulting gradient of z w.r.t w
tensor([0.8010, 1.9746, 1.5904, 1.0408])

Note that this is exactly equal to

2*y*x
tensor([0.8010, 1.9746, 1.5904, 1.0408], grad_fn=<MulBackward0>)

since dz/dy = 2*y and dy/dw = x.
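The comparison above can be checked programmatically. This is a self-contained sketch that rebuilds the same x, w, y and z (the seed is arbitrary and only there for repeatability; manual_dz_dw and manual_dz_dx are illustrative names):

```python
import torch

torch.manual_seed(0)           # arbitrary seed, only for repeatability
x = torch.rand(4, requires_grad=True)
w = torch.rand(4, requires_grad=True)

y = x @ w                      # inner product
z = y ** 2                     # square of the inner product
z.backward()                   # fills x.grad and w.grad

# Chain rule by hand: dz/dw = dz/dy * dy/dw = 2*y * x (and symmetrically for x)
manual_dz_dw = (2 * y * x).detach()
manual_dz_dx = (2 * y * w).detach()

print(torch.allclose(w.grad, manual_dz_dw))   # True
print(torch.allclose(x.grad, manual_dz_dx))   # True
```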

Each tensor along the path stores its "contribution" to the computation:

z
tensor(1.4061, grad_fn=<PowBackward0>)

And

y
tensor(1.1858, grad_fn=<DotBackward>)

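As you can see, y and z store not only the "forward" value of <x, w> or y**2 but also the computational graph: the grad_fn that is needed to compute the derivatives (using the chain rule) when tracing back the gradients from z (output) to w (input).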

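These grad_fn are essential components of torch.tensors; without them one cannot compute derivatives of complicated functions. However, np.ndarrays do not have this capability at all, and they do not carry this information.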
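To make the graph a little more tangible, here is a rough sketch that walks backward from z.grad_fn using the next_functions attribute of autograd nodes. The helper walk is hypothetical, written only for this illustration, and the exact node class names depend on the PyTorch version:

```python
import torch

x = torch.rand(4, requires_grad=True)
w = torch.rand(4, requires_grad=True)
z = (x @ w) ** 2

def walk(node, depth=0):
    """Print and collect the class names of backward nodes reachable from node."""
    names = []
    if node is None:           # inputs that do not require grad show up as None
        return names
    name = type(node).__name__
    print("  " * depth + name)
    names.append(name)
    for parent, _ in node.next_functions:
        names.extend(walk(parent, depth + 1))
    return names

node_names = walk(z.grad_fn)   # power node, then dot node, then one leaf per input
print(x.grad_fn)               # None: leaf tensors have no grad_fn of their own
```

The leaves of this walk are the accumulation nodes for x and w, which is where backward() deposits x.grad and w.grad.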

For more information on tracing back the derivative using the backward() function, see the linked answer.

Since both np.ndarray and torch.tensor have a common "layer" storing an n-d array of numbers, pytorch uses the same storage to save memory:

numpy() → numpy.ndarray
Returns self tensor as a NumPy ndarray. This tensor and the returned ndarray share the same underlying storage. Changes to self tensor will be reflected in the ndarray and vice versa.

The other direction works in the same way as well:

torch.from_numpy(ndarray) → Tensor
Creates a Tensor from a numpy.ndarray.
The returned tensor and ndarray share the same memory. Modifications to the tensor will be reflected in the ndarray and vice versa.

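Thus, when creating an np.ndarray from a torch.tensor or vice versa, both objects reference the same underlying storage in memory. Since np.ndarray does not store or represent the computational graph associated with the array, this graph should be explicitly removed using detach() when numpy and torch are to reference the same data.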
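A short sketch of that sharing, under the assumption that the tensor lives on the CPU (the variable names are illustrative). detach() does not copy either, so the array obtained through it still aliases the original tensor's memory:

```python
import numpy as np
import torch

x = torch.ones(3, requires_grad=True)

arr = x.detach().numpy()       # no copy: arr views x's storage
arr[0] = 10.0                  # in-place write through the NumPy view

print(x)                       # the first element of x is now 10
print(np.shares_memory(arr, x.detach().numpy()))  # True

back = torch.from_numpy(arr)   # also shares memory, and starts untracked
print(back.requires_grad)      # False
```

Because such writes bypass autograd entirely, modifying the array while a backward pass still needs the old values can silently give wrong gradients.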

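Note that if you wish, for some reason, to use pytorch only for mathematical operations without back-propagation, you can use the with torch.no_grad() context manager, in which case computational graphs are not created and torch.tensors and np.ndarrays can be used interchangeably.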

import numpy as np
import torch

with torch.no_grad():
    x_t = torch.rand(3, 4)
    y_np = np.ones((4, 2), dtype=np.float32)
    x_t @ torch.from_numpy(y_np)  # dot product in torch
    np.dot(x_t.numpy(), y_np)  # the same dot product in numpy
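The effect of no_grad is easier to see when the input does require gradients. In this sketch (names are illustrative), the result computed inside the context has no graph and converts directly, while the input tensor itself is unchanged:

```python
import torch

x = torch.rand(3, requires_grad=True)

with torch.no_grad():
    y = x * 2                  # computed without recording a graph
    y_np = y.numpy()           # fine: y does not require grad

print(y.requires_grad, y.grad_fn)   # False None
print(x.requires_grad)              # True: x itself is unchanged
```

Note that x still requires grad, so converting x itself to numpy would still need detach(), even inside the context.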

Answer 3

Here is a little showcase of the tensor -> numpy array connection:

import torch
tensor = torch.rand(2)
numpy_array = tensor.numpy()
print('Before edit:')
print('Tensor:', tensor)
print('Numpy array:', numpy_array)

tensor[0] = 10

print()
print('After edit:')
print('Tensor:', tensor)
print('Numpy array:', numpy_array)

Output:

Before edit:
Tensor: tensor([0.1286, 0.4899])
Numpy array: [0.1285522  0.48987144]

After edit:
Tensor: tensor([10.0000,  0.4899])
Numpy array: [10.        0.48987144]

The value of the first element is shared by the tensor and the numpy array. Changing it to 10 in the tensor changed it in the numpy array as well.
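If that coupling is not wanted, one option is to copy on the NumPy side. This is only a sketch with illustrative names; cloning the tensor before converting should work just as well:

```python
import torch

tensor = torch.rand(2)
shared = tensor.numpy()               # view of the tensor's memory
independent = tensor.numpy().copy()   # NumPy-side copy with its own memory

tensor[0] = 10

print(shared[0])       # 10.0
print(independent[0])  # still the original random value
```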
