Forward vs reverse mode differentiation - Pytorch
In the first example of Learning PyTorch with Examples, the author demonstrates how to create a neural network with numpy. Their code is pasted below for convenience:
# from: https://pytorch.org/tutorials/beginner/pytorch_with_examples.html
# -*- coding: utf-8 -*-
import numpy as np
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10
# Create random input and output data
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)
# Randomly initialize weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)
learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)

    # Compute and print loss
    loss = np.square(y_pred - y).sum()
    print(t, loss)

    # Backprop to compute gradients of w1 and w2 with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T)
    grad_h = grad_h_relu.copy()
    grad_h[h < 0] = 0
    grad_w1 = x.T.dot(grad_h)

    # Update weights
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2
What confuses me is why the gradients of w1 and w2 are computed with respect to the loss (second-to-last block of code).
Normally the opposite computation happens: the gradient of the loss is computed with respect to the weights, as quoted here:
- "When training neural networks, we think of the cost (a value describing how bad a neural network performs) as a function of the parameters (numbers describing how the network behaves). We want to calculate the derivatives of the cost with respect to all the parameters, for use in gradient descent. Now, there’s often millions, or even tens of millions of parameters in a neural network. So, reverse-mode differentiation, called backpropagation in the context of neural networks, gives us a massive speed up!" (Colah's blog).
So my question is: why is the order of the derivative computation in the example above the reverse of the normal backpropagation computation?
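For reference, here is a minimal sketch (mine, not from the tutorial) of the same training step written with PyTorch tensors, where a single loss.backward() call performs the reverse-mode pass and fills w1.grad and w2.grad; shapes and the update rule mirror the numpy version above:

import torch

N, D_in, H, D_out = 64, 1000, 100, 10
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# requires_grad=True asks autograd to track these tensors as leaves
w1 = torch.randn(D_in, H, requires_grad=True)
w2 = torch.randn(H, D_out, requires_grad=True)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: same math as the numpy version
    y_pred = x.mm(w1).clamp(min=0).mm(w2)
    loss = (y_pred - y).pow(2).sum()

    # Reverse-mode pass: one backward sweep computes d(loss)/d(w1) and d(loss)/d(w2)
    loss.backward()

    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad
        w1.grad.zero_()
        w2.grad.zero_()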
It looks like a typo in the comment. They are actually computing the gradient of loss w.r.t. w2 and w1.

Let's quickly derive the gradient of loss w.r.t. w2 just to be sure. By inspection of your code we have

$$\mathrm{loss} = \sum_{i,j} \big(y_{\mathrm{pred}} - y\big)_{ij}^{2}, \qquad y_{\mathrm{pred}} = h_{\mathrm{relu}}\, w_2 .$$

Using the chain rule of calculus,

$$\frac{\partial\,\mathrm{loss}}{\partial w_2} = \frac{\partial\,\mathrm{loss}}{\partial y_{\mathrm{pred}}} \cdot \frac{\partial y_{\mathrm{pred}}}{\partial w_2} .$$

Each term can be expressed using the basic rules of matrix calculus. These turn out to be

$$\frac{\partial\,\mathrm{loss}}{\partial y_{\mathrm{pred}}} = 2\,\big(y_{\mathrm{pred}} - y\big)$$

and, since $y_{\mathrm{pred}} = h_{\mathrm{relu}}\, w_2$ is linear in $w_2$, the second factor amounts to left-multiplication by $h_{\mathrm{relu}}^{\mathsf T}$.

Plugging these terms back into the initial equation we get

$$\frac{\partial\,\mathrm{loss}}{\partial w_2} = h_{\mathrm{relu}}^{\mathsf T}\; 2\,\big(y_{\mathrm{pred}} - y\big),$$

which perfectly matches the expressions described by
grad_y_pred = 2.0 * (y_pred - y) # gradient of loss w.r.t. y_pred
grad_w2 = h_relu.T.dot(grad_y_pred) # gradient of loss w.r.t. w2
in the backpropagation code you provided.
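If you want to double-check this numerically, here is a quick finite-difference sanity check (a sketch of mine, not part of the tutorial). It compares the analytic grad_w2 from above against central differences on a small network:

import numpy as np

rng = np.random.default_rng(0)
N, D_in, H, D_out = 4, 5, 3, 2
x = rng.standard_normal((N, D_in))
y = rng.standard_normal((N, D_out))
w1 = rng.standard_normal((D_in, H))
w2 = rng.standard_normal((H, D_out))

def loss_fn(w2_):
    # Same forward pass and loss as the tutorial code
    h_relu = np.maximum(x.dot(w1), 0)
    y_pred = h_relu.dot(w2_)
    return np.square(y_pred - y).sum()

# Analytic gradient of loss w.r.t. w2, exactly as in the tutorial
h_relu = np.maximum(x.dot(w1), 0)
y_pred = h_relu.dot(w2)
grad_w2 = h_relu.T.dot(2.0 * (y_pred - y))

# Central finite differences for every entry of w2
eps = 1e-6
grad_fd = np.zeros_like(w2)
for i in range(H):
    for j in range(D_out):
        w2_plus, w2_minus = w2.copy(), w2.copy()
        w2_plus[i, j] += eps
        w2_minus[i, j] -= eps
        grad_fd[i, j] = (loss_fn(w2_plus) - loss_fn(w2_minus)) / (2 * eps)

print(np.max(np.abs(grad_w2 - grad_fd)))  # should be tiny, on the order of 1e-8 or smaller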