Multi-layer neural network back-propagation formula (using stochastic gradient descent)
Using the notation of Backpropagation calculus | Deep learning, chapter 4, I have the following back-propagation code for a 4-layer (i.e. 2 hidden layers) neural network:
def sigmoid_prime(z):
    return z * (1-z)  # because σ'(x) = σ(x) (1 - σ(x))

def train(self, input_vector, target_vector):
    a = np.array(input_vector, ndmin=2).T
    y = np.array(target_vector, ndmin=2).T

    # forward
    A = [a]
    for k in range(3):
        a = sigmoid(np.dot(self.weights[k], a))  # zero bias here just for simplicity
        A.append(a)
    # Now A has 4 elements: the input vector + the 3 output vectors

    # back-propagation
    delta = a - y
    for k in [2, 1, 0]:
        tmp = delta * sigmoid_prime(A[k+1])
        delta = np.dot(self.weights[k].T, tmp)  # (1) <---- HERE
        self.weights[k] -= self.learning_rate * np.dot(tmp, A[k].T)
It works, but:

the final accuracy (for my use case: MNIST digit recognition) is OK, but not great.

It is much better (i.e. the convergence is much better) when line (1) is replaced by:

delta = np.dot(self.weights[k].T, delta)  # (2)
The code from Machine Learning with Python: Training and Testing the Neural Network with MNIST data set also suggests:

delta = np.dot(self.weights[k].T, delta)

instead of:

delta = np.dot(self.weights[k].T, tmp)

(With that article's notation, this is:

output_errors = np.dot(self.weights_matrices[layer_index-1].T, output_errors)

)
These two arguments seem to agree: code (2) is better than code (1).

However, the math seems to show the contrary (see the video here; one more detail: note that my loss function is multiplied by 1/2, whereas it is not in the video).

Question: which implementation is correct, (1) or (2)?

In LaTeX:
$$C = \frac{1}{2} (a^L - y)^2$$
$$a^L = \sigma(\underbrace{w^L a^{L-1} + b^L}_{z^L}) = \sigma(z^L)$$
$$\frac{\partial{C}}{\partial{w^L}} = \frac{\partial{z^L}}{\partial{w^L}} \frac{\partial{a^L}}{\partial{z^L}} \frac{\partial{C}}{\partial{a^L}}=a^{L-1} \sigma'(z^L)(a^L-y)$$
$$\frac{\partial{C}}{\partial{a^{L-1}}} = \frac{\partial{z^L}}{\partial{a^{L-1}}} \frac{\partial{a^L}}{\partial{z^L}} \frac{\partial{C}}{\partial{a^L}}=w^L \sigma'(z^L)(a^L-y)$$
$$\frac{\partial{C}}{\partial{w^{L-1}}} = \frac{\partial{z^{L-1}}}{\partial{w^{L-1}}} \frac{\partial{a^{L-1}}}{\partial{z^{L-1}}} \frac{\partial{C}}{\partial{a^{L-1}}}=a^{L-2} \sigma'(z^{L-1}) \times w^L \sigma'(z^L)(a^L-y)$$
I spent two days analyzing this problem, and I filled a few pages of a notebook with partial-derivative computations... and I can confirm:

- the math written in LaTeX in the question is correct
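For reference, here is the same chain rule written in the usual vectorized form (this is the standard textbook statement, not something taken from the question, with $\odot$ denoting the element-wise product and $\delta^l := \partial C / \partial z^l$):

$$\delta^L = (a^L - y) \odot \sigma'(z^L)$$
$$\delta^{l} = \left( (w^{l+1})^T \delta^{l+1} \right) \odot \sigma'(z^{l})$$
$$\frac{\partial C}{\partial w^{l}} = \delta^{l} \, (a^{l-1})^T$$

In the loop of code (1) below, tmp plays the role of $\delta^{l}$ for the current layer, the delta carried to the next iteration is $(w^{l})^T \delta^{l} = \partial C / \partial a^{l-1}$ (which gets multiplied by $\sigma'(z^{l-1})$ at the next iteration), and the weight update uses $\delta^{l} (a^{l-1})^T$.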
Code (1) is the correct one, and it agrees with the math computations:
delta = a - y
for k in [2, 1, 0]:
    tmp = delta * sigmoid_prime(A[k+1])
    delta = np.dot(self.weights[k].T, tmp)
    self.weights[k] -= self.learning_rate * np.dot(tmp, A[k].T)
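One way to verify this without pages of hand computation is a numerical gradient check. Below is a minimal, self-contained sketch (random data, assumed layer sizes, standalone functions instead of the author's class; all names here are illustrative) that compares the gradients produced by the code-(1) backward pass against central finite differences of $C = \frac{1}{2}\lVert a^L - y \rVert^2$; the reported differences should come out around 1e-9 or smaller. Swapping in the update from code (2) should leave the top-layer gradient unchanged but make the check fail for the two lower weight matrices.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(a):
    return a * (1.0 - a)  # takes an activation a = sigma(z), as in the question's code

sizes = [4, 5, 5, 3]  # small 4-layer network: input, 2 hidden layers, output (assumed sizes)
weights = [rng.standard_normal((sizes[i+1], sizes[i])) for i in range(3)]
x = rng.standard_normal((sizes[0], 1))
y = rng.standard_normal((sizes[-1], 1))

def cost(ws):
    a = x
    for W in ws:
        a = sigmoid(np.dot(W, a))
    return 0.5 * np.sum((a - y) ** 2)

# forward pass, keeping every activation (same structure as the question's train())
a = x
A = [a]
for k in range(3):
    a = sigmoid(np.dot(weights[k], a))
    A.append(a)

# backward pass exactly as in code (1), but storing the gradients instead of updating
grads = [None, None, None]
delta = a - y
for k in [2, 1, 0]:
    tmp = delta * sigmoid_prime(A[k+1])
    delta = np.dot(weights[k].T, tmp)
    grads[k] = np.dot(tmp, A[k].T)

# central finite differences, one weight at a time
eps = 1e-6
for k in range(3):
    num = np.zeros_like(weights[k])
    for idx in np.ndindex(weights[k].shape):
        w_plus = [W.copy() for W in weights]
        w_minus = [W.copy() for W in weights]
        w_plus[k][idx] += eps
        w_minus[k][idx] -= eps
        num[idx] = (cost(w_plus) - cost(w_minus)) / (2 * eps)
    print(k, np.max(np.abs(num - grads[k])))  # max absolute difference per weight matrix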
Code (2) is wrong:
delta = a - y
for k in [2, 1, 0]:
    tmp = delta * sigmoid_prime(A[k+1])
    delta = np.dot(self.weights[k].T, delta)  # WRONG HERE
    self.weights[k] -= self.learning_rate * np.dot(tmp, A[k].T)
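To spell out what the swap changes (this restatement is mine, using the vectorized notation above): code (2) propagates the raw error through the transposed weights without ever applying the $\sigma'$ factor of the current layer,

$$\tilde{\delta}^{\,l-1} = (w^{l})^T \tilde{\delta}^{\,l}, \qquad \tilde{\delta}^{\,L} = a^L - y,$$

whereas the correct recursion is

$$\frac{\partial C}{\partial a^{l-1}} = (w^{l})^T \left( \frac{\partial C}{\partial a^{l}} \odot \sigma'(z^{l}) \right).$$

Since $0 < \sigma'(z) \le 1/4$ for the sigmoid, dropping that factor makes the propagated error, and therefore the lower-layer weight updates, systematically larger; that is exactly the effect analyzed below.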
There is a small mistake in Machine Learning with Python: Training and Testing the Neural Network with MNIST data set:

output_errors = np.dot(self.weights_matrices[layer_index-1].T, output_errors)

should be:

output_errors = np.dot(self.weights_matrices[layer_index-1].T, output_errors * out_vector * (1.0 - out_vector))
Now, the difficult part, which took me days to realize:

Apparently code (2) converges much better than code (1), and that is why I mistakenly thought code (2) was correct and code (1) was wrong...

...but in fact this is just a coincidence, because the learning_rate was set too low. Here is the reason: when using code (2), the parameter delta grows much faster than with code (1) (print np.linalg.norm(delta) helps to see this).

Thus the "incorrect code (2)" simply compensates for the "too slow learning rate" by producing a bigger delta, and in some cases this leads to an apparently faster convergence.
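To make that observation easy to reproduce, here is a minimal sketch (random weights and data, assumed MNIST-like layer sizes, not the author's class) that runs a single backward pass with both update rules and prints the norm of delta at every layer; the norms from rule (2) typically come out noticeably larger, since the $\sigma'(z) \le 1/4$ damping is never applied to the propagated error:

import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

sizes = [784, 30, 30, 10]  # MNIST-like 4-layer network (assumed sizes)
weights = [0.1 * rng.standard_normal((sizes[i+1], sizes[i])) for i in range(3)]
x = rng.random((sizes[0], 1))
y = np.zeros((sizes[-1], 1))
y[3] = 1.0  # arbitrary one-hot target

# forward pass
a = x
A = [a]
for k in range(3):
    a = sigmoid(np.dot(weights[k], a))
    A.append(a)

# one backward pass per rule, printing the norm of the propagated error
for rule in (1, 2):
    delta = A[3] - y
    for k in [2, 1, 0]:
        tmp = delta * A[k+1] * (1.0 - A[k+1])    # delta * sigmoid_prime(A[k+1])
        if rule == 1:
            delta = np.dot(weights[k].T, tmp)    # correct: propagate tmp
        else:
            delta = np.dot(weights[k].T, delta)  # rule (2): propagate the raw delta
        print("rule %d, layer %d: ||delta|| = %.4f" % (rule, k, np.linalg.norm(delta)))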
Now solved!