XOR neural network backprop
I'm trying to implement a basic XOR NN with 1 hidden layer in Python. I don't understand the backpropagation algorithm particularly well, so I'm stuck on getting delta2 and updating the weights... help?
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

vec_sigmoid = np.vectorize(sigmoid)

theta1 = np.matrix(np.random.rand(3,3))
theta2 = np.matrix(np.random.rand(3,1))

def fit(x, y, theta1, theta2, learn_rate=.001):
    #forward pass
    layer1 = np.matrix(x, dtype='f')
    layer1 = np.c_[np.ones(1), layer1]
    layer2 = vec_sigmoid(layer1*theta1)
    layer3 = sigmoid(layer2*theta2)

    #backprop
    delta3 = y - layer3
    delta2 = (theta2*delta3) * np.multiply(layer2, 1 - layer2)  #??

    #update weights
    theta2 += learn_rate * delta3  #??
    theta1 += learn_rate * delta2  #??

def train(X, Y):
    for _ in range(10000):
        for i in range(4):
            x = X[i]
            y = Y[i]
            fit(x, y, theta1, theta2)

X = [(0,0), (1,0), (0,1), (1,1)]
Y = [0, 1, 1, 0]

train(X, Y)
OK, so, first, here is the modified code that gets your code working.
#! /usr/bin/python

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

vec_sigmoid = np.vectorize(sigmoid)

# Binesh - just cleaning it up, so you can easily change the number of hiddens.
# Also, initializing with a heuristic from Yoshua Bengio.
# In many places you were using matrix multiplication and elementwise multiplication
# interchangeably... You can't do that.. (So I explicitly changed everything to be
# dot products and multiplies so it's clear.)
input_sz = 2
hidden_sz = 3
output_sz = 1
theta1 = np.matrix(0.5 * np.sqrt(6.0 / (input_sz+hidden_sz)) * (np.random.rand(1+input_sz,hidden_sz)-0.5))
theta2 = np.matrix(0.5 * np.sqrt(6.0 / (hidden_sz+output_sz)) * (np.random.rand(1+hidden_sz,output_sz)-0.5))

def fit(x, y, theta1, theta2, learn_rate=.1):
    #forward pass
    layer1 = np.matrix(x, dtype='f')
    layer1 = np.c_[np.ones(1), layer1]
    # Binesh - for layer2 we need to add a bias term.
    layer2 = np.c_[np.ones(1), vec_sigmoid(layer1.dot(theta1))]
    layer3 = sigmoid(layer2.dot(theta2))

    #backprop
    delta3 = y - layer3
    # Binesh - In reality, this is the _negative_ derivative of the cross entropy function
    # wrt the _input_ to the final sigmoid function.
    delta2 = np.multiply(delta3.dot(theta2.T), np.multiply(layer2, (1-layer2)))
    # Binesh - We actually don't use the delta for the bias term. (What would be the point?
    # It has no inputs.) Hence the line below.
    delta2 = delta2[:,1:]

    # But, delta's are just derivatives wrt the inputs to the sigmoid.
    # We don't add those to theta directly. We have to multiply these by
    # the preceding layer to get the theta2d's and theta1d's
    theta2d = np.dot(layer2.T, delta3)
    theta1d = np.dot(layer1.T, delta2)

    #update weights
    # Binesh - here you had delta3 and delta2... Those are not the
    # derivatives wrt the theta's, they are the derivatives wrt
    # the inputs to the sigmoids.. (As I mention above)
    theta2 += learn_rate * theta2d
    theta1 += learn_rate * theta1d

def train(X, Y):
    for _ in range(10000):
        for i in range(4):
            x = X[i]
            y = Y[i]
            fit(x, y, theta1, theta2)

# Binesh - Here's a little test function to see that it actually works
def test(X):
    for i in range(4):
        layer1 = np.matrix(X[i], dtype='f')
        layer1 = np.c_[np.ones(1), layer1]
        layer2 = np.c_[np.ones(1), vec_sigmoid(layer1.dot(theta1))]
        layer3 = sigmoid(layer2.dot(theta2))
        print "%d xor %d = %.7f" % (layer1[0,1], layer1[0,2], layer3[0,0])

X = [(0,0), (1,0), (0,1), (1,1)]
Y = [0, 1, 1, 0]

train(X, Y)

# Binesh - Alright, let's see!
test(X)
And now, for some explanation. Forgive the crude drawing; it was just easier to take a picture than to draw something in gimp.
(source: binesh at cablemodem.hex21.com)
So. First, we have the error function. We'll call it CE (for cross entropy). I'll use your variables where I can, though I'll write L1, L2 and L3 instead of layer1, layer2 and layer3. Sigh (I don't know how to do latex here; it seems to work on the statistics stack exchange, which is odd).
CE = -(Y log(L3) + (1-Y) log(1-L3))
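(If it helps to see that as code, here's a tiny sketch of the error for a single example. The name cross_entropy is mine; it isn't anywhere in your code.)

import numpy as np

def cross_entropy(y, l3):
    # CE = -(Y log(L3) + (1-Y) log(1-L3)) for a single scalar prediction l3 in (0,1)
    return -(y * np.log(l3) + (1 - y) * np.log(1 - l3))

print cross_entropy(1, 0.9)   # small loss: the prediction is close to the label
print cross_entropy(1, 0.1)   # large loss: the prediction is far from the label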
We need the derivative of this with respect to L3, so we can see how to move L3 in order to decrease this value.
dCE/dL3 = -((Y/L3) - (1-Y)/(1-L3))
= -((Y(1-L3) - (1-Y)L3) / (L3(1-L3)))
= -(((Y-Y*L3) - (L3-Y*L3)) / (L3(1-L3)))
= -((Y - Y*L3 + Y*L3 - L3) / (L3(1-L3)))
= -((Y-L3) / (L3(1-L3)))
= ((L3-Y) / (L3(1-L3)))
Fine, but we can't actually change L3 however we like. L3 is a function of Z3 (see my diagram).
L3 = sigmoid(Z3)
dL3/dZ3 = L3(1-L3)
I'm not going to derive this here (the derivative of the sigmoid), but it's really not hard to prove.
Anyway, that's the derivative of L3 with respect to Z3, but what we want is the derivative of CE with respect to Z3.
dCE/dZ3 = (dCE/dL3) * (dL3/dZ3)
= ((L3-Y)/(L3(1-L3))) * (L3(1-L3)) # Hey, look at that. The denominator gets cancelled out and
= (L3-Y) # This is why in my comments I was saying what you are computing is the _negative_ derivative.
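(If you want to convince yourself numerically, here's a quick finite-difference check of that result. It's just my own sanity check, not part of the network; it reuses sigmoid from your code and the cross_entropy sketch above.)

z3 = 0.7    # some arbitrary input to the final sigmoid
y = 1.0     # the label
eps = 1e-6

l3 = sigmoid(z3)
numeric = (cross_entropy(y, sigmoid(z3 + eps)) - cross_entropy(y, sigmoid(z3 - eps))) / (2 * eps)
analytic = l3 - y

print numeric, analytic   # the two agree, up to finite-difference error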
We call these derivatives with respect to the Z's the "deltas". So, in your code, this corresponds to delta3.
Fine, but we can't change Z3 as we please either. We need its derivative with respect to L2.
This one is a little more involved.
Z3 = theta2(0) + theta2(1) * L2(1) + theta2(2) * L2(2) + theta2(3) * L2(3)
So, we need the partial derivatives with respect to L2(1), L2(2) and L2(3).
dZ3/dL2(1) = theta2(1)
dZ3/dL2(2) = theta2(2)
dZ3/dL2(3) = theta2(3)
Note that for the bias it would actually be
dZ3/dBias = theta2(0)
But the bias never changes, it's always 1, so we can safely ignore it. Still, our layer2 includes the bias, so we'll keep it around for now.
But, again, what we really want are the derivatives with respect to Z2(0), Z2(1) and Z2(2) (which is how I labeled them on the diagram, I think).
dL2(1)/dZ2(0) = L2(1) * (1-L2(1))
dL2(2)/dZ2(1) = L2(2) * (1-L2(2))
dL2(3)/dZ2(2) = L2(3) * (1-L2(3))
So now, what is dCE/dZ2(0..2)?
dCE/dZ2(0) = dCE/dZ3 * dZ3/dL2(1) * dL2(1)/dZ2(0)
= (L3-Y) * theta2(1) * L2(1) * (1-L2(1))
dCE/dZ2(1) = dCE/dZ3 * dZ3/dL2(2) * dL2(2)/dZ2(1)
= (L3-Y) * theta2(2) * L2(2) * (1-L2(2))
dCE/dZ2(2) = dCE/dZ3 * dZ3/dL2(3) * dL2(3)/dZ2(2)
= (L3-Y) * theta2(3) * L2(3) * (1-L2(3))
But really, we can express all of that as (delta3 * Transpose[theta2]) elementwise multiplied by (L2 * (1-L2)), where L2 is the vector.
These are our delta2's. I drop the first entry, because as I mentioned above, it corresponds to the delta for the bias (which I labeled L2(0) in my diagram).
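(Purely as a sanity check, here's the element-by-element version next to the vectorized one, with made-up numbers. Neither snippet is part of the network code; note the sign, since your delta3 is Y - L3 rather than L3 - Y.)

import numpy as np

Y, L3 = 1.0, 0.8
delta3 = np.matrix([[Y - L3]])                    # (1,1), as computed in fit()
theta2 = np.matrix([[0.1], [0.2], [0.3], [0.4]])  # bias weight + 3 hidden weights (made up)
layer2 = np.matrix([[1.0, 0.6, 0.7, 0.8]])        # bias unit + 3 hidden activations (made up)

# Vectorized, exactly as in fit():
delta2 = np.multiply(delta3.dot(theta2.T), np.multiply(layer2, 1 - layer2))[:, 1:]

# Element by element, following the derivation above (with the Y - L3 sign):
for j in range(3):
    by_hand = (Y - L3) * theta2[j + 1, 0] * layer2[0, j + 1] * (1 - layer2[0, j + 1])
    print delta2[0, j], by_hand   # the two agree, up to floating point rounding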
So. Now we have the derivatives with respect to the Z's, but really, the only things we can actually modify are our thetas.
Z3 = theta2(0) + theta2(1) * L2(1) + theta2(2) * L2(2) + theta2(3) * L2(3)
dZ3/dtheta2(0) = 1
dZ3/dtheta2(1) = L2(1)
dZ3/dtheta2(2) = L2(2)
dZ3/dtheta2(3) = L2(3)
Once again, what we want is dCE/dtheta2(0..3), so that becomes
dCE/dtheta2(0) = dCE/dZ3 * dZ3/dtheta2(0)
= (L3-Y) * 1
dCE/dtheta2(1) = dCE/dZ3 * dZ3/dtheta2(1)
= (L3-Y) * L2(1)
dCE/dtheta2(2) = dCE/dZ3 * dZ3/dtheta2(2)
= (L3-Y) * L2(2)
dCE/dtheta2(3) = dCE/dZ3 * dZ3/dtheta2(3)
= (L3-Y) * L2(3)
Well, that's just np.dot(layer2.T, delta3), and that's what I have in theta2d.
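(Quick shape check with made-up numbers: layer2.T is (4,1) and delta3 is (1,1), so the dot product is a (4,1) column that lines up entry-for-entry with theta2.)

import numpy as np

delta3 = np.matrix([[0.2]])                 # (1,1): this is (Y - L3) in your code
layer2 = np.matrix([[1.0, 0.6, 0.7, 0.8]])  # (1,4): bias + 3 hidden activations (made up)

theta2d = np.dot(layer2.T, delta3)
print theta2d.shape   # (4, 1), the same shape as theta2
print theta2d.T       # each entry is delta3 times the matching layer2 activation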
Similarly:
Z2(0) = theta1(0,0) + theta1(1,0) * L1(1) + theta1(2,0) * L1(2)
dZ2(0)/dtheta1(0,0) = 1
dZ2(0)/dtheta1(1,0) = L1(1)
dZ2(0)/dtheta1(2,0) = L1(2)
Z2(1) = theta1(0,1) + theta1(1,1) * L1(1) + theta1(2,1) * L1(2)
dZ2(1)/dtheta1(0,1) = 1
dZ2(1)/dtheta1(1,1) = L1(1)
dZ2(1)/dtheta1(2,1) = L1(2)
Z2(2) = theta1(0,2) + theta1(1,2) * L1(1) + theta1(2,2) * L1(2)
dZ2(2)/dtheta1(0,2) = 1
dZ2(2)/dtheta1(1,2) = L1(1)
dZ2(2)/dtheta1(2,2) = L1(2)
And we have to multiply each of the three groups above by dCE/dZ2(0), dCE/dZ2(1) and dCE/dZ2(2) respectively. But if you think about it, that just becomes np.dot(layer1.T, delta2), and that's what I have in theta1d.
Now, because you compute Y - L3 in your code, you end up _adding_ to theta1 and theta2... But here's the reasoning. What we just computed above are the derivatives of CE with respect to the weights. That means increasing the weights will _increase_ CE. But we really want to _decrease_ CE... so, normally, we subtract. But because your code computes the negative derivative, adding is actually correct.
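(In other words, these two ways of writing the theta2 update land on exactly the same weights; the first is how gradient descent is usually written, the second is what your code does. Made-up numbers again, just to illustrate.)

import numpy as np

learn_rate = 0.1
y = 1.0
layer3 = np.matrix([[0.8]])
layer2 = np.matrix([[1.0, 0.6, 0.7, 0.8]])
theta2 = np.matrix([[0.1], [0.2], [0.3], [0.4]])

# Usual form: subtract the true gradient dCE/dtheta2.
updated_a = theta2 - learn_rate * np.dot(layer2.T, (layer3 - y))

# Your form: delta3 = y - layer3 is the negative gradient, so we add it.
delta3 = y - layer3
updated_b = theta2 + learn_rate * np.dot(layer2.T, delta3)

print np.allclose(updated_a, updated_b)   # True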
Does that make sense?