Implementing backpropagation gradient descent using scipy.optimize.minimize

Question

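I am trying to train an autoencoder neural network (3 layers: 2 visible, 1 hidden) on the MNIST digit image dataset using numpy and scipy. The implementation follows the notation given here. Below is my code: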

import numpy

# sigmoid() is assumed to be defined elsewhere in the same module.
def autoencoder_cost_and_grad(theta, visible_size, hidden_size, lambda_, data):
  """
  The input theta is a 1-dimensional array because scipy.optimize.minimize expects
  the parameters being optimized to be a 1d array.
  First convert theta from a 1d array to the (W1, W2, b1, b2)
  matrix/vector format, so that this follows the notation convention of the
  lecture notes and tutorial.
  You must compute the:
      cost : scalar representing the overall cost J(theta)
      grad : array representing the corresponding gradient of each element of theta
  """

  training_size = data.shape[1]
  # unroll theta to get (W1,W2,b1,b2) #
  W1 = theta[0:hidden_size*visible_size]
  W1 = W1.reshape(hidden_size,visible_size)

  W2 = theta[hidden_size*visible_size:2*hidden_size*visible_size]
  W2 = W2.reshape(visible_size,hidden_size)

  b1 = theta[2*hidden_size*visible_size:2*hidden_size*visible_size + hidden_size]
  b2 = theta[2*hidden_size*visible_size + hidden_size: 2*hidden_size*visible_size + hidden_size + visible_size]

  #feedforward pass
  a_l1 = data

  z_l2 = W1.dot(a_l1) + numpy.tile(b1,(training_size,1)).T
  a_l2 = sigmoid(z_l2)

  z_l3 = W2.dot(a_l2) + numpy.tile(b2,(training_size,1)).T
  a_l3 = sigmoid(z_l3)

  #backprop
  delta_l3 = numpy.multiply(-(data-a_l3),numpy.multiply(a_l3,1-a_l3))
  delta_l2 = numpy.multiply(W2.T.dot(delta_l3),
                             numpy.multiply(a_l2, 1 - a_l2))

  b2_derivative = numpy.sum(delta_l3,axis=1)/training_size
  b1_derivative = numpy.sum(delta_l2,axis=1)/training_size

  W2_derivative = numpy.dot(delta_l3,a_l2.T)/training_size + lambda_*W2
  #print(W2_derivative.shape)
  W1_derivative = numpy.dot(delta_l2,a_l1.T)/training_size + lambda_*W1

  W1_derivative = W1_derivative.reshape(hidden_size*visible_size)
  W2_derivative = W2_derivative.reshape(visible_size*hidden_size)
  b1_derivative = b1_derivative.reshape(hidden_size)
  b2_derivative = b2_derivative.reshape(visible_size)


  grad = numpy.concatenate((W1_derivative,W2_derivative,b1_derivative,b2_derivative))
  cost = 0.5*numpy.sum((data-a_l3)**2)/training_size + 0.5*lambda_*(numpy.sum(W1**2) + numpy.sum(W2**2))
  return cost,grad
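
The unrolling step at the top of the function can be isolated into a pair of helpers, which makes the layout of theta easy to check on its own. This is only a sketch: the names unpack_params and pack_params are hypothetical and not part of the code above, and the sizes used in the round trip are arbitrary. The layout (W1, W2, b1, b2) is the same as in the function.

```python
import numpy


def unpack_params(theta, visible_size, hidden_size):
    """Split a flat parameter vector into (W1, W2, b1, b2)."""
    n_w = hidden_size * visible_size
    W1 = theta[:n_w].reshape(hidden_size, visible_size)
    W2 = theta[n_w:2 * n_w].reshape(visible_size, hidden_size)
    b1 = theta[2 * n_w:2 * n_w + hidden_size]
    b2 = theta[2 * n_w + hidden_size:2 * n_w + hidden_size + visible_size]
    return W1, W2, b1, b2


def pack_params(W1, W2, b1, b2):
    """Flatten (W1, W2, b1, b2) back into one 1-D array, in the same order."""
    return numpy.concatenate((W1.ravel(), W2.ravel(), b1.ravel(), b2.ravel()))


# Round trip on small, arbitrary sizes.
visible_size, hidden_size = 4, 3
n_params = 2 * visible_size * hidden_size + hidden_size + visible_size
theta = numpy.arange(n_params, dtype=float)

W1, W2, b1, b2 = unpack_params(theta, visible_size, hidden_size)
restored = pack_params(W1, W2, b1, b2)
```

If the gradient is packed with the same helper as the parameters, the two stay in the same order by construction.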

I have also implemented a function that estimates the gradient numerically, to verify the correctness of my implementation (below).

def compute_gradient_numerical_estimate(J, theta, epsilon=0.0001):
  """
  :param J: a loss (cost) function that computes the real-valued loss given parameters and data
  :param theta: array of parameters
  :param epsilon: amount to vary each parameter in order to estimate
                  the gradient by numerical difference
  :return: array of numerical gradient estimate
  """

  gradient = numpy.zeros(theta.shape)

  eps_vector = numpy.zeros(theta.shape)
  for i in range(0,theta.size):

      eps_vector[i] = epsilon
      cost1,grad1 = J(theta+eps_vector)
      cost2,grad2 = J(theta-eps_vector)
      gradient[i] = (cost1 - cost2)/(2*epsilon)
      eps_vector[i] = 0


  return gradient
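
A checker like this can be tried first on a function whose gradient is known exactly. The following is a minimal sketch under that assumption: numerical_gradient mirrors the central-difference loop above, and quadratic_cost_and_grad is a hypothetical toy function, not part of my autoencoder code.

```python
import numpy


def numerical_gradient(J, theta, epsilon=1e-4):
    """Central-difference estimate; J must return a (cost, grad) tuple."""
    gradient = numpy.zeros(theta.shape)
    for i in range(theta.size):
        step = numpy.zeros(theta.shape)
        step[i] = epsilon
        cost_plus, _ = J(theta + step)
        cost_minus, _ = J(theta - step)
        gradient[i] = (cost_plus - cost_minus) / (2 * epsilon)
    return gradient


def quadratic_cost_and_grad(x):
    """Toy cost 0.5 * ||x||^2, whose exact gradient is x."""
    return 0.5 * numpy.sum(x ** 2), x.copy()


theta0 = numpy.array([1.0, -2.0, 0.5])
_, analytic = quadratic_cost_and_grad(theta0)
numeric = numerical_gradient(quadratic_cost_and_grad, theta0)

# Absolute and relative disagreement between the two gradients.
diff_norm = numpy.linalg.norm(numeric - analytic)
relative_error = diff_norm / numpy.linalg.norm(numeric + analytic)
```

Note that the check only exercises the callable passed in as J, so it is most informative when that is the same callable later handed to the optimizer.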

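The norm of the difference between the numerical estimate and the gradient computed by the function is about 6.87165125021e-09, which seems acceptable. My main problem seems to be getting the "L-BFGS-B" optimizer to work with scipy.optimize.minimize, as shown below: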

# theta is the 1-D array of(W1,W2,b1,b2)
J = lambda x: utils.autoencoder_cost_and_grad(theta, visible_size, hidden_size, lambda_, patches_train)
options_ = {'maxiter': 4000, 'disp': False}
result = scipy.optimize.minimize(J, theta, method='L-BFGS-B', jac=True, options=options_)

I get the following output from this:

scipy.optimize.minimize() details:
  fun: 90.802022224079778
 hess_inv: <16474x16474 LbfgsInvHessProduct with dtype=float64>
  jac: array([ -6.83667742e-06,  -2.74886002e-06,  -3.23531941e-06, ...,
     1.22425735e-01,   1.23425062e-01,   1.28091250e-01])
message: b'ABNORMAL_TERMINATION_IN_LNSRCH'
 nfev: 21
  nit: 0
 status: 2
success: False
    x: array([-0.06836677, -0.0274886 , -0.03235319, ...,  0.        ,
    0.        ,  0.        ])

Now, this seems to indicate that the gradient function implementation could be wrong. But my numerical gradient estimate seems to confirm that my implementation is correct. I have tried varying the initial weights by using a uniform distribution as specified here, but the problem persists. Is there anything wrong with my backprop implementation?

Answer 1

It turns out the problem was a (very silly) mistake in this line:

J = lambda x: utils.autoencoder_cost_and_grad(theta, visible_size, hidden_size, lambda_, patches_train)

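I never used the lambda's parameter x in the call. The lambda closed over the initial theta instead, so every time the optimizer invoked J it got back the same cost and gradient, no matter which point it was trying. With a non-zero gradient and a cost that never decreases, the line search cannot make progress, which is consistent with the ABNORMAL_TERMINATION_IN_LNSRCH message and nit: 0 in the output above.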

Fixed:

J = lambda x: utils.autoencoder_cost_and_grad(x, visible_size, hidden_size, lambda_, patches_train)
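
The effect is easy to reproduce on a toy problem. The sketch below is illustrative only: cost_and_grad and target are hypothetical stand-ins for the autoencoder cost and its data. It also shows the args parameter of scipy.optimize.minimize, which passes the extra arguments through without a lambda and so avoids this kind of slip.

```python
import numpy
import scipy.optimize


def cost_and_grad(x, target):
    """Toy cost 0.5 * ||x - target||^2 and its gradient."""
    diff = x - target
    return 0.5 * numpy.sum(diff ** 2), diff


target = numpy.array([1.0, -2.0, 3.0])
theta = numpy.zeros(3)

# Buggy wrapper: ignores x, so the optimizer always sees the same
# cost and gradient (those at the initial theta).
J_bad = lambda x: cost_and_grad(theta, target)
bad = scipy.optimize.minimize(J_bad, theta, method='L-BFGS-B', jac=True)

# Fixed wrapper: forwards the point chosen by the optimizer.
J_good = lambda x: cost_and_grad(x, target)
good = scipy.optimize.minimize(J_good, theta, method='L-BFGS-B', jac=True)

# Alternative without a lambda: extra arguments go through args.
via_args = scipy.optimize.minimize(cost_and_grad, theta, args=(target,),
                                   method='L-BFGS-B', jac=True)

print(bad.success, bad.message)
print(good.success, good.x)
```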

Tags: numpy, scipy, backpropagation, neural-network