Gradient Checking works for binary but fails for multi class

Question

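I have built a logistic regression model for binary classification on the Iris dataset (using only two labels). The model performs well on all metrics and also passes the gradient check given by Andrew Ng. But when I change the output activation from "Sigmoid" to "Softmax" and adapt it for multi-class classification, the model fails the gradient check even though the performance metrics are pretty good.

The same pattern holds for deep neural networks: my numpy implementation passes the gradient check for binary classification but fails it for multi class.

Logistic regression (binary):

I chose a row-major layout for my features (number of rows, number of columns) rather than a column-major one, just to make understanding and debugging more intuitive.

Dimensions: X = (100, 4); Weights = (4, 1); y = (100, 1) (before the bias column is added; in the code below X_b is (100, 5) and Weights is (5, 1))

Algorithm implementation code (binary):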

import numpy as np

from sklearn.datasets import load_iris
from sklearn.metrics import log_loss


def sigmoid(x):
    # Logistic function, applied elementwise.
    return np.exp(x) / (1 + np.exp(x))


dataset = load_iris()

X = dataset.data
y = dataset.target

# Keep only the first 100 samples (classes 0 and 1) and shuffle
# features and labels together.
data = np.concatenate((X[:100], y[:100].reshape(-1, 1)), axis=1)
np.random.shuffle(data)

X_train = data[:, :-1]
# Prepend a column of ones so the bias is learned as part of Weights.
X_b = np.c_[np.ones((X_train.shape[0], 1)), X_train]

y_train = data[:, -1].reshape(-1, 1)

num_unique_labels = len(np.unique(y_train))

# A single output column for binary classification.
Weights = np.random.randn(X_train.shape[1] + 1, num_unique_labels - 1) * np.sqrt(1. / (X_train.shape[1] + 1))

m = X_b.shape[0]

yhat = sigmoid(np.dot(X_b, Weights))
loss = log_loss(y_train, yhat)

error = yhat - y_train

# Analytic gradient of the mean log loss with respect to Weights.
gradient = (1. / m) * X_b.T.dot(error)

Gradient check (binary):

grad = gradient.reshape(-1, 1)
Weights_delta = Weights.reshape(-1, 1)
num_params = Weights_delta.shape[0]

JP = np.zeros((num_params, 1))
JM = np.zeros((num_params, 1))
J_app = np.zeros((num_params, 1))

ep = float(1e-7)

for i in range(num_params):
    # Loss with parameter i nudged up by ep.
    Weights_add = np.copy(Weights_delta)
    Weights_add[i] = Weights_add[i] + ep
    Z_add = sigmoid(np.dot(X_b, Weights_add.reshape(X_train.shape[1] + 1, num_unique_labels - 1)))
    JP[i] = log_loss(y_train, Z_add)

    # Loss with parameter i nudged down by ep.
    Weights_sub = np.copy(Weights_delta)
    Weights_sub[i] = Weights_sub[i] - ep
    Z_sub = sigmoid(np.dot(X_b, Weights_sub.reshape(X_train.shape[1] + 1, num_unique_labels - 1)))
    JM[i] = log_loss(y_train, Z_sub)

    # Central-difference approximation of the partial derivative.
    J_app[i] = (JP[i] - JM[i]) / (2 * ep)

# Relative difference between the analytic and the numerical gradient.
num = np.linalg.norm(grad - J_app)
denom = np.linalg.norm(grad) + np.linalg.norm(J_app)

num / denom

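This yields a value (num/denom) of 8.244172628899919e-10, which confirms that the gradient computation is correct. For the multi_class version I used the same gradient computation as above but changed the output activation to Softmax (also taken from scipy), with axis = 1 so that the class probabilities are computed per sample, since mine is a row-major implementation.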
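To illustrate what axis = 1 does in a row-major layout, here is a minimal sketch of a row-wise softmax. It is a hand-written stand-in, not scipy's implementation; the name softmax_rows and the score values are made up for this example.

```python
import numpy as np

def softmax_rows(z):
    """Row-wise softmax for a (samples, classes) array of scores."""
    # Subtracting the row maximum leaves the result unchanged
    # but keeps np.exp from overflowing on large scores.
    shifted = z - z.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Hypothetical scores: 2 samples, 3 classes.
scores = np.array([[1.0, 2.0, 3.0],
                   [1000.0, 1000.0, 1000.0]])
probs = softmax_rows(scores)

row_sums = probs.sum(axis=1)      # one total per sample, each equal to 1 up to rounding
predicted = probs.argmax(axis=1)  # index of the most probable class per sample
```

With axis = 0 the normalization would run down each column instead, mixing different samples together, which is not what a (samples, classes) layout wants.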

Algorithm implementation code (multi_class):

*Dimensions: X = (150, 4) ; Weights = (4,3) ; y = (150, 3)*

import numpy as np

from sklearn.datasets import load_iris, load_digits
from sklearn.preprocessing import LabelBinarizer
from keras.losses import CategoricalCrossentropy
from scipy.special import softmax

CCE = CategoricalCrossentropy()


dataset = load_iris()
lb = LabelBinarizer()


X = dataset.data
y = dataset.target

lb.fit(y)

data = np.concatenate((X,y.reshape(-1,1)), axis = 1)
np.random.shuffle(data)

X_train = data[:, :-1]
X_b = np.c_[np.ones((X_train.shape[0] , 1)), X_train]


y_train = lb.transform(data[:, -1]).reshape(-1,3)


num_unique_labels = len( np.unique(y) )


Weights = np.random.randn(X_train.shape[1]+1, num_unique_labels) * np.sqrt(1./ (X_train.shape[1]+1)  )




m = X_b.shape[0]

yhat = softmax( np.dot(X_b, Weights), axis = 1)
cce_loss = CCE(y_train, yhat).numpy()

error = yhat - y_train

gradient = (1./m) * ( X_b.T.dot(error)  )

Gradient check (multi_class):

grad = gradient.reshape(-1,1)
Weights_delta = Weights.reshape(-1,1)
num_params = Weights_delta.shape[0]

JP = np.zeros((num_params,1))
JM = np.zeros((num_params,1))
J_app = np.zeros((num_params,1))

ep = float(1e-7)

for i in range(num_params):

   Weights_add = np.copy(Weights_delta)

   Weights_add[i] = Weights_add[i] + ep


   Z_add = softmax(np.dot(X_b, Weights_add.reshape(X_train.shape[1]+1,num_unique_labels)), axis = 1)

   JP[i] = CCE( y_train, Z_add).numpy()


   Weights_sub = np.copy(Weights_delta)

   Weights_sub[i] = Weights_sub[i] - ep


   Z_sub = softmax(np.dot(X_b, Weights_sub.reshape(X_train.shape[1]+1,num_unique_labels)), axis = 1)

   JM[i] = CCE( y_train, Z_sub).numpy()


   J_app[i] = (JP[i] - JM[i]) / (2*ep)


num = np.linalg.norm(grad - J_app)

denom = np.linalg.norm(grad) + np.linalg.norm(J_app)

num/denom

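This results in a value of 0.3345, which is clearly an unacceptable difference. It makes me wonder whether I can trust my gradient checking code for binary labels in the first place. I have tested this logistic regression code (with the same gradient computation) on the digits dataset, and the performance was again very good (>95% accuracy, precision, recall). What really puzzles me is that the model fails the gradient check even though it performs well. As I mentioned earlier, the same holds for the neural network (binary passes, multi_class fails).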
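One way to gain some confidence in the checking code itself, independent of any model, is to run it on a function whose gradient is known in closed form. The sketch below wraps the same central-difference loop in a helper (relative_gradient_error is a name chosen for this example, not something from the course code) and tries it on J(W) = 0.5 * sum(W**2), whose gradient is exactly W.

```python
import numpy as np

def relative_gradient_error(loss_fn, weights, analytic_grad, eps=1e-7):
    """Compare analytic_grad with central differences of loss_fn at weights."""
    flat = weights.astype(np.float64).ravel()
    numeric = np.zeros_like(flat)
    for i in range(flat.size):
        plus, minus = flat.copy(), flat.copy()
        plus[i] += eps
        minus[i] -= eps
        numeric[i] = (loss_fn(plus.reshape(weights.shape))
                      - loss_fn(minus.reshape(weights.shape))) / (2 * eps)
    analytic = analytic_grad.ravel()
    num = np.linalg.norm(analytic - numeric)
    denom = np.linalg.norm(analytic) + np.linalg.norm(numeric)
    return num / denom

rng = np.random.default_rng(0)
W = rng.standard_normal((5, 3))

# Known case: J(W) = 0.5 * sum(W**2) has gradient exactly W.
quadratic = lambda w: 0.5 * np.sum(w ** 2)

good = relative_gradient_error(quadratic, W, W)        # correct gradient
bad = relative_gradient_error(quadratic, W, 2.0 * W)   # deliberately wrong gradient
```

If the helper reports a tiny value for the correct gradient and a large one for the deliberately wrong gradient, the checking loop is probably sound, and the problem is more likely in the loss function being differenced.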

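I even tried the code Andrew Ng provides in his Coursera course; even that code passes for binary but fails for multi class. I can't seem to figure out where my code has a bug, and if it has minor bugs, how could it pass in the first case?

I looked at these SO questions, but I feel their problems are different from mine:

1. Gradient checking in backpropogation

2. Checking the gradients when doing ...

3. problem with ann back-propagation ..

Here is what I am looking for:

Suggestions/corrections on whether my gradient computation and gradient checking code for binary prediction are accurate.
Suggestions/general guidance on where I might be going wrong in the multi-class implementation.

What you would get: (:P)

Gratitude from a 20-something techie who thinks every documentation page is poorly written :)

Update: Corrected some typos and added more lines of code as suggested by Alex. I also realized that in the multi-class case my approximate gradient values (named J_app) are quite high (1e+2); since I multiply my original gradient (named gradient) by (1./m), my original gradient values are around (1e-1 to 1e-2).

This obvious difference in range between the approximate gradient values and my original gradient explains why I get a final value around (1e+1, 0.3345). What I can't figure out, however, is how to go about fixing this seemingly obvious bug.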

Answer 1

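All your computations seem to be correct. The gradient check fails because CategoricalCrossentropy in keras runs in single precision by default. As a result, you do not get enough precision in the difference of the final loss caused by the small weight updates. Add the following lines at the beginning of your script and you will get num/denom usually around 1.e-9: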

import keras
keras.backend.set_floatx('float64')
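The precision effect can also be reproduced without Keras. The following is a rough sketch under several assumptions: it uses synthetic data in place of Iris, a hand-written NumPy cross-entropy in place of the Keras loss, and a cast of the final loss value to np.float32 as a crude imitation of single precision (a real float32 loss also rounds its intermediate steps). All helper names are invented for this example.

```python
import numpy as np

def softmax_rows(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(W, X, Y):
    """Mean categorical cross-entropy, computed in float64."""
    probs = softmax_rows(X @ W)
    return -np.mean(np.sum(Y * np.log(probs), axis=1))

def check(loss_fn, W, grad, eps=1e-7):
    """Relative error between grad and central differences of loss_fn."""
    numeric = np.zeros(W.size)
    for i in range(W.size):
        plus, minus = W.ravel().copy(), W.ravel().copy()
        plus[i] += eps
        minus[i] -= eps
        numeric[i] = (loss_fn(plus.reshape(W.shape))
                      - loss_fn(minus.reshape(W.shape))) / (2 * eps)
    g = grad.ravel()
    return np.linalg.norm(g - numeric) / (np.linalg.norm(g) + np.linalg.norm(numeric))

# Synthetic stand-in for the data: 150 samples, 4 features plus bias, 3 classes.
rng = np.random.default_rng(0)
m, n_features, n_classes = 150, 4, 3
X = np.c_[np.ones((m, 1)), rng.standard_normal((m, n_features))]
Y = np.eye(n_classes)[rng.integers(0, n_classes, size=m)]  # one-hot labels
W = rng.standard_normal((n_features + 1, n_classes)) * np.sqrt(1.0 / (n_features + 1))

# Same analytic gradient as in the question.
grad = X.T @ (softmax_rows(X @ W) - Y) / m

# Loss kept in double precision.
err64 = check(lambda w: cross_entropy(w, X, Y), W, grad)
# Loss rounded to single precision before differencing.
err32 = check(lambda w: float(np.float32(cross_entropy(w, X, Y))), W, grad)
```

A float32 value near 1 is only resolved to about 1e-7, while a perturbation of ep = 1e-7 changes the loss by roughly 1e-8, so the difference JP - JM is mostly rounding noise once the loss has been rounded. That is consistent with the sigmoid version passing: there the loss comes from sklearn's log_loss, which works on float64 NumPy arrays.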


python

machine-learning

neural-network

gradient-descent

logistic-regression