Common causes of nans during training of neural networks

Question

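I've noticed that NaNs are frequently introduced during training.

Often they seem to be introduced by weights in inner-product/fully-connected or convolution layers blowing up.

Is this because the gradient computation is blowing up? Or is it because of weight initialization (and if so, why does weight initialization have this effect)? Or is it likely caused by the nature of the input data?

The overarching question here is simple: what is the most common reason for NaNs to occur during training? And secondly, what are some methods for combating this (and why do they work)?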

Answer 1

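I have come across this phenomenon several times. Here are my observations:

Gradient blow up

Reason: Large gradients throw the learning process off track.

What you should expect: Looking at the runtime log, check the loss value at each iteration. You will notice that the loss starts to grow significantly from iteration to iteration. Eventually the loss becomes too large to be represented by a floating point variable and it turns into nan.

What can you do: Decrease base_lr (in solver.prototxt) by an order of magnitude (at least). If you have several loss layers, inspect the log to see which layer is responsible for the gradient blow up and decrease the loss_weight (in train_val.prototxt) for that specific layer, instead of the general base_lr.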
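The blow-up described above can be reproduced outside caffe. The following is a minimal sketch, not caffe code: it runs plain gradient descent on the toy loss x^2 in float32, and the function name `run_sgd` and the two learning rates are assumptions chosen for illustration. With a learning rate that is too high the loss grows every iteration, overflows to inf, and then turns into nan; with a rate an order of magnitude lower it shrinks.

```python
import numpy as np


def run_sgd(lr, steps=200):
    """Gradient descent on loss(x) = x^2 in float32; returns the loss per step."""
    lr = np.float32(lr)
    x = np.float32(1.0)
    losses = []
    # Silence overflow/invalid warnings: producing inf and nan is the point here.
    with np.errstate(over="ignore", invalid="ignore"):
        for _ in range(steps):
            loss = x * x
            losses.append(float(loss))
            grad = np.float32(2.0) * x  # d(x^2)/dx
            x = x - lr * grad           # plain SGD update
    return losses


diverged = run_sgd(lr=1.5)  # |x| doubles every step until float32 overflows
stable = run_sgd(lr=0.1)    # |x| shrinks by a factor of 0.8 every step
print(diverged[:5], diverged[-1])
print(stable[:5], stable[-1])
```

In the diverging run the loss passes through inf before it becomes nan, which mirrors the pattern to look for in the training log.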

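Bad learning rate policy and params

Reason: caffe fails to compute a valid learning rate and gets 'inf' or 'nan' instead. This invalid rate multiplies all updates, and thus invalidates all parameters.

What you should expect: Looking at the runtime log, you should see that the learning rate itself becomes 'nan', for example:

... sgd_solver.cpp:106] Iteration 0, lr = -nan

What can you do: Fix all parameters affecting the learning rate in your 'solver.prototxt' file.
For instance, if you use lr_policy: "poly" and you forget to define the max_iter parameter, you will end up with lr = nan...
For more information about the learning rate in caffe, see [reference missing].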
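To see why a missing max_iter breaks the "poly" policy, here is a small sketch of the formula, base_lr * (1 - iter/max_iter)^power, written with NumPy floats. It is not caffe's code: the function name `poly_lr` is hypothetical, and it assumes that an unset max_iter behaves as 0.

```python
import numpy as np


def poly_lr(base_lr, it, max_iter, power):
    """Polynomial decay: base_lr * (1 - it / max_iter) ** power, in float32."""
    with np.errstate(divide="ignore", invalid="ignore"):
        frac = np.float32(it) / np.float32(max_iter)  # 0/0 -> nan when max_iter is 0
        return np.float32(base_lr) * np.power(np.float32(1.0) - frac, np.float32(power))


print(poly_lr(0.01, 0, 1000, 0.9))  # a valid rate at iteration 0
print(poly_lr(0.01, 0, 0, 0.9))     # assumed unset max_iter: the rate is nan
```

Because the rate multiplies every update, a single nan here is enough to invalidate all parameters on the first step.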

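Faulty loss function

Reason: Sometimes the computation of the loss in the loss layers causes nans to appear. For example, feeding an "InfogainLoss" layer with non-normalized values, using a custom loss layer with bugs, etc.

What you should expect: Looking at the runtime log you probably won't notice anything unusual: the loss decreases gradually, and all of a sudden a nan appears.

What can you do: See if you can reproduce the error, add printouts to the loss layer, and debug the error.

For example: Once I used a loss that normalized the penalty by the frequency of label occurrence in a batch. It just so happened that if one of the training labels did not appear in the batch at all, the computed loss produced nans. In that case, working with large enough batches (relative to the number of labels in the set) was enough to avoid this error.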
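The label-frequency example can be sketched as follows. This is an illustrative reconstruction, not the answer's actual loss: the function name `class_balanced_loss` and its `guard` flag are assumptions. The loss of each class is averaged over the samples carrying that label, so a label that is absent from the batch yields 0/0.

```python
import numpy as np


def class_balanced_loss(per_sample_loss, labels, num_classes, guard=False):
    """Mean over classes of the per-class mean loss."""
    per_sample_loss = np.asarray(per_sample_loss, dtype=np.float64)
    labels = np.asarray(labels)
    # How often each label occurs in the batch, and the summed loss per label.
    counts = np.bincount(labels, minlength=num_classes).astype(np.float64)
    sums = np.bincount(labels, weights=per_sample_loss, minlength=num_classes)
    if guard:
        present = counts > 0  # skip labels that do not occur in this batch
        return float(np.mean(sums[present] / counts[present]))
    with np.errstate(divide="ignore", invalid="ignore"):
        return float(np.mean(sums / counts))  # an absent label gives 0/0 = nan


losses = [0.5, 0.3, 0.2]
labels = [0, 0, 2]  # label 1 never appears in this batch
print(class_balanced_loss(losses, labels, num_classes=3))
print(class_balanced_loss(losses, labels, num_classes=3, guard=True))
```

Masking absent labels is one alternative to the larger batches mentioned above; which is preferable depends on what the normalization is meant to achieve.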

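Faulty input

Reason: You have an input with nan in it!

What you should expect: Once the learning process "hits" this faulty input, the output becomes nan. Looking at the runtime log you probably won't notice anything unusual: the loss decreases gradually, and all of a sudden a nan appears.

What can you do: Re-build your input datasets (lmdb/leveldb/hdf5...) and make sure there are no bad image files in your data set. For debugging, you can build a simple net that reads the input layer, has a dummy loss on top of it, and runs through all the inputs: if one of them is faulty, this dummy net should also produce nan.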
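Before re-building the dataset it can help to scan the raw arrays directly. The sketch below assumes the samples are already available as NumPy-convertible arrays; the function name `find_non_finite_samples` is hypothetical and reading from lmdb/leveldb/hdf5 is left out.

```python
import numpy as np


def find_non_finite_samples(samples):
    """Return the indices of samples that contain nan or +/-inf."""
    bad = []
    for idx, sample in enumerate(samples):
        arr = np.asarray(sample, dtype=np.float64)
        if not np.all(np.isfinite(arr)):
            bad.append(idx)
    return bad


samples = [np.zeros((2, 2)), np.array([1.0, np.nan]), np.array([np.inf])]
print(find_non_finite_samples(samples))
```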

stride larger than kernel size in `"Pooling"` layer

For some reason, choosing stride > kernel_size for pooling may result in nans. For example:

layer {
  name: "faulty_pooling"
  type: "Pooling"
  bottom: "x"
  top: "y"
  pooling_param {
    pool: AVE
    stride: 5
    kernel_size: 3
  }
}

results in nans in y.
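The answer gives no explanation for this. One possible contributing factor, offered here only as an assumption and not confirmed by the source, is that when the output size is computed as ceil((input - kernel_size) / stride) + 1 with no padding, the last pooling window can start beyond the edge of the input once stride > kernel_size, leaving an empty window to average over. The helper names below are hypothetical.

```python
import math


def pooled_size(input_size, kernel_size, stride):
    """Output length under the assumed ceil-based formula (no padding)."""
    return int(math.ceil(float(input_size - kernel_size) / stride)) + 1


def last_window_starts_outside(input_size, kernel_size, stride):
    """True if the last pooling window begins past the end of the input."""
    last_start = (pooled_size(input_size, kernel_size, stride) - 1) * stride
    return last_start >= input_size


# kernel_size 3 and stride 5 as in the layer above, on an input of length 9
print(pooled_size(9, 3, 5), last_window_starts_outside(9, 3, 5))
# a conventional setting with stride <= kernel_size
print(pooled_size(9, 3, 2), last_window_starts_outside(9, 3, 2))
```

Whether this occurs depends on the input size, so a check like this only shows when the configuration is at risk.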

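Instability in `"BatchNorm"`

It was reported that under some settings the "BatchNorm" layer may output nans due to numerical instabilities.
This issue was raised in bvlc/caffe and PR #5136 is attempting to fix it.

Recently, I became aware of the debug_info flag: setting debug_info: true in 'solver.prototxt' will make caffe print more debug information to the log (including gradient magnitudes and activation values) during training. This information can help in spotting gradient blow-ups and other problems in the training process.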

Answer 2

This answer is not about a cause for nans, but rather proposes a way to help debug them. You can have this python layer:

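import caffe
import numpy as np


class checkFiniteLayer(caffe.Layer):
  """In-place layer that raises as soon as a blob contains nan or inf."""

  def setup(self, bottom, top):
    # param_str from the prototxt is used as a prefix in the error messages
    self.prefix = self.param_str

  def reshape(self, bottom, top):
    pass  # in-place layer: nothing to reshape

  def forward(self, bottom, top):
    for i in range(len(bottom)):
      # number of non-finite entries in the data of this bottom blob
      isbad = np.sum(1 - np.isfinite(bottom[i].data[...]))
      if isbad > 0:
        raise Exception("checkFiniteLayer: %s forward pass bottom %d has %.2f%% non-finite elements" %
                        (self.prefix, i, 100 * float(isbad) / bottom[i].count))

  def backward(self, top, propagate_down, bottom):
    for i in range(len(top)):
      if not propagate_down[i]:
        continue
      # number of non-finite entries in the gradient of this top blob
      isf = np.sum(1 - np.isfinite(top[i].diff[...]))
      if isf > 0:
        raise Exception("checkFiniteLayer: %s backward pass top %d has %.2f%% non-finite elements" %
                        (self.prefix, i, 100 * float(isf) / top[i].count))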

Add this layer to your train_val.prototxt at the points you suspect may cause trouble:

layer {
  type: "Python"
  name: "check_loss"
  bottom: "fc2"
  top: "fc2"  # "in-place" layer
  python_param {
    module: "/path/to/python/file/check_finite_layer.py" # must be in $PYTHONPATH
    layer: "checkFiniteLayer"
    param_str: "prefix-check_loss" # string for printouts
  }
}

Answer 3

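I was trying to build a sparse autoencoder and had several layers in it to induce sparsity. While running my net, I encountered NaNs. On removing some of the layers (in my case, I actually had to remove 1), I found that the NaNs disappeared. So, I guess too much sparsity may lead to NaNs as well (some 0/0 computations may have been invoked!?)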

Answer 4

In my case, not setting the bias in the convolution/deconvolution layers was the cause.

Solution: add the following to the convolution layer parameters.

bias_filler {
      type: "constant"
      value: 0
    }

Answer 5

The learning_rate was too high and should be decreased. The accuracy in my RNN code was nan; selecting a low value for the learning rate fixed it.
