Where to apply batch normalization on standard CNNs

I have the following architecture:

Conv1
Relu1
Pooling1
Conv2
Relu2
Pooling2
FullyConnect1
FullyConnect2

My question is, where do I apply batch normalization? And what would be the best function to do this in TensorFlow?

The original batch-norm paper prescribes using batch norm before the ReLU activation. But there is evidence that it's probably better to use batch norm after the activation. Here's a comment from François Chollet on the Keras GitHub:

... I can guarantee that recent code written by Christian [Szegedy] applies relu before BN. It is still occasionally a topic of debate, though.

As for your second question: in TensorFlow, you can use the high-level tf.layers.batch_normalization function, or the low-level tf.nn.batch_normalization.
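
As a rough sketch of the difference (TF 1.x-style API; the input shape and the is_training placeholder are my own illustrative choices), the high-level function creates the scale/offset variables and moving averages for you, while the low-level one expects you to supply the statistics yourself:

```python
import tensorflow as tf  # TF 1.x-style API

x = tf.placeholder(tf.float32, [None, 64])   # hypothetical layer input
is_training = tf.placeholder(tf.bool, [])    # batch statistics vs. moving averages

# High-level: creates gamma/beta and the moving averages internally.
h = tf.layers.batch_normalization(x, training=is_training)
# The moving averages are updated via UPDATE_OPS, so run them with the train op:
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)

# Low-level: you compute the statistics and manage beta/gamma yourself.
mean, variance = tf.nn.moments(x, axes=[0])
beta = tf.Variable(tf.zeros([64]))
gamma = tf.Variable(tf.ones([64]))
h_manual = tf.nn.batch_normalization(x, mean, variance, beta, gamma,
                                     variance_epsilon=1e-3)
```

With the high-level version, remember to wrap your optimizer step in tf.control_dependencies(update_ops) so the moving statistics used at inference time actually get updated.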

There is some debate about this; this question and this Keras thread are examples of the debate. Andrew Ng says that batch normalization should be applied immediately before the non-linearity of the current layer. The authors of the BN paper said that as well, but now, according to François Chollet on the Keras thread, the BN paper's authors use BN after the activation layer. On the other hand, there are some benchmarks, such as the one discussed in this torch-residual-networks GitHub issue, which show BN performing better after the activation layers.

My current opinion (subject to correction) is that you should do BN after the activation layer; if you have the budget for it and want to squeeze out extra accuracy, try it before the activation layer as well.

So adding batch normalization to your CNN would look like this (a TensorFlow sketch of the same layout follows the list):

Conv1
Relu1
BatchNormalization
Pooling1
Conv2
Relu2
BatchNormalization
Pooling2
FullyConnect1
BatchNormalization
FullyConnect2
BatchNormalization
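
Here is a minimal TF 1.x-style sketch of that layout (the filter counts, kernel sizes, unit counts, and the 28x28x1 input shape are made-up placeholders, not something from the question):

```python
import tensorflow as tf  # TF 1.x-style API

images = tf.placeholder(tf.float32, [None, 28, 28, 1])  # hypothetical input shape
is_training = tf.placeholder(tf.bool, [])

# Conv1 -> Relu1 -> BatchNormalization -> Pooling1
net = tf.layers.conv2d(images, filters=32, kernel_size=3, padding="same")
net = tf.nn.relu(net)
net = tf.layers.batch_normalization(net, training=is_training)
net = tf.layers.max_pooling2d(net, pool_size=2, strides=2)

# Conv2 -> Relu2 -> BatchNormalization -> Pooling2
net = tf.layers.conv2d(net, filters=64, kernel_size=3, padding="same")
net = tf.nn.relu(net)
net = tf.layers.batch_normalization(net, training=is_training)
net = tf.layers.max_pooling2d(net, pool_size=2, strides=2)

# FullyConnect1 -> BatchNormalization -> FullyConnect2 -> BatchNormalization
# (this mirrors the list above)
net = tf.layers.flatten(net)
net = tf.layers.dense(net, units=128, activation=tf.nn.relu)
net = tf.layers.batch_normalization(net, training=is_training)
net = tf.layers.dense(net, units=10)
net = tf.layers.batch_normalization(net, training=is_training)
```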

In addition to the original paper using batch normalization before the activation, Bengio's book Deep Learning, section 8.7.1, gives some reasoning for why applying batch normalization after the activation (or directly before the input to the next layer) may cause some issues:

It is natural to wonder whether we should apply batch normalization to the input X, or to the transformed value XW+b. Ioffe and Szegedy (2015) recommend the latter. More specifically, XW+b should be replaced by a normalized version of XW. The bias term should be omitted because it becomes redundant with the β parameter applied by the batch normalization reparameterization. The input to a layer is usually the output of a nonlinear activation function such as the rectified linear function in a previous layer. The statistics of the input are thus more non-Gaussian and less amenable to standardization by linear operations.

In other words, if we use a ReLU activation, all negative values are mapped to zero, so roughly half of the values end up being exactly zero. This will likely pull the mean close to zero already, but the distribution of the remaining data will be heavily skewed to the right. Trying to normalize that data into a nice bell curve probably won't give the best results. For activations outside of the ReLU family this may be less of an issue.
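
A quick NumPy check makes this concrete (the standard-normal pre-activations are just an assumption to make the shape visible):

```python
import numpy as np

rng = np.random.default_rng(0)
pre = rng.standard_normal(1_000_000)   # assume roughly Gaussian pre-activations
post = np.maximum(pre, 0.0)            # ReLU clips every negative value to zero

print(f"fraction of exact zeros: {np.mean(post == 0.0):.2f}")   # ~0.50
print(f"mean: {post.mean():.2f}  std: {post.std():.2f}")        # ~0.40, ~0.58
# Subtracting the mean and dividing by the std re-centres and rescales these
# values, but the shape stays a spike at one point plus a long right tail,
# not a bell curve.
```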

Some people report better results when placing batch normalization after the activation, while others get better results with batch normalization before the activation. It is an open debate. I suggest you test your model with both configurations, and if batch normalization after the activation gives a significant decrease in validation loss, use that configuration instead.
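
If you want to run that comparison without writing the network twice, a small helper with a placement flag is enough. This is only a sketch in the TF 1.x style used above; conv_block and bn_after_relu are names I made up for illustration:

```python
import tensorflow as tf  # TF 1.x-style API

def conv_block(x, filters, is_training, bn_after_relu=True):
    """Conv -> (BN?) -> ReLU -> (BN?) -> Pool, with the BN position as a flag."""
    x = tf.layers.conv2d(x, filters=filters, kernel_size=3, padding="same")
    if not bn_after_relu:
        x = tf.layers.batch_normalization(x, training=is_training)
    x = tf.nn.relu(x)
    if bn_after_relu:
        x = tf.layers.batch_normalization(x, training=is_training)
    return tf.layers.max_pooling2d(x, pool_size=2, strides=2)

# Train one model with bn_after_relu=True and one with bn_after_relu=False,
# then keep whichever configuration gives the lower validation loss.
```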