卷积层中的偏差真的会对测试精度产生影响吗?

Does bias in the convolutional layer really make a difference to the test accuracy?

我知道在小型网络中需要偏置来移动激活函数。但是对于具有多层 CNN、池化、dropout 和其他非线性激活的深度网络,Bias 真的会产生影响吗? 卷积滤波器正在学习局部特征并且对于使用给定的 conv 输出通道相同的偏差。

这不是 this link 的骗局。以上 link 仅解释了偏差在小型神经网络中的作用,并没有试图解释偏差在包含多个 CNN 层、dropouts、池化和非线性激活函数的深度网络中的作用。

我 运行 一个简单的实验,结果表明从转换层中去除偏差对最终测试精度没有影响。 训练了两个模型,测试精度几乎相同(在没有偏差的情况下稍微好一点。)

是否仅出于历史原因使用它们?

如果使用偏差不能提高准确性,我们不应该忽略它们吗?需要学习的参数更少。

如果有人比我知识更渊博,我将不胜感激,可以解释深度网络中这些偏见的重要性(如果有的话)。

这里是完整代码和实验结果bias-VS-no_bias experiment

batch_size = 16
patch_size = 5
depth = 16
num_hidden = 64

graph = tf.Graph()

with graph.as_default():

  # Input data.
  tf_train_dataset = tf.placeholder(
    tf.float32, shape=(batch_size, image_size, image_size, num_channels))
  tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
  tf_valid_dataset = tf.constant(valid_dataset)
  tf_test_dataset = tf.constant(test_dataset)

  # Variables.
  layer1_weights = tf.Variable(tf.truncated_normal(
      [patch_size, patch_size, num_channels, depth], stddev=0.1))
  layer1_biases = tf.Variable(tf.zeros([depth]))
  layer2_weights = tf.Variable(tf.truncated_normal(
      [patch_size, patch_size, depth, depth], stddev=0.1))
  layer2_biases = tf.Variable(tf.constant(1.0, shape=[depth]))
  layer3_weights = tf.Variable(tf.truncated_normal(
      [image_size // 4 * image_size // 4 * depth, num_hidden], stddev=0.1))
  layer3_biases = tf.Variable(tf.constant(1.0, shape=[num_hidden]))
  layer4_weights = tf.Variable(tf.truncated_normal(
      [num_hidden, num_labels], stddev=0.1))
  layer4_biases = tf.Variable(tf.constant(1.0, shape=[num_labels]))

  # define a Model with bias .
  def model_with_bias(data):
    conv = tf.nn.conv2d(data, layer1_weights, [1, 2, 2, 1], padding='SAME')
    hidden = tf.nn.relu(conv + layer1_biases)
    conv = tf.nn.conv2d(hidden, layer2_weights, [1, 2, 2, 1], padding='SAME')
    hidden = tf.nn.relu(conv + layer2_biases)
    shape = hidden.get_shape().as_list()
    reshape = tf.reshape(hidden, [shape[0], shape[1] * shape[2] * shape[3]])
    hidden = tf.nn.relu(tf.matmul(reshape, layer3_weights) + layer3_biases)
    return tf.matmul(hidden, layer4_weights) + layer4_biases

  # define a Model without bias added in the convolutional layer.
  def model_without_bias(data):
    conv = tf.nn.conv2d(data, layer1_weights, [1, 2, 2, 1], padding='SAME')
    hidden = tf.nn.relu(conv ) # layer1_ bias is not added 
    conv = tf.nn.conv2d(hidden, layer2_weights, [1, 2, 2, 1], padding='SAME')
    hidden = tf.nn.relu(conv) # + layer2_biases)
    shape = hidden.get_shape().as_list()
    reshape = tf.reshape(hidden, [shape[0], shape[1] * shape[2] * shape[3]])
    # bias are added only in Fully connected layer(layer 3 and layer 4)
    hidden = tf.nn.relu(tf.matmul(reshape, layer3_weights) + layer3_biases)
    return tf.matmul(hidden, layer4_weights) + layer4_biases

  # Training computation.
  logits_with_bias = model_with_bias(tf_train_dataset)
  loss_with_bias = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=tf_train_labels, logits=logits_with_bias))

  logits_without_bias = model_without_bias(tf_train_dataset)
  loss_without_bias = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=tf_train_labels, logits=logits_without_bias))

  # Optimizer.
  optimizer_with_bias = tf.train.GradientDescentOptimizer(0.05).minimize(loss_with_bias)
  optimizer_without_bias = tf.train.GradientDescentOptimizer(0.05).minimize(loss_without_bias)

  # Predictions for the training, validation, and test data.
  train_prediction_with_bias = tf.nn.softmax(logits_with_bias)
  valid_prediction_with_bias = tf.nn.softmax(model_with_bias(tf_valid_dataset))
  test_prediction_with_bias = tf.nn.softmax(model_with_bias(tf_test_dataset))

  # Predictions for without
  train_prediction_without_bias = tf.nn.softmax(logits_without_bias)
  valid_prediction_without_bias = tf.nn.softmax(model_without_bias(tf_valid_dataset))
  test_prediction_without_bias = tf.nn.softmax(model_without_bias(tf_test_dataset))

num_steps = 1001

with tf.Session(graph=graph) as session:
  tf.global_variables_initializer().run()
  print('Initialized')
  for step in range(num_steps):
    offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
    batch_data = train_dataset[offset:(offset + batch_size), :, :, :]
    batch_labels = train_labels[offset:(offset + batch_size), :]
    feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
    session.run(optimizer_with_bias, feed_dict=feed_dict)
    session.run(optimizer_without_bias, feed_dict = feed_dict)
  print('Test accuracy(with bias): %.1f%%' % accuracy(test_prediction_with_bias.eval(), test_labels))
  print('Test accuracy(without bias): %.1f%%' % accuracy(test_prediction_without_bias.eval(), test_labels))

输出:

已初始化

测试准确率(有偏差):90.5%

测试准确率(无偏差):90.6%

Biases are tuned alongside weights by learning algorithms such as gradient descent. biases differ from weights is that they are independent of the output from previous layers. Conceptually bias is caused by input from a neuron with a fixed activation of 1, and so is updated by subtracting the just the product of the delta value and learning rate.

在大型模型中,移除偏置输入几乎没有什么区别,因为每个节点都可以从其所有输入的平均激活中创建一个偏置节点,根据大数定律,这大致是正常的。在第一层,发生这种情况的能力取决于您的输入分布。 在小型网络上,当然需要偏置输入,但在大型网络上,去掉它几乎没有什么区别

虽然在大网络中没有区别,但还是要看网络架构。例如在 LSTM 中:

Most applications of LSTMs simply initialize the LSTMs with small random weights which works well on many problems. But this initialization effectively sets the forget gate to 0.5. This introduces a vanishing gradient with a factor of 0.5 per timestep, which can cause problems whenever the long term dependencies are particularly severe. This problem is addressed by simply initializing the forget gates bias to a large value such as 1 or 2. By doing so, the forget gate will be initialized to a value that is close to 1, enabling gradient flow.

另请参阅:

在大多数网络中,你在转换层之后有一个 batchnorm 层,它有一个偏差。所以如果你有一个 batchnorm 层,就没有好处。看:

否则,从数学的角度来看,你正在学习不同的函数。然而,事实证明,特别是如果你有一个非常复杂的网络来解决一个简单的问题,你可能会在没有偏差的情况下实现几乎相同的结果,而不是有偏差,但最终会使用更多的参数。以我的经验,使用比所需参数多 2-4 倍的参数很少会损害深度学习的性能——尤其是在进行正则化时。因此,很难注意到任何差异。但是,您可能会尝试使用几个通道(我认为网络的深度不如卷积通道的数量那么重要)并查看偏差是否有所作为。我猜是的。