How to define weight decay for individual layers in TensorFlow?

Question

In CUDA ConvNet, we can write something like this (source) for each layer:

[conv32]
epsW=0.001
epsB=0.002
momW=0.9
momB=0.9
wc=0

where wc=0 refers to the L2 weight decay.

How can the same be achieved in TensorFlow?

Answer 1

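You can add every variable that should receive weight decay to a collection named 'weights', and then compute the L2-norm weight decay over the whole collection.

  import tensorflow as tf  # TensorFlow 1.x API

  WEIGHT_DECAY_FACTOR = 0.0005  # example value

  # Create your variables and put them in the 'weights' collection.
  # GLOBAL_VARIABLES is listed too, so initializers and savers still see them.
  # The shape below is only an example.
  weights = tf.get_variable(
      'weights', shape=[3, 3, 3, 32],
      collections=['weights', tf.GraphKeys.GLOBAL_VARIABLES])

  with tf.variable_scope('weights_norm') as scope:
    weights_norm = tf.reduce_sum(
        # tf.pack was renamed to tf.stack in TensorFlow 1.0
        input_tensor=WEIGHT_DECAY_FACTOR * tf.stack(
            [tf.nn.l2_loss(i) for i in tf.get_collection('weights')]
        ),
        name='weights_norm'
    )

  # Add the weight decay loss to another collection called losses
  tf.add_to_collection('losses', weights_norm)

  # Add the other loss components to the collection losses
  # ...

  # To calculate your total loss
  total_loss = tf.add_n(tf.get_collection('losses'), name='total_loss')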
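
For reference, tf.nn.l2_loss(t) computes sum(t ** 2) / 2, so the term above is the factor times half the sum of squared weights over the collection. A minimal pure-Python sketch of that arithmetic, where the weights and the factor are made-up illustration values rather than anything from the answer:

```python
# Pure-Python sketch of the penalty computed above. The weights and the
# factor are made-up illustration values, not taken from the answer.

def l2_loss(values):
    """Half the sum of squares, which is what tf.nn.l2_loss computes."""
    return sum(v * v for v in values) / 2.0


WEIGHT_DECAY_FACTOR = 0.5

# Stand-in for the variables stored in the 'weights' collection.
weights_collection = [
    [1.0, -2.0],  # flattened weights of one layer
    [3.0],        # flattened weights of another layer
]

# One penalty term per variable, scaled by the shared factor, then summed.
weights_norm = sum(WEIGHT_DECAY_FACTOR * l2_loss(w) for w in weights_collection)
print(weights_norm)
```

Note that a single shared factor applies the same rate to every variable in the collection; a per-layer rate would need one factor per variable.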

Answer 2

get_variable(
name,
shape=None,
dtype=None,
initializer=None,
regularizer=None,
trainable=True,
collections=None,
caching_device=None,
partitioner=None,
validate_shape=True,
use_resource=None,
custom_getter=None)

This is the signature of TensorFlow's get_variable function. You can easily specify a regularizer to apply weight decay.

Here is an example:

weight_decay = tf.constant(0.0005, dtype=tf.float32) # your weight decay rate, must be a scalar tensor.
W = tf.get_variable(name='weight', shape=[4, 4, 256, 512], regularizer=tf.contrib.layers.l2_regularizer(weight_decay))
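
One detail worth spelling out: in TensorFlow 1.x the regularizer passed to get_variable does not change the loss by itself. Its result is added to the tf.GraphKeys.REGULARIZATION_LOSSES collection, and you still have to sum that collection and add it to your loss. Because each variable gets its own regularizer, each layer can have its own rate. The sketch below mimics that bookkeeping in plain Python; the list and both helper functions are hypothetical stand-ins, not the TensorFlow API, and all numbers are made up.

```python
# Hypothetical stand-ins that mimic the TensorFlow 1.x bookkeeping in
# plain Python; none of these names are the real TensorFlow API.

REGULARIZATION_LOSSES = []  # plays the role of tf.GraphKeys.REGULARIZATION_LOSSES


def l2_regularizer(scale):
    """Return a function computing scale * sum(w ** 2) / 2."""
    def regularizer(values):
        return scale * sum(v * v for v in values) / 2.0
    return regularizer


def get_variable(values, regularizer=None):
    """Register the variable's penalty, if any, and return the values."""
    if regularizer is not None:
        REGULARIZATION_LOSSES.append(regularizer(values))
    return values


# Each layer gets its own rate; the last one gets none (like wc=0).
conv1 = get_variable([2.0, 2.0], regularizer=l2_regularizer(0.5))
conv2 = get_variable([4.0], regularizer=l2_regularizer(0.25))
fc = get_variable([10.0])

data_loss = 1.0  # made-up value for the unregularized loss
total_loss = data_loss + sum(REGULARIZATION_LOSSES)
print(total_loss)
```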

Answer 3

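Both current answers are wrong, in that they don't give you "weight decay as in cuda-convnet" but L2 regularization, which is different.

When using pure SGD (without momentum) as the optimizer, weight decay is the same thing as adding an L2-regularization term to the loss. With any other optimizer, this is not the case.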

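Weight decay (I don't know how to TeX here, so excuse my pseudo-notation; dw is the gradient of the actual loss with respect to w):

w[t+1] = w[t] - learning_rate * dw - weight_decay * w[t]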

L2 regularization:

loss = actual_loss + lambda * 1/2 * sum(||w||_2^2 for w in network_params)

Computing the gradient of the extra term in L2 regularization gives lambda * w, and inserting it into the SGD update equation

dloss_dw = dactual_loss_dw + lambda * w[t]
w[t+1] = w[t] - learning_rate * dloss_dw
       = w[t] - learning_rate * dactual_loss_dw - learning_rate * lambda * w[t]

gives the same as weight decay, but mixes lambda with the learning_rate. Any other optimizer, even SGD with momentum, gives a different update rule for weight decay than for L2 regularization! See the paper Fixing weight decay in Adam for more details. (Edit: AFAIK, this 1987 Hinton paper introduced "weight decay", literally as "each time the weights are updated, their magnitude is also decremented by 0.4%" on page 10.)
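
The claim above can be checked numerically on a toy problem. The following is a minimal sketch, assuming a scalar weight, a made-up quadratic loss, and one common momentum formulation (v = momentum * v + grad; w = w - learning_rate * v); all constants are illustration values.

```python
# Scalar toy problem; all constants are made-up illustration values.
# Actual loss: 0.5 * (w - 1) ** 2, so its gradient is (w - 1).

LR = 0.1
LAMBDA = 0.5
WEIGHT_DECAY = LR * LAMBDA  # the equivalent decay rate under plain SGD
MOMENTUM = 0.9
STEPS = 5


def grad_actual(w):
    return w - 1.0


def sgd_l2(w):
    """Plain SGD on the loss with an L2 term added."""
    for _ in range(STEPS):
        w = w - LR * (grad_actual(w) + LAMBDA * w)
    return w


def sgd_decay(w):
    """Plain SGD on the actual loss plus an explicit decay step."""
    for _ in range(STEPS):
        w = w - LR * grad_actual(w) - WEIGHT_DECAY * w
    return w


def momentum_l2(w):
    """Momentum SGD where the L2 gradient enters the momentum buffer."""
    v = 0.0
    for _ in range(STEPS):
        v = MOMENTUM * v + grad_actual(w) + LAMBDA * w
        w = w - LR * v
    return w


def momentum_decay(w):
    """Momentum SGD on the actual loss plus an explicit decay step."""
    v = 0.0
    for _ in range(STEPS):
        v = MOMENTUM * v + grad_actual(w)
        w = w - LR * v - WEIGHT_DECAY * w
    return w


print(sgd_l2(2.0), sgd_decay(2.0))            # agree up to rounding
print(momentum_l2(2.0), momentum_decay(2.0))  # clearly different
```

With plain SGD the two variants coincide once weight_decay is set to learning_rate * lambda. With momentum, the L2 gradient is accumulated in the momentum buffer while the explicit decay is not, so the trajectories separate from the second step on.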

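That being said, there doesn't seem to be support for "proper" weight decay in TensorFlow yet. There are a few issues discussing it, specifically because of the above paper.

One possible way to implement it is to write an op that performs the decay step manually after every optimizer step. A different way, which is what I'm currently doing, is to use an additional SGD optimizer just for the weight decay and "attach" it to your train_op. Both of these are just crude workarounds, though. My current code: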

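# Assumed imports (TensorFlow 1.x); the snippet relies on tf.contrib.layers,
# where the dense layer is called fully_connected.
import tensorflow as tf
from tensorflow.contrib import layers
from tensorflow.contrib.framework import arg_scope

# In the network definition:
with arg_scope([layers.conv2d, layers.fully_connected],
               weights_regularizer=layers.l2_regularizer(weight_decay)):
    ...  # define the network.

loss = ...  # compute the actual loss of your problem.
train_op = optimizer.minimize(loss, global_step=global_step)
if args.weight_decay not in (None, 0):
    # Run the decay step only after the main optimizer step has finished.
    with tf.control_dependencies([train_op]):
        sgd = tf.train.GradientDescentOptimizer(learning_rate=1.0)
        train_op = sgd.minimize(tf.add_n(tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)))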

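This makes some use of the bookkeeping TensorFlow provides. Note that the arg_scope takes care of appending an L2-regularization term for every layer to the REGULARIZATION_LOSSES graph key, which I then sum up and optimize using SGD, which, as shown above, corresponds to actual weight decay.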
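
To see why the extra optimizer uses learning_rate=1.0, here is a minimal scalar sketch of the two-step update. The main optimizer step is a hypothetical stand-in and the numbers are made up; the point is only that an SGD step with learning rate 1.0 on weight_decay * 0.5 * w ** 2 shrinks w by exactly the factor (1 - weight_decay), independent of the main learning rate.

```python
# Scalar sketch of the two-step update; all values are made up.

WEIGHT_DECAY = 0.01


def regularizer_grad(w):
    """Gradient of WEIGHT_DECAY * 0.5 * w ** 2 with respect to w."""
    return WEIGHT_DECAY * w


def main_step(w):
    """Hypothetical stand-in for any optimizer's step on the actual loss."""
    return w - 0.25


w = 2.0
w_after_main = main_step(w)

# Second step: plain SGD with learning_rate=1.0 on the regularization loss.
w_after_decay = w_after_main - 1.0 * regularizer_grad(w_after_main)

# Same result as multiplying the weight by (1 - WEIGHT_DECAY).
print(w_after_decay, (1.0 - WEIGHT_DECAY) * w_after_main)
```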

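Hope that helps. If anyone has a nicer code snippet for this, or if TensorFlow implements it better (i.e. in the optimizers), please share.

Edit: see also this PR, which just got merged into TF.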
