What is the proper way to apply weight decay with the Adam optimizer?
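Since the Adam optimizer keeps a pair of running averages of the gradients (mean/variance), I wonder how it should properly handle weight decay. I have seen two ways of implementing it.
First approach: only update mean/variance from the gradients based on the objective loss, and decay the weights explicitly at each mini-batch. (The following code is taken from https://github.com/dmlc/mxnet/blob/v0.7.0/python/mxnet/optimizer.py)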
weight[:] -= lr * mean / (sqrt(variance) + self.epsilon)

wd = self._get_wd(index)
if wd > 0.:
    weight[:] -= (lr * wd) * weight
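Second approach: update mean/variance from the gradients based on the objective loss plus the regularization loss, and update the weights as usual. (The following code is taken from https://github.com/dmlc/mxnet/blob/master/src/operator/optimizer_op-inl.h#L210)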
grad = scalar<DType>(param.rescale_grad) * grad +
       scalar<DType>(param.wd) * weight;
// stuff
Assign(out, req[0],
       weight -
       scalar<DType>(param.lr) * mean /
       (F<square_root>(var) + scalar<DType>(param.epsilon)));
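These two approaches sometimes show a significant difference in training results. I actually think the first one makes more sense (and I find it gives better results from time to time). Caffe and the old version of mxnet follow the first approach, while torch, tensorflow and the new version of mxnet follow the second one.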
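To make the difference concrete, here is a minimal NumPy sketch of one step of each approach. It is not the mxnet code: the function name `adam_step`, the hyperparameter values and the gradient are made up for illustration, and bias correction is left out for brevity.

```python
import numpy as np


def adam_step(w, grad, m, v, lr=0.1, wd=0.1, beta1=0.9, beta2=0.999,
              eps=1e-8, decoupled=True):
    """One Adam step without bias correction (omitted for brevity)."""
    if not decoupled:
        # Second approach: the decay term enters the gradient and therefore
        # also the running mean/variance.
        grad = grad + wd * w
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    w = w - lr * m / (np.sqrt(v) + eps)
    if decoupled:
        # First approach: decay the weights directly, bypassing mean/variance.
        w = w - lr * wd * w
    return w, m, v


w0 = np.array([1.0, -2.0, 0.5])
grad = np.array([0.3, 0.1, -0.2])  # made-up gradient of the objective loss
zeros = np.zeros_like(w0)

w_decoupled, _, _ = adam_step(w0, grad, zeros, zeros, decoupled=True)
w_l2, _, _ = adam_step(w0, grad, zeros, zeros, decoupled=False)
print(w_decoupled)
print(w_l2)
```

With these toy numbers the two results already differ after a single step, because in the second approach the decay term is rescaled by the running variance along with the rest of the gradient.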
Thanks a lot for your help!
Edit: see also this PR, which just got merged into TF.
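When using pure SGD (without momentum) as the optimizer, weight decay is the same thing as adding an L2-regularization term to the loss. With any other optimizer, this is not true.
Weight decay (I don't know how to TeX here, so excuse my pseudo-notation), where dw is the gradient of the actual loss:
w[t+1] = w[t] - learning_rate * dw - weight_decay * w
L2-regularization:
loss = actual_loss + lambda * 1/2 * sum(||w||_2^2 for w in network_params)
Computing the gradient of the extra term in the L2-regularization gives lambda * w, and inserting it into the SGD update equation
dloss_dw = dactual_loss_dw + lambda * w
w[t+1] = w[t] - learning_rate * dloss_dw
gives the same as weight decay, but mixes lambda with the learning_rate (that is, weight_decay = learning_rate * lambda). Any other optimizer, even SGD with momentum, gives a different update rule for weight decay than for L2-regularization! See the paper Fixing weight decay in Adam for more details. (Edit: AFAIK, this 1987 Hinton paper introduced "weight decay", literally as "each time the weights are updated, their magnitude is also decremented by 0.4%" on page 10.)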
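The claim above can be checked numerically. The following is only a sketch with made-up numbers (the values of `lr`, `lam`, `mu` and the gradient are assumptions, and the gradient of the actual loss is held constant for simplicity): with plain SGD both formulations agree, while with momentum they drift apart.

```python
import numpy as np

lr, lam = 0.1, 0.01
weight_decay = lr * lam  # the "mixing" of lambda with the learning rate

w = np.array([1.0, -2.0, 0.5])
dw = np.array([0.3, 0.1, -0.2])  # made-up gradient of the actual loss

# Plain SGD: both formulations give the same new weights.
w_decay = w - lr * dw - weight_decay * w
w_l2 = w - lr * (dw + lam * w)
sgd_same = np.allclose(w_decay, w_l2)


def momentum_sgd(w, l2, steps=2, mu=0.9):
    """Momentum SGD with either an L2 term in the gradient or direct decay."""
    buf = np.zeros_like(w)
    for _ in range(steps):
        g = dw + lam * w if l2 else dw
        buf = mu * buf + g
        decay = 0.0 if l2 else weight_decay * w
        w = w - lr * buf - decay
    return w


momentum_same = np.allclose(momentum_sgd(w, l2=True), momentum_sgd(w, l2=False))
print(sgd_same, momentum_same)
```

The momentum variants differ from the second step on, because the L2 term from the first step is carried along in the momentum buffer, whereas the directly decayed weights leave the buffer untouched.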
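That being said, there doesn't seem to be support for "proper" weight decay in TensorFlow yet. There are a few issues discussing it, specifically because of the above paper.
One possible way to implement it is to write an op that performs the decay step manually after every optimizer step. A different way, which is what I'm currently doing, is to use an additional SGD optimizer just for the weight decay and to "attach" it to your train_op. Both of these are just crude work-arounds, though. My current code: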
# In the network definition:
with arg_scope([layers.conv2d, layers.dense],
               weights_regularizer=layers.l2_regularizer(weight_decay)):
    ...  # define the network.

loss = ...  # compute the actual loss of your problem.
train_op = optimizer.minimize(loss, global_step=global_step)
if args.weight_decay not in (None, 0):
    # Run the decay step only after the main optimizer step has finished.
    with tf.control_dependencies([train_op]):
        sgd = tf.train.GradientDescentOptimizer(learning_rate=1.0)
        train_op = sgd.minimize(tf.add_n(tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)))
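This makes some use of the bookkeeping TensorFlow provides. Note that the arg_scope takes care of appending an L2-regularization term for every layer to the REGULARIZATION_LOSSES graph key; I then sum them all up and optimize the sum with plain SGD, which, as shown above, corresponds to actual weight decay.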
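Why a learning rate of 1.0 works here can be seen in a small NumPy sketch. It assumes that each regularization term has the form weight_decay * sum(w**2) / 2, which is, as far as I know, what layers.l2_regularizer computes; the weights and the decay value are made up.

```python
import numpy as np

weight_decay = 0.01
w = np.array([1.0, -2.0, 0.5])


def l2_penalty(w):
    # Assumed form of the per-layer regularizer: weight_decay * sum(w**2) / 2.
    return weight_decay * 0.5 * np.sum(w ** 2)


# Central-difference gradient of the penalty, one coordinate at a time.
h = 1e-6
num_grad = np.array([
    (l2_penalty(w + h * e) - l2_penalty(w - h * e)) / (2 * h)
    for e in np.eye(w.size)
])

# An SGD step with learning_rate=1.0 on the penalty alone ...
w_after_sgd = w - 1.0 * num_grad
# ... shrinks every weight by the factor (1 - weight_decay).
w_after_decay = (1.0 - weight_decay) * w
print(w_after_sgd, w_after_decay)
```

Note that with this work-around the decay is not scaled by the main optimizer's learning rate, so weight_decay here plays the role of the per-step shrink factor.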
Hope that helps. If anyone has a nicer code snippet for this, or if TensorFlow implements it better (i.e. in the optimizers), please share.
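I came across the same question. I think this code, which I got from here, will work for you. It implements an Adam optimizer with weight decay by inheriting from tf.train.Optimizer. This is the cleanest solution I have found: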
import re

import tensorflow as tf  # TensorFlow 1.x API (tf.train.Optimizer, tf.get_variable)


class AdamWeightDecayOptimizer(tf.train.Optimizer):
    """A basic Adam optimizer that includes "correct" L2 weight decay."""

    def __init__(self,
                 learning_rate,
                 weight_decay_rate=0.0,
                 beta_1=0.9,
                 beta_2=0.999,
                 epsilon=1e-6,
                 exclude_from_weight_decay=None,
                 name="AdamWeightDecayOptimizer"):
        """Constructs a AdamWeightDecayOptimizer."""
        super(AdamWeightDecayOptimizer, self).__init__(False, name)

        self.learning_rate = learning_rate
        self.weight_decay_rate = weight_decay_rate
        self.beta_1 = beta_1
        self.beta_2 = beta_2
        self.epsilon = epsilon
        self.exclude_from_weight_decay = exclude_from_weight_decay

    def apply_gradients(self, grads_and_vars, global_step=None, name=None):
        """See base class."""
        assignments = []
        for (grad, param) in grads_and_vars:
            if grad is None or param is None:
                continue

            param_name = self._get_variable_name(param.name)

            m = tf.get_variable(
                name=param_name + "/adam_m",
                shape=param.shape.as_list(),
                dtype=tf.float32,
                trainable=False,
                initializer=tf.zeros_initializer())
            v = tf.get_variable(
                name=param_name + "/adam_v",
                shape=param.shape.as_list(),
                dtype=tf.float32,
                trainable=False,
                initializer=tf.zeros_initializer())

            # Standard Adam update (note: no bias correction is applied here).
            next_m = (
                tf.multiply(self.beta_1, m) + tf.multiply(1.0 - self.beta_1, grad))
            next_v = (
                tf.multiply(self.beta_2, v) +
                tf.multiply(1.0 - self.beta_2, tf.square(grad)))

            update = next_m / (tf.sqrt(next_v) + self.epsilon)

            # Just adding the square of the weights to the loss function is *not*
            # the correct way of using L2 regularization/weight decay with Adam,
            # since that will interact with the m and v parameters in strange ways.
            #
            # Instead we want to decay the weights in a manner that doesn't interact
            # with the m/v parameters. This is equivalent to adding the square
            # of the weights to the loss with plain (non-momentum) SGD.
            if self._do_use_weight_decay(param_name):
                update += self.weight_decay_rate * param

            update_with_lr = self.learning_rate * update

            next_param = param - update_with_lr

            assignments.extend(
                [param.assign(next_param),
                 m.assign(next_m),
                 v.assign(next_v)])
        return tf.group(*assignments, name=name)

    def _do_use_weight_decay(self, param_name):
        """Whether to use L2 weight decay for `param_name`."""
        if not self.weight_decay_rate:
            return False
        if self.exclude_from_weight_decay:
            for r in self.exclude_from_weight_decay:
                if re.search(r, param_name) is not None:
                    return False
        return True

    def _get_variable_name(self, param_name):
        """Get the variable name from the tensor name."""
        # Raw string so that `\d` is not treated as a string escape.
        m = re.match(r"^(.*):\d+$", param_name)
        if m is not None:
            param_name = m.group(1)
        return param_name
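The exclude_from_weight_decay argument takes a list of regular expressions that are matched against variable names. The sketch below mirrors the two helper methods as free functions so they can be tried without TensorFlow; the function names, variable names and exclusion patterns are hypothetical and only serve as an illustration.

```python
import re


def strip_output_index(tensor_name):
    """Drop the ':0' style suffix, as _get_variable_name does."""
    match = re.match(r"^(.*):\d+$", tensor_name)
    return match.group(1) if match is not None else tensor_name


def uses_weight_decay(param_name, weight_decay_rate, exclude_patterns):
    """Mirror of _do_use_weight_decay as a free function."""
    if not weight_decay_rate:
        return False
    for pattern in exclude_patterns or []:
        if re.search(pattern, param_name) is not None:
            return False
    return True


# Hypothetical variable names and exclusion patterns, for illustration only.
exclude = ["LayerNorm", "bias"]
names = ["dense/kernel:0", "dense/bias:0", "encoder/LayerNorm/gamma:0"]
decayed = [n for n in map(strip_output_index, names)
           if uses_weight_decay(n, 0.01, exclude)]
print(decayed)
```

Since re.search matches anywhere in the name, a short pattern such as "bias" excludes every variable whose name contains that substring.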
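You can use it in the following way (I have made some changes to make it useful in a more general context). This function returns a train_op that can be run in a session: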
# Assumes `import tensorflow as tf` (TensorFlow 1.x) and the
# AdamWeightDecayOptimizer class defined above.
def create_optimizer(loss, init_lr, num_train_steps, num_warmup_steps):
    """Creates an optimizer training op."""
    global_step = tf.train.get_or_create_global_step()

    learning_rate = tf.constant(value=init_lr, shape=[], dtype=tf.float32)

    # Implements linear decay of the learning rate.
    learning_rate = tf.train.polynomial_decay(
        learning_rate,
        global_step,
        num_train_steps,
        end_learning_rate=0.0,
        power=1.0,
        cycle=False)

    # Implements linear warmup. I.e., if global_step < num_warmup_steps, the
    # learning rate will be `global_step/num_warmup_steps * init_lr`.
    if num_warmup_steps:
        global_steps_int = tf.cast(global_step, tf.int32)
        warmup_steps_int = tf.constant(num_warmup_steps, dtype=tf.int32)

        global_steps_float = tf.cast(global_steps_int, tf.float32)
        warmup_steps_float = tf.cast(warmup_steps_int, tf.float32)

        warmup_percent_done = global_steps_float / warmup_steps_float
        warmup_learning_rate = init_lr * warmup_percent_done

        is_warmup = tf.cast(global_steps_int < warmup_steps_int, tf.float32)
        learning_rate = (
            (1.0 - is_warmup) * learning_rate + is_warmup * warmup_learning_rate)

    # It is recommended that you use this optimizer for fine tuning, since this
    # is how the model was trained (note that the Adam m/v variables are NOT
    # loaded from init_checkpoint.)
    optimizer = AdamWeightDecayOptimizer(
        learning_rate=learning_rate,
        weight_decay_rate=0.01,
        beta_1=0.9,
        beta_2=0.999,
        epsilon=1e-6)

    tvars = tf.trainable_variables()
    grads = tf.gradients(loss, tvars)

    # You can clip the gradients at this step if you need to (in general it is
    # not necessary):
    # (grads, _) = tf.clip_by_global_norm(grads, clip_norm=1.0)

    train_op = optimizer.apply_gradients(
        zip(grads, tvars), global_step=global_step)

    # Normally the global step update is done inside of `apply_gradients`.
    # However, `AdamWeightDecayOptimizer` doesn't do this. But if you use
    # a different optimizer, you should probably take this line out.
    new_global_step = global_step + 1
    train_op = tf.group(train_op, [global_step.assign(new_global_step)])
    return train_op
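The learning-rate schedule built inside create_optimizer can be easier to follow in plain Python. This is a sketch of the same arithmetic without TensorFlow; the helper name `learning_rate_at` and the settings below are made up for illustration.

```python
def learning_rate_at(step, init_lr, num_train_steps, num_warmup_steps):
    """Learning rate at a given global step, in plain Python."""
    # Linear decay from init_lr to 0.0; the step is clamped because cycle=False.
    progress = min(step, num_train_steps) / num_train_steps
    decayed_lr = init_lr * (1.0 - progress)
    # Linear warmup overrides the decayed value during the first steps.
    if num_warmup_steps and step < num_warmup_steps:
        return init_lr * step / num_warmup_steps
    return decayed_lr


# Made-up settings: 1000 training steps, the first 100 of them warmup.
schedule = [learning_rate_at(s, 1e-3, 1000, 100) for s in (0, 50, 100, 550, 1000)]
print(schedule)
```

The rate rises linearly during warmup, then jumps onto the linear decay line (which at the end of warmup is already slightly below init_lr) and reaches zero at num_train_steps.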