Gradient Descent vs Adagrad vs Momentum in TensorFlow
I am studying TensorFlow and how to use it, even though I am not an expert on neural networks and deep learning (just the basics).
Following the tutorials, I don't understand the real and practical differences between the three optimizers for the loss. I looked at the API and I understand the principles, but my questions are:
1. When is it preferable to use one instead of the others?
2. Are there important differences to know about?
Here is a brief explanation based on my understanding:
- momentum helps SGD to navigate along the relevant directions and softens the oscillations in the irrelevant ones. It simply adds a fraction of the direction of the previous step to the current step. This achieves amplification of speed in the correct direction and softens oscillation in wrong directions. That fraction is usually in the (0, 1) range. It also makes sense to use adaptive momentum: at the beginning of learning a big momentum will only hinder your progress, so it makes sense to use something like 0.01, and once all the high gradients have disappeared you can use a bigger momentum. There is one problem with momentum: when we are very close to the goal, our momentum is in most cases very high and it does not know that it should slow down. This can cause it to miss or oscillate around the minima.
- nesterov accelerated gradient overcomes this problem by starting to slow down early. In momentum we first compute the gradient and then make a jump in that direction, amplified by whatever momentum we had previously. NAG does the same thing but in a different order: first we make a big jump based on our stored information, then we calculate the gradient and make a small correction. This seemingly irrelevant change gives significant practical speedups.
- AdaGrad, or adaptive gradient, allows the learning rate to adapt based on the parameters. It performs larger updates for infrequent parameters and smaller updates for frequent ones. Because of this it is well suited for sparse data (NLP or image recognition). Another advantage is that it basically eliminates the need to tune the learning rate. Each parameter has its own learning rate, and due to the peculiarities of the algorithm the learning rate is monotonically decreasing. This causes the biggest problem: at some point in time the learning rate is so small that the system stops learning.
- AdaDelta resolves the problem of the monotonically decreasing learning rate in AdaGrad. In AdaGrad the learning rate is calculated approximately as one divided by the sum of square roots. At each stage you add another square root to the sum, which causes the denominator to keep increasing. In AdaDelta, instead of summing all past square roots, it uses a sliding window which allows the sum to decrease. RMSprop is very similar to AdaDelta.
- Adam, or adaptive momentum, is an algorithm similar to AdaDelta. But in addition to storing learning rates for each of the parameters, it also stores momentum changes for each of them separately.
I would say that SGD, Momentum and Nesterov are inferior to the last three.
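For reference, all of the methods above exist as ready-made optimizers in TensorFlow 1.x under tf.train. A minimal sketch (the hyperparameter values are placeholders, not recommendations, and loss stands for whatever scalar loss your graph defines):

import tensorflow as tf

sgd      = tf.train.GradientDescentOptimizer(learning_rate=0.01)
momentum = tf.train.MomentumOptimizer(learning_rate=0.01, momentum=0.9)
nesterov = tf.train.MomentumOptimizer(learning_rate=0.01, momentum=0.9,
                                      use_nesterov=True)
adagrad  = tf.train.AdagradOptimizer(learning_rate=0.01)
adadelta = tf.train.AdadeltaOptimizer()
rmsprop  = tf.train.RMSPropOptimizer(learning_rate=0.001)
adam     = tf.train.AdamOptimizer(learning_rate=0.001)

# They are all used the same way, e.g.:
# train_op = adam.minimize(loss)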
The answer above already explains the differences between some popular methods (i.e. optimizers), but I will try to elaborate on them some more.
(Note that our answers disagree about some points, especially regarding ADAGRAD.)
Classical Momentum (CM) vs Nesterov's Accelerated Gradient (NAG)
(Mostly based on section 2 in the paper On the importance of initialization and momentum in deep learning.)
Each step in both CM and NAG is actually composed of two sub-steps:
- A momentum sub-step - This is simply a fraction (typically in the range [0.9, 1)) of the last step.
- A gradient dependent sub-step - This is like the usual step in SGD - it is the product of the learning rate and the vector opposite to the gradient, while the gradient is computed where this sub-step starts from.
CM takes the gradient sub-step first, while NAG takes the momentum sub-step first.
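To make the difference in ordering concrete, here is a small plain-Python sketch (the names grad, theta, v, lr, mu are mine; this is just a toy illustration, not how TensorFlow implements it):

def cm_step(theta, v, grad, lr=0.01, mu=0.9):
    # Classical momentum: evaluate the gradient at the current position,
    # then combine the momentum sub-step and the gradient sub-step.
    v_new = mu * v - lr * grad(theta)
    return theta + v_new, v_new

def nag_step(theta, v, grad, lr=0.01, mu=0.9):
    # Nesterov: first "look ahead" by the momentum sub-step,
    # then evaluate the gradient at that look-ahead position.
    lookahead = theta + mu * v
    v_new = mu * v - lr * grad(lookahead)
    return theta + v_new, v_new

# Toy usage: minimize f(theta) = theta^2, whose gradient is 2 * theta.
grad = lambda th: 2.0 * th
theta, v = 5.0, 0.0
for _ in range(100):
    theta, v = nag_step(theta, v, grad)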
Here is a demonstration from an answer about intuition for CM and NAG:
So NAG seems to be better (at least in the image), but why?
The important thing to note is that it doesn't matter when the momentum sub-step comes - it would be the same either way. Therefore, we might as well behave as if the momentum sub-step has already been taken.
Thus, the question actually is: Assuming that the gradient sub-step is taken after the momentum sub-step, should we calculate the gradient sub-step as if it started in the position before or after taking the momentum sub-step?
"After it" seems like the right answer, as generally, the gradient at some point θ roughly points you in the direction from θ to a minimum (with the relatively right magnitude), while the gradient at some other point is less likely to point you in the direction from θ to a minimum (with the relatively right magnitude).
Here is a demonstration (from the gif below):
- The minimum is where the star is, and the curves are contour lines. (For an explanation about contour lines and why they are perpendicular to the gradient, see videos 1 and 2 by the legendary 3Blue1Brown.)
- The (long) purple arrow is the momentum sub-step.
- The transparent red arrow is the gradient sub-step if it starts before the momentum sub-step.
- The black arrow is the gradient sub-step if it starts after the momentum sub-step.
- CM would end up in the target of the dark red arrow.
- NAG would end up in the target of the black arrow.
Note that this argument for why NAG is better is independent of whether the algorithm is close to a minimum.
In general, both NAG and CM often have the problem of accumulating more momentum than is good for them, so whenever they should change direction, they have an embarrassing "response time". The advantage of NAG over CM that we explained doesn't prevent the problem, but only makes the "response time" of NAG less embarrassing (but embarrassing still).
This "response time" problem is beautifully demonstrated in the gif by Alec Radford:
ADAGRAD
(Mostly based on section 2.2.2 in ADADELTA: An Adaptive Learning Rate Method (the original ADADELTA paper), as I find it much more accessible than Adaptive Subgradient Methods for Online Learning and Stochastic Optimization (the original ADAGRAD paper).)
In SGD, the step is given by - learning_rate * gradient, while learning_rate is a hyperparameter.
ADAGRAD also has a learning_rate hyperparameter, but the actual learning rate for each component of the gradient is calculated individually.
The i-th component of the t-th step is given by:
learning_rate
- --------------------------------------- * gradient_i_t
norm((gradient_i_1, ..., gradient_i_t))
while:
- gradient_i_k is the i-th component of the gradient in the k-th step.
- (gradient_i_1, ..., gradient_i_t) is a vector with t components. It isn't intuitive (at least to me) that constructing such a vector makes sense, but that's what the algorithm does (conceptually).
- norm(vector) is the Euclidean norm (aka l2 norm) of vector, which is our intuitive notion of the length of vector.
- Confusingly, in ADAGRAD (as well as in some other methods) the expression that is multiplied by gradient_i_t (in this case, learning_rate / norm(...)) is often called "the learning rate" (in fact, I called it "the actual learning rate" in the previous paragraph). I guess this is because in SGD the learning_rate hyperparameter and this expression are one and the same.
- In a real implementation, some constant would be added to the denominator, to prevent a division by zero.
E.g. if:
- The i-th component of the gradient in the first step is 1.15
- The i-th component of the gradient in the second step is 1.35
- The i-th component of the gradient in the third step is 0.9
Then the norm of (1.15, 1.35, 0.9) is the length of the yellow line, which is:
sqrt(1.15^2 + 1.35^2 + 0.9^2) = 1.989.
And so the i-th component of the third step is: - learning_rate / 1.989 * 0.9
Note two things about the i-th component of the step:
- It is proportional to learning_rate.
- In the calculations of it, the norm is increasing, and so the learning rate is decreasing.
This means that ADAGRAD is sensitive to the choice of the hyperparameter learning_rate.
In addition, it might be that after some time the steps become so small that ADAGRAD virtually gets stuck.
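Here is a minimal NumPy sketch of that per-component update (the names accum and eps are mine; a real implementation keeps the accumulator as persistent state per parameter, and eps is the small constant mentioned above that prevents division by zero):

import numpy as np

def adagrad_step(theta, accum, grad, learning_rate=0.01, eps=1e-8):
    g = grad(theta)
    # accum is the running sum of squared gradients; dividing by its square
    # root is the component-wise equivalent of dividing by
    # norm((gradient_i_1, ..., gradient_i_t)).
    accum = accum + g ** 2
    step = -learning_rate / (np.sqrt(accum) + eps) * g
    return theta + step, accum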
ADADELTA and RMSProp
From the ADADELTA paper:
The idea presented in this paper was derived from ADAGRAD in order to improve upon the two main drawbacks of the method: 1) the continual decay of learning rates throughout training, and 2) the need for a manually selected global learning rate.
The paper then explains an improvement that is meant to tackle the first drawback:
Instead of accumulating the sum of squared gradients over all time, we restricted the window of past gradients that are accumulated to be some fixed size w [...]. This ensures that learning continues to make progress even after many iterations of updates have been done.
Since storing w previous squared gradients is inefficient, our method implements this accumulation as an exponentially decaying average of the squared gradients.
By "exponentially decaying average of the squared gradients" the paper means that for each i we compute a weighted average of all of the squared i-th components of all of the gradients that were calculated.
The weight of each squared i-th component is bigger than the weight of the squared i-th component in the previous step.
This is an approximation of a window of size w, because the weights in earlier steps are very small.
(When I think about an exponentially decaying average, I like to visualize a comet's trail, which becomes dimmer and dimmer as it gets further from the comet.)
If you make only this change to ADAGRAD, then you will get RMSProp, which is a method that was proposed by Geoff Hinton in Lecture 6e of his Coursera Class.
So in RMSProp, the i-th component of the t-th step is given by:
learning_rate
- ------------------------------------------------ * gradient_i_t
sqrt(exp_decay_avg_of_squared_grads_i + epsilon)
while:
- epsilon is a hyperparameter that prevents a division by zero.
- exp_decay_avg_of_squared_grads_i is an exponentially decaying average of the squared i-th components of all of the gradients calculated (including gradient_i_t).
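A NumPy sketch under the same naming conventions (rho is the decay rate of the average; 0.9 is the value I recall from Hinton's lecture, so treat it as an assumption here):

import numpy as np

def rmsprop_step(theta, avg_sq_grad, grad, learning_rate=0.001,
                 rho=0.9, eps=1e-8):
    g = grad(theta)
    # Exponentially decaying average of the squared gradients.
    avg_sq_grad = rho * avg_sq_grad + (1.0 - rho) * g ** 2
    step = -learning_rate / np.sqrt(avg_sq_grad + eps) * g
    return theta + step, avg_sq_grad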
But as aforementioned, ADADELTA also aims to get rid of the learning_rate hyperparameter, so there must be more stuff going on in it.
In ADADELTA, the i-th component of the t-th step is given by:
sqrt(exp_decay_avg_of_squared_steps_i + epsilon)
- ------------------------------------------------ * gradient_i_t
sqrt(exp_decay_avg_of_squared_grads_i + epsilon)
while exp_decay_avg_of_squared_steps_i is an exponentially decaying average of the squared i-th components of all of the steps calculated (until the t-1-th step).
sqrt(exp_decay_avg_of_squared_steps_i + epsilon) is somewhat similar to momentum, and according to the paper, it "acts as an acceleration term". (The paper also gives another reason for why it was added, but my answer is already too long, so if you are curious, check out section 3.2.)
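A NumPy sketch of the full ADADELTA step (rho and eps play the roles of the paper's decay rate and epsilon; 0.95 and 1e-6 are the values I recall the paper using, so take them as assumptions):

import numpy as np

def adadelta_step(theta, avg_sq_grad, avg_sq_step, grad, rho=0.95, eps=1e-6):
    g = grad(theta)
    avg_sq_grad = rho * avg_sq_grad + (1.0 - rho) * g ** 2
    # The numerator uses the decaying average of the *previous* squared steps,
    # which is what replaces the learning_rate hyperparameter.
    step = -np.sqrt(avg_sq_step + eps) / np.sqrt(avg_sq_grad + eps) * g
    avg_sq_step = rho * avg_sq_step + (1.0 - rho) * step ** 2
    return theta + step, avg_sq_grad, avg_sq_step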
Adam
(Mostly based on Adam: A Method for Stochastic Optimization, the original Adam paper.)
Adam is short for Adaptive Moment Estimation (see this answer for an explanation about the name).
The i-th component of the t-th step is given by:
learning_rate
- ------------------------------------------------ * exp_decay_avg_of_grads_i
sqrt(exp_decay_avg_of_squared_grads_i) + epsilon
while:
- exp_decay_avg_of_grads_i is an exponentially decaying average of the i-th components of all of the gradients calculated (including gradient_i_t).
- Actually, both exp_decay_avg_of_grads_i and exp_decay_avg_of_squared_grads_i are also corrected to account for a bias toward 0 (for more about that, see section 3 in the paper, and also an answer in stats.stackexchange).
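A NumPy sketch with the bias correction included (beta1, beta2 and eps follow the paper's notation and default values; t is the step index, starting at 1):

import numpy as np

def adam_step(theta, m, v, t, grad, learning_rate=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    g = grad(theta)
    m = beta1 * m + (1.0 - beta1) * g        # decaying average of gradients
    v = beta2 * v + (1.0 - beta2) * g ** 2   # decaying average of squared gradients
    m_hat = m / (1.0 - beta1 ** t)           # correct the bias toward 0
    v_hat = v / (1.0 - beta2 ** t)
    step = -learning_rate * m_hat / (np.sqrt(v_hat) + eps)
    return theta + step, m, v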
Note that Adam uses an exponentially decaying average of the i-th components of the gradients where most SGD methods use the i-th component of the current gradient. This causes Adam to behave like "a heavy ball with friction", as explained in the paper GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium.
See this answer for more about how Adam's momentum-like behavior is different from the usual momentum-like behavior.
Let's boil it down to a couple of simple questions:
Which optimizer would give me the best result/accuracy?
There is no silver bullet. Some optimizers will probably work better for your task than others. There is no way to tell beforehand; you have to try a few to find the best one. The good news is that the results of different optimizers tend to be close to each other. You do, however, have to find the best hyperparameters for whichever single optimizer you choose.
Which optimizer should I use right now?
Maybe take AdamOptimizer and run it with learning_rate 0.001 and 0.0001. If you want better results, try running other learning rates as well, or try other optimizers and tune their hyperparameters.
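In TensorFlow 1.x terms, a minimal sketch of that suggestion (the toy variable w and loss are stand-ins for your own model):

import tensorflow as tf

# Stand-in model: replace w and loss with your own graph.
w = tf.Variable(5.0)
loss = tf.square(w)

train_op = tf.train.AdamOptimizer(learning_rate=0.001).minimize(loss)
# Second candidate to try if the first does not converge nicely:
# train_op = tf.train.AdamOptimizer(learning_rate=0.0001).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(100):
        sess.run(train_op)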
The longer story
There are several aspects to consider when choosing an optimizer:
- Ease of use (i.e. how quickly you can find parameters that work for you);
- Convergence speed (as fast as SGD, or faster);
- Memory footprint (typically between 0 and 2x the size of the model);
- Relationship to the other parts of the training process.
Plain SGD is the bare minimum that can work: it simply multiplies the gradient by the learning rate and adds the result to the weights. SGD has a number of beautiful qualities: it has only one hyperparameter, it needs no additional memory, and it has minimal effect on the other parts of training. It also has two drawbacks: it can be too sensitive to the choice of learning rate, and training can take longer than with other methods.
From these drawbacks of plain SGD we can see what the more complicated update rules (optimizers) are for: we sacrifice a part of our memory to achieve faster training and, possibly, to simplify the choice of hyperparameters.
Memory overhead is typically non-significant and can be ignored, unless the model is extremely large, you are training on a GTX760, or you are fighting for ImageNet leadership. Simpler methods like momentum or Nesterov accelerated gradient need an extra 1.0x of the model size or less (the size of the model parameters). Second-order methods (Adam) may need twice as much memory and computation.
Convergence-speed-wise, pretty much anything is better than SGD, and anything beyond that is hard to compare. One note might be that AdamOptimizer is good at starting training almost immediately, without a warm-up.
I consider ease of use to be the most important factor in choosing an optimizer. Different optimizers have different numbers of hyperparameters and different sensitivities to them. I consider Adam the simplest of all the readily available ones. You typically need to check 2-4 values of learning_rate between 0.001 and 0.0001 to figure out whether the model converges nicely. For comparison, for SGD (and momentum) I typically try [0.1, 0.01, ... 10e-5]. Adam has 2 more hyperparameters that rarely have to be changed.
Relationship between the optimizer and other parts of training. Hyperparameter tuning typically involves selecting {learning_rate, weight_decay, batch_size, dropout_rate} simultaneously. All of them are interrelated, and each can be viewed as a form of model regularization. For example, one has to pay close attention if weight_decay or an L2-norm penalty is used, and in that case possibly choose AdamWOptimizer instead of AdamOptimizer.
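To make that last point concrete, here is a toy NumPy sketch (the names are mine) of the difference between an L2 penalty, whose gradient gets rescaled by Adam's adaptive denominator, and decoupled weight decay in the AdamW style, which is applied to the weights directly:

import numpy as np

def adam_update(g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return -lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# L2 regularization: wd * w is added to the gradient, so it is also
# rescaled by the adaptive denominator.
def adam_l2_step(w, grad_loss, m, v, t, wd=0.01):
    step, m, v = adam_update(grad_loss + wd * w, m, v, t)
    return w + step, m, v

# Decoupled weight decay (AdamW style): the decay acts on the weights
# directly and bypasses the adaptive scaling.
def adamw_step(w, grad_loss, m, v, t, wd=0.01, lr=0.001):
    step, m, v = adam_update(grad_loss, m, v, t)
    return w + step - lr * wd * w, m, v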