Gradient Descent vs Stochastic Gradient Descent algorithms

Question

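I tried to train a feedforward neural network on the MNIST handwritten digits dataset (which includes 60K training samples).

On every epoch I iterated over all the training samples, performing backpropagation for each sample. The runtime is, of course, too long.

Is the algorithm I ran called gradient descent?

I read that for large datasets, using stochastic gradient descent can improve the runtime dramatically.

What should I do in order to use stochastic gradient descent? Should I just pick the training samples randomly, performing backpropagation on each randomly picked sample, instead of the epochs I currently use?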

Answer 1

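The new scenario you describe (performing backpropagation on each randomly picked sample) is one common "flavor" of stochastic gradient descent, as described here: https://www.quora.com/Whats-the-difference-between-gradient-descent-and-stochastic-gradient-descent

According to that document, the 3 most common flavors are (your flavor is C):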

A)

randomly shuffle samples in the training set
for one or more epochs, or until approx. cost minimum is reached:
    for training sample i:
        compute gradients and perform weight updates

B)

for one or more epochs, or until approx. cost minimum is reached:
    randomly shuffle samples in the training set
    for training sample i:
        compute gradients and perform weight updates

C)

for iterations t, or until approx. cost minimum is reached:
    draw random sample from the training set
    compute gradients and perform weight updates
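The three flavors differ only in the order in which samples are visited, so they can be sketched as three index generators feeding the same per-sample update. This is a minimal sketch, not code from the linked document: the names `order_a`, `order_b`, `order_c`, and `sgd_fit`, the learning rate, and the toy one-dimensional linear model are all assumptions made for illustration.

```python
import random

def order_a(n, epochs, rng):
    """Flavor A: shuffle once, then reuse the same order in every epoch."""
    idx = list(range(n))
    rng.shuffle(idx)
    for _ in range(epochs):
        yield from idx

def order_b(n, epochs, rng):
    """Flavor B: reshuffle at the start of every epoch."""
    idx = list(range(n))
    for _ in range(epochs):
        rng.shuffle(idx)
        yield from idx

def order_c(n, iterations, rng):
    """Flavor C: draw one sample at random (with replacement) per iteration."""
    for _ in range(iterations):
        yield rng.randrange(n)

def sgd_fit(xs, ys, order, lr=0.05):
    """Fit y = w*x + b with one weight update per visited sample."""
    w, b = 0.0, 0.0
    for i in order:
        err = (w * xs[i] + b) - ys[i]  # derivative of 0.5*err**2 w.r.t. the prediction
        w -= lr * err * xs[i]
        b -= lr * err
    return w, b

# Noiseless toy data generated from y = 2x + 1 (an assumption for the demo).
xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [2.0 * x + 1.0 for x in xs]
rng = random.Random(0)

w_a, b_a = sgd_fit(xs, ys, order_a(len(xs), 400, rng))
w_b, b_b = sgd_fit(xs, ys, order_b(len(xs), 400, rng))
w_c, b_c = sgd_fit(xs, ys, order_c(len(xs), 2000, rng))
```

In flavors A and B every sample is seen exactly once per epoch; in flavor C there are no epochs at all, so some samples may be drawn several times before others are drawn once.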

Answer 2

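I'll try to give you some intuition about the problem...

Initially, updates were made in what you (correctly) call (batch) gradient descent. This ensures that each update of the weights is made in the "right" direction (Fig. 1): the one that minimizes the cost function.

As datasets grew and the computation in each step became more complex, stochastic gradient descent came to be preferred in these cases. Here, the weights are updated as each sample is processed, so subsequent calculations already use the "improved" weights. However, because each update is based on a single sample, individual steps can point away from the direction that minimizes the error function (Fig. 2).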
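In symbols, the two update rules described so far can be written as below. The notation is an assumption made here, not something from the original answer: w is the weight vector, η the learning rate, N the number of training samples, and L_i the loss on sample i. Whether the batch gradient is averaged or summed is a convention; the average is used in this sketch.

```latex
% Batch gradient descent: one update per pass over all N samples
w \leftarrow w - \eta \, \frac{1}{N} \sum_{i=1}^{N} \nabla_w L_i(w)

% Stochastic gradient descent: one update per sample i
w \leftarrow w - \eta \, \nabla_w L_i(w)
```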

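As such, in many situations it is preferable to use mini-batch gradient descent, which combines the best of both worlds: each update of the weights is made using a small batch of the data. This way, the direction of the updates is somewhat corrected compared with the stochastic updates, while the weights are still updated much more frequently than in (original) gradient descent.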
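A minimal sketch of mini-batch gradient descent, assuming a toy one-dimensional linear model in place of a neural network. The function name `minibatch_gd`, the learning rate, and the data are hypothetical choices for illustration. Setting `batch_size=1` gives per-sample stochastic updates, and `batch_size=len(xs)` gives batch gradient descent.

```python
import random

def minibatch_gd(xs, ys, batch_size, epochs, lr=0.2, seed=0):
    """Fit y = w*x + b, averaging the gradient over each mini-batch."""
    rng = random.Random(seed)
    idx = list(range(len(xs)))
    w, b = 0.0, 0.0
    updates = 0
    for _ in range(epochs):
        rng.shuffle(idx)  # new sample order every epoch
        for start in range(0, len(idx), batch_size):
            batch = idx[start:start + batch_size]
            grad_w = grad_b = 0.0
            for i in batch:
                err = (w * xs[i] + b) - ys[i]
                grad_w += err * xs[i]
                grad_b += err
            # One weight update per mini-batch, using the averaged gradient.
            w -= lr * grad_w / len(batch)
            b -= lr * grad_b / len(batch)
            updates += 1
    return w, b, updates

# Noiseless toy data generated from y = 2x + 1 (an assumption for the demo).
xs = [i * 0.25 for i in range(8)]
ys = [2.0 * x + 1.0 for x in xs]

w_full, b_full, updates_full = minibatch_gd(xs, ys, batch_size=len(xs), epochs=500)
w_mini, b_mini, updates_mini = minibatch_gd(xs, ys, batch_size=4, epochs=500)
w_sgd, b_sgd, updates_sgd = minibatch_gd(xs, ys, batch_size=1, epochs=500)
```

For the same number of epochs, the smaller the batch, the more weight updates are performed: here the full-batch run makes one update per epoch, the mini-batch run two, and the per-sample run eight.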

[UPDATE] As requested, I present below the pseudocode for batch gradient descent in binary classification:

error = 0

for sample in data:
    prediction = neural_network.predict(sample)
    sample_error = evaluate_error(prediction, sample["label"]) # may be as simple as 
                                                # abs(prediction - sample["label"])
    error += sample_error

neural_network.backpropagate_and_update(error)

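(In the case of multi-class labeling, error represents an array of the error for each label.)

This code is run for a given number of iterations, or while the error is above a threshold. For stochastic gradient descent, the call to neural_network.backpropagate_and_update() is made inside the for loop, with the sample error as its argument.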
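To make the difference concrete, here is a minimal runnable sketch in which a single logistic neuron stands in for neural_network. Everything in it is an assumption for illustration rather than part of the pseudocode above: the names `train` and `predict`, the learning rate, and the AND-gate toy data. The only thing the `stochastic` flag changes is where the weight update happens: inside the loop over samples, or once after it.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(data, epochs, lr, stochastic):
    """Train a single logistic neuron on (features, label) pairs."""
    n_features = len(data[0][0])
    w = [0.0] * n_features
    b = 0.0
    n_updates = 0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, label in data:
            pred = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            delta = pred - label  # cross-entropy gradient w.r.t. the pre-activation
            if stochastic:
                # Stochastic: update immediately, inside the loop over samples.
                w = [wi - lr * delta * xi for wi, xi in zip(w, x)]
                b -= lr * delta
                n_updates += 1
            else:
                # Batch: only accumulate; the update happens after the loop.
                grad_w = [g + delta * xi for g, xi in zip(grad_w, x)]
                grad_b += delta
        if not stochastic:
            w = [wi - lr * g / len(data) for wi, g in zip(w, grad_w)]
            b -= lr * grad_b / len(data)
            n_updates += 1
    return w, b, n_updates

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

# Toy binary classification task: logical AND (linearly separable).
data = [((0.0, 0.0), 0), ((0.0, 1.0), 0), ((1.0, 0.0), 0), ((1.0, 1.0), 1)]

w_batch, b_batch, n_batch = train(data, epochs=2000, lr=0.5, stochastic=False)
w_sgd, b_sgd, n_sgd = train(data, epochs=2000, lr=0.5, stochastic=True)
```

With the same number of epochs, the batch run performs one update per epoch while the stochastic run performs one update per sample, which is where the faster progress per pass over the data comes from.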
