Batch size for stochastic gradient descent is the length of the training data and not 1?

Question

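I am trying to plot the learning results obtained with batch gradient descent, stochastic gradient descent, and mini-batch stochastic gradient descent.

Everywhere I look, I read that batch_size=1 is the same as plain SGD, and that batch_size=len(train_data) is the same as batch gradient descent.

I know that stochastic gradient descent uses a single data sample for each update, while batch gradient descent uses the entire training dataset to compute the gradient of the objective function for each update.

However, when setting batch_size in Keras, the opposite seems to be happening. Take my code as an example, where I have set batch_size equal to the length of my training data: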

import tensorflow as tf
import tensorflow_addons as tfa
from tensorflow import keras

input_size = len(train_dataset.keys())
output_size = 10
hidden_layer_size = 250
n_epochs = 250

weights_initializer = keras.initializers.GlorotUniform()

#A function that trains and validates the model and returns the MSE
def train_val_model(run_dir, hparams):
    model = keras.models.Sequential([
            #Layer to be used as an entry point into a Network
            keras.layers.InputLayer(input_shape=[len(train_dataset.keys())]),
            #Dense layer 1
            keras.layers.Dense(hidden_layer_size, activation='relu', 
                               kernel_initializer = weights_initializer,
                               name='Layer_1'),
            #Dense layer 2
            keras.layers.Dense(hidden_layer_size, activation='relu', 
                               kernel_initializer = weights_initializer,
                               name='Layer_2'),
            #activation function is linear since we are doing regression
            keras.layers.Dense(output_size, activation='linear', name='Output_layer')
                                ])
    
    #Use the stochastic gradient descent optimizer but change batch_size to get batch GD, SGD or mini-batch SGD
    optimizer = tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.0,
                                        nesterov=False)
    
    #Compiling the model
    model.compile(optimizer=optimizer, 
                  loss='mean_squared_error', #Computes the mean of squares of errors between labels and predictions
                  metrics=['mean_squared_error']) #Computes the mean squared error between y_true and y_pred
    
    # initialize TimeStopping callback 
    time_stopping_callback = tfa.callbacks.TimeStopping(seconds=5*60, verbose=1)
    
    #Training the network
    history = model.fit(normed_train_data, train_labels, 
         epochs=n_epochs,
         batch_size=hparams['batch_size'], 
         verbose=1,
         #validation_split=0.2,
         callbacks=[tf.keras.callbacks.TensorBoard(run_dir + "/Keras"), time_stopping_callback])
    
    return history

train_val_model("logs/sample", {'batch_size': len(normed_train_data)})

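When I run this, the output seems to show a single update per epoch, i.e. SGD:

As can be seen below each epoch, the output says 1/1, which I assume means a single update iteration. On the other hand, if I set batch_size=1, I get 90000/90000, which is the size of my whole dataset (this also makes sense in terms of training time).

So, my question is: is batch_size=1 actually batch gradient descent rather than stochastic gradient descent, and is batch_size=len(train_data) actually stochastic gradient descent rather than batch gradient descent?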

Answer 1

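batch_size is the number of samples used for each update.

Here, batch_size=1 means that each update uses 1 sample. By your definition, this is SGD.

If you have batch_size=len(train_data), every weight update requires the gradient computed over your entire dataset. This is really just good old gradient descent.

Mini-batch gradient descent sits somewhere in the middle, where batch_size is neither 1 nor the size of your entire training dataset. Take 32 as an example. Mini-batch gradient descent updates the weights once every 32 examples, so it smooths out the noisiness of SGD with just 1 example (where outliers can have a large impact) while keeping the advantages that SGD has over regular gradient descent.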
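To make the role of batch_size concrete, here is a minimal pure-Python sketch of the three regimes. It is not how Keras implements training; the function name train_linear, the toy data, and the learning rate are all assumptions made for illustration. The only thing that changes between SGD, mini-batch, and full gradient descent is how many samples go into each update.

```python
import random

def train_linear(xs, ys, batch_size, epochs=1, lr=0.01, seed=0):
    """Fit y ~ w*x + b by gradient descent on MSE (hypothetical helper).

    Returns (w, b, n_updates) so the number of weight updates can be inspected.
    """
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    n_updates = 0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the samples in a new order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the squared error, averaged over the current batch only
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
            n_updates += 1  # one weight update per batch
    return w, b, n_updates

# Toy data (assumed): 100 points on the line y = 3x + 1
xs = [i / 100 for i in range(100)]
ys = [3.0 * x + 1.0 for x in xs]

_, _, sgd_updates = train_linear(xs, ys, batch_size=1)         # SGD
_, _, mini_updates = train_linear(xs, ys, batch_size=32)       # mini-batch
_, _, gd_updates = train_linear(xs, ys, batch_size=len(xs))    # full-batch GD
print(sgd_updates, mini_updates, gd_updates)  # 100 4 1
```

With 100 samples and one epoch, batch_size=1 gives 100 updates, batch_size=32 gives 4 (the last batch holds the remaining 4 samples), and batch_size=100 gives a single update.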

Answer 2

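There are actually three (3) cases:

batch_size = 1 is indeed stochastic gradient descent (SGD)
A batch_size equal to the whole of the training data is (batch) gradient descent (GD)
Intermediate cases (which are the ones actually used in practice) are usually referred to as mini-batch gradient descent

For more details and references, see A Gentle Introduction to Mini-Batch Gradient Descent and How to Configure Batch Size. In fact, in practice, when we say "SGD" we usually mean "mini-batch SGD".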

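These definitions are in fact fully consistent with what you report from your experiments:

With batch_size=len(train_data) (the GD case), only one update is indeed expected per epoch (since there is only one batch), hence the 1/1 indication in the Keras output.
In contrast, with batch_size = 1 (the SGD case), you expect as many updates as there are samples in your training data (since that is now the number of batches), i.e. 90000, hence the 90000/90000 indication in the Keras output.

That is, the number of updates per epoch (which is what Keras indicates) equals the number of batches used (and not the batch size).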
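The arithmetic behind the progress counter can be sketched as follows. The helper updates_per_epoch is a hypothetical name, and the sketch assumes that a final partial batch still counts as one update, which is how Keras behaves by default for in-memory array inputs as far as I know.

```python
import math

def updates_per_epoch(n_samples, batch_size):
    """Number of batches (and hence weight updates) in one pass over the data."""
    # Assumption: a trailing partial batch is kept rather than dropped
    return math.ceil(n_samples / batch_size)

n = 90000  # training set size reported in the question
print(updates_per_epoch(n, n))    # 1      -> shown as 1/1 (full-batch GD)
print(updates_per_epoch(n, 1))    # 90000  -> shown as 90000/90000 (SGD)
print(updates_per_epoch(n, 32))   # 2813   -> a typical mini-batch setting
```

The counter therefore grows as the batch size shrinks, which is why a large number in the progress bar indicates small batches and not large ones.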


python

machine-learning

neural-network

gradient-descent

keras