Does optimizer.apply_gradients do gradient descent?
I found the following code:
# Iterate over the batches of the dataset.
for step, (x_batch_train, y_batch_train) in enumerate(train_dataset):

    # Open a GradientTape to record the operations run
    # during the forward pass, which enables auto-differentiation.
    with tf.GradientTape() as tape:

        # Run the forward pass of the layer.
        # The operations that the layer applies
        # to its inputs are going to be recorded
        # on the GradientTape.
        logits = model(x_batch_train, training=True)  # Logits for this minibatch

        # Compute the loss value for this minibatch.
        loss_value = loss_fn(y_batch_train, logits)

    # Use the gradient tape to automatically retrieve
    # the gradients of the trainable variables with respect to the loss.
    grads = tape.gradient(loss_value, model.trainable_weights)

    # Run one step of gradient descent by updating
    # the value of the variables to minimize the loss.
    optimizer.apply_gradients(zip(grads, model.trainable_weights))
The last part says:
# Use the gradient tape to automatically retrieve
# the gradients of the trainable variables with respect to the loss.
grads = tape.gradient(loss_value, model.trainable_weights)
# Run one step of gradient descent by updating
# the value of the variables to minimize the loss.
optimizer.apply_gradients(zip(grads, model.trainable_weights))
But after looking at the function apply_gradients, I am not sure whether the comment "Run one step of gradient descent by updating" above
optimizer.apply_gradients(zip(grads, model.trainable_weights))
is correct, because it only updates the gradients. And grads = tape.gradient(loss_value, model.trainable_weights) only computes the derivatives of the loss with respect to the trainable variables. But for gradient descent, the gradient is multiplied by the learning rate and the result is subtracted from the value of the loss function. Still, it seems to work, because the loss keeps decreasing. So my question is: does apply_gradients do more than just an update?
The full code is here: https://keras.io/guides/writing_a_training_loop_from_scratch/
.apply_gradients performs an update to the weights, using the gradients. Depending on the optimizer used, it may be gradient descent, i.e.:
w_{t+1} := w_t - lr * g(w_t)
where g = grad(L).
Note that no access to the loss function or anything else is needed; all you need are the gradients (a vector with the same length as your parameter vector).
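As a minimal sketch (not from the original answer; the toy loss, variable w, and learning rate are made up for illustration), you can check that with plain tf.keras.optimizers.SGD, apply_gradients reproduces exactly w_{t+1} = w_t - lr * g(w_t):

import tensorflow as tf

# Toy "loss": L(w) = sum(w^2), so grad(L) = 2 * w. Purely illustrative.
w = tf.Variable([1.0, -2.0, 3.0])
lr = 0.1

with tf.GradientTape() as tape:
    loss = tf.reduce_sum(w ** 2)
g = tape.gradient(loss, [w])[0]          # g(w_t) = 2 * w_t

# Manual gradient-descent step: w_{t+1} = w_t - lr * g(w_t).
manual = w - lr * g

# Same step via the optimizer: apply_gradients only needs (gradient, variable) pairs.
opt = tf.keras.optimizers.SGD(learning_rate=lr)
opt.apply_gradients(zip([g], [w]))        # updates w in place

print(manual.numpy())  # [ 0.8 -1.6  2.4]
print(w.numpy())       # identical for plain SGD (no momentum)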
In general, .apply_gradients can do more than that; for example, if you use Adam it will also accumulate some statistics and use them to rescale the gradients, etc.
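A rough sketch of that difference (again not from the original answer, reusing the same toy variable as above): with Adam, apply_gradients no longer performs the plain w - lr * g step, because the gradient is rescaled by the optimizer's internal running moment estimates.

import tensorflow as tf

w = tf.Variable([1.0, -2.0, 3.0])
lr = 0.1

with tf.GradientTape() as tape:
    loss = tf.reduce_sum(w ** 2)
g = tape.gradient(loss, [w])[0]

plain_sgd = w - lr * g                    # what vanilla gradient descent would produce

opt = tf.keras.optimizers.Adam(learning_rate=lr)
opt.apply_gradients(zip([g], [w]))        # Adam keeps 1st/2nd moment estimates per variable

print(plain_sgd.numpy())  # [ 0.8 -1.6  2.4]
print(w.numpy())          # different: on the very first step Adam's update is roughly lr * sign(g)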