Does optimizer.apply_gradients do gradient descent?
I found the following code:
# Iterate over the batches of the dataset.
for step, (x_batch_train, y_batch_train) in enumerate(train_dataset):

    # Open a GradientTape to record the operations run
    # during the forward pass, which enables auto-differentiation.
    with tf.GradientTape() as tape:

        # Run the forward pass of the layer.
        # The operations that the layer applies
        # to its inputs are going to be recorded
        # on the GradientTape.
        logits = model(x_batch_train, training=True)  # Logits for this minibatch

        # Compute the loss value for this minibatch.
        loss_value = loss_fn(y_batch_train, logits)

    # Use the gradient tape to automatically retrieve
    # the gradients of the trainable variables with respect to the loss.
    grads = tape.gradient(loss_value, model.trainable_weights)

    # Run one step of gradient descent by updating
    # the value of the variables to minimize the loss.
    optimizer.apply_gradients(zip(grads, model.trainable_weights))
The last part says:
# Use the gradient tape to automatically retrieve
# the gradients of the trainable variables with respect to the loss.
grads = tape.gradient(loss_value, model.trainable_weights)
# Run one step of gradient descent by updating
# the value of the variables to minimize the loss.
optimizer.apply_gradients(zip(grads, model.trainable_weights))
But after looking at the function apply_gradients, I am not sure whether the comment "Run one step of gradient descent by updating" above
optimizer.apply_gradients(zip(grads, model.trainable_weights))
is correct, because it only updates the gradients. And grads = tape.gradient(loss_value, model.trainable_weights) only computes the derivatives of the loss with respect to the trainable variables. But for gradient descent, the gradient is multiplied by the learning rate and the result is subtracted from the value of the loss function. Still, it seems to work, because the loss keeps decreasing. So my question is: does apply_gradients do more than just an update?
The full code is here: https://keras.io/guides/writing_a_training_loop_from_scratch/
.apply_gradients performs an update to the weights, using the gradients. Depending on the optimizer used, it may be gradient descent, i.e.:
w_{t+1} := w_t - lr * g(w_t)
where g = grad(L).
Note that no access to the loss function or anything else is needed; all you need are the gradients (a vector with the same length as your parameter vector).
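As a minimal sketch (not from the original answer; the toy loss, variable w, and learning rate are made up for illustration), you can check that with plain tf.keras.optimizers.SGD, apply_gradients reproduces exactly w_{t+1} = w_t - lr * g(w_t):

import tensorflow as tf

# Toy "loss": L(w) = sum(w^2), so grad(L) = 2 * w. Purely illustrative.
w = tf.Variable([1.0, -2.0, 3.0])
lr = 0.1

with tf.GradientTape() as tape:
    loss = tf.reduce_sum(w ** 2)
g = tape.gradient(loss, [w])[0]          # g(w_t) = 2 * w_t

# Manual gradient-descent step: w_{t+1} = w_t - lr * g(w_t).
manual = w - lr * g

# Same step via the optimizer: apply_gradients only needs (gradient, variable) pairs.
opt = tf.keras.optimizers.SGD(learning_rate=lr)
opt.apply_gradients(zip([g], [w]))        # updates w in place

print(manual.numpy())  # [ 0.8 -1.6  2.4]
print(w.numpy())       # identical for plain SGD (no momentum)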
In general, .apply_gradients can do more than that; for example, if you use Adam it will also accumulate some statistics and use them to rescale the gradients, etc.
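A rough sketch of that difference (again not from the original answer, reusing the same toy variable as above): with Adam, apply_gradients no longer performs the plain w - lr * g step, because the gradient is rescaled by the optimizer's internal running moment estimates.

import tensorflow as tf

w = tf.Variable([1.0, -2.0, 3.0])
lr = 0.1

with tf.GradientTape() as tape:
    loss = tf.reduce_sum(w ** 2)
g = tape.gradient(loss, [w])[0]

plain_sgd = w - lr * g                    # what vanilla gradient descent would produce

opt = tf.keras.optimizers.Adam(learning_rate=lr)
opt.apply_gradients(zip([g], [w]))        # Adam keeps 1st/2nd moment estimates per variable

print(plain_sgd.numpy())  # [ 0.8 -1.6  2.4]
print(w.numpy())          # different: on the very first step Adam's update is roughly lr * sign(g)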