Compute gradients across two models
Suppose we are building a basic CNN that recognizes images of cats and dogs (a binary classifier).
An example of such a CNN is the following:
model = Sequential([
    Conv2D(32, (3,3), input_shape=...),
    Activation('relu'),
    MaxPooling2D(pool_size=(2,2)),
    Conv2D(32, (3,3)),
    Activation('relu'),
    MaxPooling2D(pool_size=(2,2)),
    Conv2D(64, (3,3)),
    Activation('relu'),
    MaxPooling2D(pool_size=(2,2)),
    Flatten(),
    Dense(64),
    Activation('relu'),
    Dropout(0.5),
    Dense(1),
    Activation('sigmoid')
])
Suppose also that we want to split the model into two parts, or two models, called model_0 and model_1: model_0 will process the input, and model_1 will take the output of model_0 as its input.
For example, the previous model would become:
model_0 = Sequential([
    Conv2D(32, (3,3), input_shape=...),
    Activation('relu'),
    MaxPooling2D(pool_size=(2,2)),
    Conv2D(32, (3,3)),
    Activation('relu'),
    MaxPooling2D(pool_size=(2,2)),
    Conv2D(64, (3,3)),
    Activation('relu'),
    MaxPooling2D(pool_size=(2,2))
])
model_1 = Sequential([
    Flatten(),
    Dense(64),
    Activation('relu'),
    Dropout(0.5),
    Dense(1),
    Activation('sigmoid')
])
How can I train these two models as if they were a single model? I have tried setting the gradients manually, but I do not understand how to pass the gradients from model_1 back to model_0:
for epoch in range(epochs):
    for step, (x_batch, y_batch) in enumerate(train_generator):
        # model 0
        with tf.GradientTape() as tape_0:
            y_pred_0 = model_0(x_batch, training=True)
        # model 1
        with tf.GradientTape() as tape_1:
            y_pred_1 = model_1(y_pred_0, training=True)
            loss_value = loss_fn(y_batch_tensor, y_pred_1)
        grads_1 = tape_1.gradient(y_pred_1, model_1.trainable_weights)
        grads_0 = tape_0.gradient(y_pred_0, model_0.trainable_weights)
        optimizer.apply_gradients(zip(grads_1, model_1.trainable_weights))
        optimizer.apply_gradients(zip(grads_0, model_0.trainable_weights))
This approach of course does not work: I am basically training two models separately and then binding them together, which is not what I want to achieve.
Here is a Google Colab notebook for a simpler version of this problem, using only two fully connected layers and two activation functions: https://colab.research.google.com/drive/14Px1rJtiupnB6NwtvbgeVYw56N1xM6JU#scrollTo=PeqtJJWS3wyG
Note that I am aware of Sequential([model_0, model_1]), but that is not what I want to achieve here: I want to perform the backpropagation step manually.
Also, I would like to keep using two separate tapes. The trick here is to use grads_1 to compute grads_0.
Any clues?
I think model_final = Sequential([model_0, model_1]) would do the trick.
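For completeness, a minimal sketch of what that composition looks like (tiny stand-in models of my own, not the CNN from the question): the nested Sequential exposes the weights of both sub-models, so a single training step updates both.

```python
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Tiny stand-ins for model_0 / model_1, just to show the composition.
model_0 = Sequential([Dense(4, activation="relu")])
model_1 = Sequential([Dense(1, activation="sigmoid")])

model_final = Sequential([model_0, model_1])
model_final.compile(optimizer="sgd", loss="binary_crossentropy")

# A forward pass builds all the nested layers.
out = model_final(tf.zeros((1, 8)))

# model_final holds the weights of both sub-models: kernel + bias of each Dense.
print(len(model_final.trainable_weights))  # 4
```

Because model_final owns all four weight tensors, one call to fit() backpropagates through both sub-models at once.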
tf.GradientTape() can take an argument persistent, which controls whether a persistent gradient tape is created. It defaults to False, which means at most one call to the gradient() method can be made on the tape.
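As a minimal illustration of persistent=True (my own toy example, not part of the original answer), a persistent tape allows multiple gradient() calls on the same tape:

```python
import tensorflow as tf

x = tf.constant(3.0)
with tf.GradientTape(persistent=True) as tape:
    tape.watch(x)  # constants are not watched automatically
    y = x * x      # y = x^2
    z = y * y      # z = x^4

# With persistent=False the second gradient() call would raise a RuntimeError.
dy_dx = tape.gradient(y, x)  # 2x   -> 6.0
dz_dx = tape.gradient(z, x)  # 4x^3 -> 108.0
del tape  # release the tape's resources once done
print(dy_dx.numpy(), dz_dx.numpy())
```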
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Activation, Dense, Layer, Flatten
import numpy as np
tf.random.set_seed(0)
# 5 batches, 2x2 images, 1 channel
x = tf.random.uniform((5, 2, 2, 1))
layer_0 = Sequential([Dense(2), Activation("relu")])
layer_1 = Sequential([Dense(2), Activation("relu")])
layer_2 = Sequential([Flatten(), Dense(1), Activation("sigmoid")])
loss_fn = tf.keras.losses.BinaryCrossentropy(from_logits=False)
optimizer = tf.keras.optimizers.SGD()
y = np.asarray([0, 0, 0, 0, 0]).astype('float32').reshape((-1, 1))
# print("x:", x)
print("x.shape:", x.shape)
with tf.GradientTape() as tape_0, tf.GradientTape(persistent=True) as tape_1, tf.GradientTape(persistent=True) as tape_2:
    out_layer_0 = layer_0(x, training=True)
    out_layer_1 = layer_1(out_layer_0, training=True)
    out_layer_2 = layer_2(out_layer_1, training=True)
    loss = loss_fn(y, out_layer_2)
grads_0 = tape_0.gradient(loss, layer_0.trainable_weights)
grads_1 = tape_1.gradient(loss, layer_1.trainable_weights)
grads_2 = tape_2.gradient(loss, layer_2.trainable_weights)
Could you check whether this meets your needs?
After asking for help and getting a better understanding of the dynamics of automatic differentiation (or autodiff), I managed to get a working, simpler example of what I wanted to achieve. Although this approach does not fully solve the problem, it puts us a step forward in understanding how to tackle the problem at hand.
Reference model
I have reduced the model to something smaller:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Activation, Dense, Layer, Flatten, Conv2D
import numpy as np
tf.random.set_seed(0)
# 3 batches, 10x10 images, 1 channel
x = tf.random.uniform((3, 10, 10, 1))
y = tf.cast(tf.random.uniform((3, 1)) > 0.5, tf.float32)
layer_0 = Sequential([Conv2D(filters=6, kernel_size=2, activation="relu")])
layer_1 = Sequential([Conv2D(filters=6, kernel_size=2, activation="relu")])
layer_2 = Sequential([Flatten(), Dense(1), Activation("sigmoid")])
loss_fn = tf.keras.losses.MeanSquaredError()
We split it into three parts: layer_0, layer_1, layer_2. The vanilla approach is just to put everything together and compute the gradients one by one (or in a single step):
with tf.GradientTape(persistent=True) as tape:
    out_layer_0 = layer_0(x)
    out_layer_1 = layer_1(out_layer_0)
    out_layer_2 = layer_2(out_layer_1)
    loss = loss_fn(y, out_layer_2)
The different gradients can then be computed simply by calling tape.gradient:
ref_conv_dLoss_dWeights2 = tape.gradient(loss, layer_2.trainable_weights)
ref_conv_dLoss_dWeights1 = tape.gradient(loss, layer_1.trainable_weights)
ref_conv_dLoss_dWeights0 = tape.gradient(loss, layer_0.trainable_weights)
ref_conv_dLoss_dY = tape.gradient(loss, out_layer_2)
ref_conv_dLoss_dOut1 = tape.gradient(loss, out_layer_1)
ref_conv_dOut2_dOut1 = tape.gradient(out_layer_2, out_layer_1)
ref_conv_dLoss_dOut0 = tape.gradient(loss, out_layer_0)
ref_conv_dOut1_dOut0 = tape.gradient(out_layer_1, out_layer_0)
ref_conv_dOut0_dWeights0 = tape.gradient(out_layer_0, layer_0.trainable_weights)
ref_conv_dOut1_dWeights1 = tape.gradient(out_layer_1, layer_1.trainable_weights)
ref_conv_dOut2_dWeights2 = tape.gradient(out_layer_2, layer_2.trainable_weights)
We will use these values later to check the correctness of our approach.
Splitting the model with manual autodiff
By splitting, we mean that each layer_x needs to have its own GradientTape, responsible for generating its own gradients:
with tf.GradientTape(persistent=True) as tape_0:
    out_layer_0 = model.layers[0](x)

with tf.GradientTape(persistent=True) as tape_1:
    tape_1.watch(out_layer_0)
    out_layer_1 = model.layers[1](out_layer_0)

with tf.GradientTape(persistent=True) as tape_2:
    tape_2.watch(out_layer_1)
    out_flatten = model.layers[2](out_layer_1)
    out_layer_2 = model.layers[3](out_flatten)
    loss = loss_fn(y, out_layer_2)
Now, simply using tape_n.gradient for every step will not work: we basically lose a lot of information that we cannot recover afterwards.
Instead, we have to use tape.jacobian and tape.batch_jacobian, except for the gradient of the loss, since there we only have a single value as the source.
dOut0_dWeights0 = tape_0.jacobian(out_layer_0, model.layers[0].trainable_weights)
dOut1_dOut0 = tape_1.batch_jacobian(out_layer_1, out_layer_0)
dOut1_dWeights1 = tape_1.jacobian(out_layer_1, model.layers[1].trainable_weights)
dOut2_dOut1 = tape_2.batch_jacobian(out_layer_2, out_layer_1)
dOut2_dWeights2 = tape_2.jacobian(out_layer_2, model.layers[3].trainable_weights)
dLoss_dOut2 = tape_2.gradient(loss, out_layer_2) # or dL/dY
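The shape conventions here matter. A minimal example (my own, with made-up shapes) of how jacobian and batch_jacobian differ:

```python
import tensorflow as tf

x = tf.random.uniform((3, 2))  # batch of 3, feature dimension 2
with tf.GradientTape(persistent=True) as tape:
    tape.watch(x)
    y = x * x  # elementwise, so the batch examples are independent

# Full Jacobian: output shape followed by input shape -> (3, 2, 3, 2)
full = tape.jacobian(y, x)
# Per-example Jacobian: the batch axis is kept only once -> (3, 2, 2)
per_example = tape.batch_jacobian(y, x)
print(full.shape, per_example.shape)
```

This is why the code above uses batch_jacobian between layer outputs (examples do not mix across the batch) but the full jacobian with respect to weights, which are shared by all examples.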
We will use a couple of utility functions to shape the results the way we want:
def add_missing_axes(source_tensor, target_tensor):
    len_missing_axes = len(target_tensor.shape) - len(source_tensor.shape)
    # note: the number of tf.newaxis is determined by the number of axes missing
    # to reach the same rank as the target tensor
    assert len_missing_axes >= 0
    # convenience renaming
    source_tensor_extended = source_tensor
    # add every missing axis
    for _ in range(len_missing_axes):
        source_tensor_extended = source_tensor_extended[..., tf.newaxis]
    return source_tensor_extended

def upstream_gradient_loss_weights(dOutUpstream_dWeightsLocal, dLoss_dOutUpstream):
    dLoss_dOutUpstream_extended = add_missing_axes(dLoss_dOutUpstream, dOutUpstream_dWeightsLocal)
    # reduce over the leading axes
    len_reduce = range(len(dLoss_dOutUpstream.shape))
    return tf.reduce_sum(dOutUpstream_dWeightsLocal * dLoss_dOutUpstream_extended, axis=len_reduce)

def upstream_gradient_loss_out(dOutUpstream_dOutLocal, dLoss_dOutUpstream):
    dLoss_dOutUpstream_extended = add_missing_axes(dLoss_dOutUpstream, dOutUpstream_dOutLocal)
    len_reduce = range(len(dLoss_dOutUpstream.shape))[1:]
    return tf.reduce_sum(dOutUpstream_dOutLocal * dLoss_dOutUpstream_extended, axis=len_reduce)
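The broadcast-and-reduce in these helpers is equivalent to a tensordot contraction over the upstream-output axes. A quick NumPy sanity check with made-up shapes (my own, not from the original post):

```python
import numpy as np

rng = np.random.default_rng(0)
# Jacobian dOut/dW: upstream-output axes (batch, out) first, then weight axes (rows, cols).
dOut_dW = rng.normal(size=(3, 4, 5, 6))
dLoss_dOut = rng.normal(size=(3, 4))  # gradient of the loss w.r.t. the upstream output

# Helper-style: broadcast dLoss_dOut over the weight axes, then sum out (batch, out).
dLoss_dW = np.sum(dOut_dW * dLoss_dOut[..., None, None], axis=(0, 1))

# The same contraction expressed directly with tensordot.
ref = np.tensordot(dLoss_dOut, dOut_dW, axes=([0, 1], [0, 1]))
print(np.allclose(dLoss_dW, ref), dLoss_dW.shape)  # True (5, 6)
```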
Finally, we can apply the chain rule:
dOut2_dOut1 = tape_2.batch_jacobian(out_layer_2, out_layer_1)
dOut2_dWeights2 = tape_2.jacobian(out_layer_2, model.layers[3].trainable_weights)
dLoss_dOut2 = tape_2.gradient(loss, out_layer_2) # or dL/dY
dLoss_dWeights2 = upstream_gradient_loss_weights(dOut2_dWeights2[0], dLoss_dOut2)
dLoss_dBias2 = upstream_gradient_loss_weights(dOut2_dWeights2[1], dLoss_dOut2)
dLoss_dOut1 = upstream_gradient_loss_out(dOut2_dOut1, dLoss_dOut2)
dLoss_dWeights1 = upstream_gradient_loss_weights(dOut1_dWeights1[0], dLoss_dOut1)
dLoss_dBias1 = upstream_gradient_loss_weights(dOut1_dWeights1[1], dLoss_dOut1)
dLoss_dOut0 = upstream_gradient_loss_out(dOut1_dOut0, dLoss_dOut1)
dLoss_dWeights0 = upstream_gradient_loss_weights(dOut0_dWeights0[0], dLoss_dOut0)
dLoss_dBias0 = upstream_gradient_loss_weights(dOut0_dWeights0[1], dLoss_dOut0)
print("dLoss_dWeights2 valid:", tf.experimental.numpy.allclose(ref_conv_dLoss_dWeights2[0], dLoss_dWeights2).numpy())
print("dLoss_dBias2 valid:", tf.experimental.numpy.allclose(ref_conv_dLoss_dWeights2[1], dLoss_dBias2).numpy())
print("dLoss_dWeights1 valid:", tf.experimental.numpy.allclose(ref_conv_dLoss_dWeights1[0], dLoss_dWeights1).numpy())
print("dLoss_dBias1 valid:", tf.experimental.numpy.allclose(ref_conv_dLoss_dWeights1[1], dLoss_dBias1).numpy())
print("dLoss_dWeights0 valid:", tf.experimental.numpy.allclose(ref_conv_dLoss_dWeights0[0], dLoss_dWeights0).numpy())
print("dLoss_dBias0 valid:", tf.experimental.numpy.allclose(ref_conv_dLoss_dWeights0[1], dLoss_dBias0).numpy())
The output will be:
dLoss_dWeights2 valid: True
dLoss_dBias2 valid: True
dLoss_dWeights1 valid: True
dLoss_dBias1 valid: True
dLoss_dWeights0 valid: True
dLoss_dBias0 valid: True
since all the values are close to each other. Note that with the Jacobian approach there is some degree of error/approximation, around 10^-7, but I think it is good enough.
Pitfalls
For extremely small or toy models this works perfectly fine. However, in real scenarios you will have big images with lots of dimensions. That is not ideal when dealing with Jacobian matrices, which can quickly reach very high dimensionality. But that is a problem of its own.
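Back-of-the-envelope arithmetic makes this concrete (feature-map sizes invented for illustration): a batch Jacobian between two conv outputs has shape (batch, H1, W1, C1, H0, W0, C0), which for realistic image sizes is far beyond any memory budget.

```python
# Hypothetical feature-map sizes for a mid-sized CNN.
batch = 32
h1, w1, c1 = 112, 112, 64   # downstream feature map
h0, w0, c0 = 224, 224, 32   # upstream feature map

# One batch Jacobian between the two feature maps, stored as float32.
elements = batch * h1 * w1 * c1 * h0 * w0 * c0
tib_f32 = elements * 4 / 2**40  # bytes -> TiB
print(f"{elements:.2e} elements, ~{tib_f32:.0f} TiB")  # ~150 TiB for a single Jacobian
```

This is why standard reverse-mode autodiff propagates vector-Jacobian products instead of ever materializing the Jacobians themselves.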
You can read more about the topic in the following resources: