如何在给定小批量的情况下计算每一层的最大梯度

How to calculate maximum gradient for each layer given a mini-batch

我尝试使用 MNIST 数据集实现一个完全连接的分类模型。部分代码如下:

n = 5
act_func = 'relu'

classifier = tf.keras.models.Sequential()
classifier.add(layers.Flatten(input_shape = (28, 28, 1)))
for i in range(n):
  classifier.add(layers.Dense(32, activation=act_func))
classifier.add(layers.Dense(10, activation='softmax'))
opt = tf.keras.optimizers.SGD(learning_rate=0.01)
classifier.compile(optimizer=opt,loss="categorical_crossentropy",metrics ="accuracy")

classifier.summary()

history = classifier.fit(x_train, y_train, batch_size=32, epochs=3, validation_data=(x_test,y_test))

有没有办法为给定的小批量打印每层的最大梯度?

您可以使用 tf.GradientTape:

从自定义训练循环开始
import tensorflow as tf
import tensorflow_datasets as tfds 

(ds_train, ds_test), ds_info = tfds.load(
    'mnist',
    split=['train', 'test'],
    shuffle_files=True,
    as_supervised=True,
    with_info=True,
)
n = 5
act_func = 'relu'

classifier = tf.keras.models.Sequential()
classifier.add(tf.keras.layers.Flatten(input_shape = (28, 28, 1)))
for i in range(n):
  classifier.add(tf.keras.layers.Dense(32, activation=act_func))
classifier.add(tf.keras.layers.Dense(10, activation='softmax'))
opt = tf.keras.optimizers.SGD(learning_rate=0.01)
loss = tf.keras.losses.CategoricalCrossentropy()

classifier.summary()

epochs = 1
for epoch in range(epochs):
    print("\nStart of epoch %d" % (epoch,))
    for step, (x_batch_train, y_batch_train) in enumerate(ds_train.take(50).batch(10)):
        x_batch_train = tf.cast(x_batch_train, dtype=tf.float32)
        y_batch_train = tf.keras.utils.to_categorical(y_batch_train, 10)

        with tf.GradientTape() as tape:
          logits = classifier(x_batch_train, training=True)
          loss_value = loss(y_batch_train, logits)

        grads = tape.gradient(loss_value, classifier.trainable_weights)
        opt.apply_gradients(zip(grads, classifier.trainable_weights)) 

        with tf.GradientTape(persistent=True) as tape:
          tape.watch(x_batch_train)
          x = classifier.layers[0](x_batch_train)
          outputs = []
          for layer in classifier.layers[1:]:
              x = layer(x)
              outputs.append(x)

        for idx, output in enumerate(outputs):
           grad = tf.math.abs(tape.gradient(output, x_batch_train))
           print('Max gradient for layer {} is {}'.format(idx + 1, tf.reduce_max(grad)))
        print('End of batch {}'.format(step + 1))
Model: "sequential_9"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 flatten_9 (Flatten)         (None, 784)               0         
                                                                 
 dense_54 (Dense)            (None, 32)                25120     
                                                                 
 dense_55 (Dense)            (None, 32)                1056      
                                                                 
 dense_56 (Dense)            (None, 32)                1056      
                                                                 
 dense_57 (Dense)            (None, 32)                1056      
                                                                 
 dense_58 (Dense)            (None, 32)                1056      
                                                                 
 dense_59 (Dense)            (None, 10)                330       
                                                                 
=================================================================
Total params: 29,674
Trainable params: 29,674
Non-trainable params: 0
_________________________________________________________________

Start of epoch 0
Max gradient for layer 1 is 0.7913536429405212
Max gradient for layer 2 is 0.8477020859718323
Max gradient for layer 3 is 0.7188305854797363
Max gradient for layer 4 is 0.5108454823493958
Max gradient for layer 5 is 0.3362882435321808
Max gradient for layer 6 is 1.9748875867975357e-09
End of batch 1
Max gradient for layer 1 is 0.7535678148269653
Max gradient for layer 2 is 0.6814548373222351
Max gradient for layer 3 is 0.5748667120933533
Max gradient for layer 4 is 0.5439972877502441
Max gradient for layer 5 is 0.27793681621551514
Max gradient for layer 6 is 1.9541412932255753e-09
End of batch 2
Max gradient for layer 1 is 0.8606255650520325
Max gradient for layer 2 is 0.8506941795349121
Max gradient for layer 3 is 0.8556670546531677
Max gradient for layer 4 is 0.43756356835365295
Max gradient for layer 5 is 0.2675274908542633
Max gradient for layer 6 is 3.7072431791074223e-09
End of batch 3
Max gradient for layer 1 is 0.7640039324760437
Max gradient for layer 2 is 0.6926062107086182
Max gradient for layer 3 is 0.6164448857307434
Max gradient for layer 4 is 0.43013691902160645
Max gradient for layer 5 is 0.32356566190719604
Max gradient for layer 6 is 3.2926392723453546e-09
End of batch 4
Max gradient for layer 1 is 0.7604862451553345
Max gradient for layer 2 is 0.6908300518989563
Max gradient for layer 3 is 0.6122230887413025
Max gradient for layer 4 is 0.39982378482818604
Max gradient for layer 5 is 0.3172021210193634
Max gradient for layer 6 is 2.3238742041797877e-09
End of batch 5

您可以编写一个自定义循环来代替调用 compilefit

optimizer=tf.keras.optimizers.Adam(0.001)
loss=tf.keras.losses.SparseCategoricalCrossentropy()

# Iterate over the batches of a dataset.
for inputs, targets in zip(x_train, y_train):
    # Open a GradientTape.
    with tf.GradientTape() as tape:
        # Forward pass.
        predictions = model(inputs)
        # Compute the loss value for this batch.
        loss_value = loss(targets, predictions)
    
    # Get gradients of loss wrt the weights.
    gradients = tape.gradient(loss_value, model.trainable_weights)
    grads_and_vars = zip(gradients, model.trainable_weights)
    # Update the weights of the model.
    optimizer.apply_gradients(grads_and_vars)

然后您可以遍历图层并检查

for layer in range(0, 4): # for 4 layers
    print('max gradient of layer={}, kernel={}, bias={}'.format(
        layer, gradients[layer].numpy().max(), gradients[layer*2+1].numpy().max()))