How to ensure that TensorFlow saves the average loss across the whole dataset in a training loop written from scratch

I am training my conditional GAN network with code from a TensorFlow tutorial, which uses a training loop written from scratch:

def fit(train_ds, epochs, test_ds):
  for epoch in range(epochs):
    start = time.time()

    display.clear_output(wait=True)

    for example_input, example_target in test_ds.take(1):
      generate_images(generator, example_input, example_target)
    print("Epoch: ", epoch)

    # Train
    for n, (input_image, target) in train_ds.enumerate():
      print('.', end='')
      if (n+1) % 100 == 0:
        print()
      train_step(input_image, target, epoch)
    print()

    # saving (checkpoint) the model every 20 epochs
    if (epoch + 1) % 20 == 0:
      checkpoint.save(file_prefix=checkpoint_prefix)

    print('Time taken for epoch {} is {} sec\n'.format(epoch + 1,
                                                       time.time() - start))
  checkpoint.save(file_prefix=checkpoint_prefix)

The training step is defined like this:

@tf.function
def train_step(input_image, target, epoch):
  with tf.GradientTape() as gen_tape, tf.GradientTape() as disc_tape:
    gen_output = generator(input_image, training=True)

    disc_real_output = discriminator([input_image, target], training=True)
    disc_generated_output = discriminator([input_image, gen_output], training=True)

    gen_total_loss, gen_gan_loss, gen_l1_loss = generator_loss(disc_generated_output, gen_output, target)
    disc_loss = discriminator_loss(disc_real_output, disc_generated_output)

  generator_gradients = gen_tape.gradient(gen_total_loss,
                                          generator.trainable_variables)
  discriminator_gradients = disc_tape.gradient(disc_loss,
                                               discriminator.trainable_variables)

  generator_optimizer.apply_gradients(zip(generator_gradients,
                                          generator.trainable_variables))
  discriminator_optimizer.apply_gradients(zip(discriminator_gradients,
                                              discriminator.trainable_variables))

  with summary_writer.as_default():
    tf.summary.scalar('gen_total_loss', gen_total_loss, step=epoch)
    tf.summary.scalar('gen_gan_loss', gen_gan_loss, step=epoch)
    tf.summary.scalar('gen_l1_loss', gen_l1_loss, step=epoch)
    tf.summary.scalar('disc_loss', disc_loss, step=epoch)

Now my question is about the summary writer: does it save only the loss of each batch, or the average over the whole dataset? And if it is the per-batch loss that gets saved, how can I get the average over the whole dataset when the batch sizes differ? I assumed it was the average, since I took the code from the TensorFlow tutorial and trusted it, but thinking it over, I am not sure that is actually the case.

tf.summary.scalar records exactly the value you pass it on each call, so your code logs the loss of each individual batch (and since you pass step=epoch, every batch within an epoch is written at the same step). If you want TensorBoard to show one loss value per epoch, you need to average over the batches yourself and write that value at the end of each epoch, not on every batch.

First, create a tf.metrics.Mean metric once, before the training loop, and reset it at the start of each epoch (creating a fresh metric inside the loop would not be picked up by the already-traced @tf.function):

mean_epoch_loss = tf.metrics.Mean()

for epoch in range(epochs):
    mean_epoch_loss.reset_state()  # reset_states() on TF < 2.5
    # etc...

Then, inside train_step, update it with the corresponding batch loss:

@tf.function
def train_step(input_image, target, epoch):
    # etc...
    mean_epoch_loss.update_state(gen_total_loss)  # accumulate this batch's loss

At the end of each epoch, write the accumulated average to TensorBoard:

for epoch in range(epochs):
    # etc...
    with summary_writer.as_default():
        tf.summary.scalar('mean_epoch_loss', mean_epoch_loss.result(), step=epoch)
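
Putting the pieces together, here is a minimal self-contained sketch of the pattern. It uses a generic stand-in model, loss, optimizer, and dataset instead of the GAN setup above (those names are placeholders, not part of your code), and it weights each batch by its size via sample_weight, so the logged value is the true per-example mean even when the last batch is smaller:

import tensorflow as tf

# Hypothetical stand-ins for the model, loss, optimizer and dataset used above.
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
loss_fn = tf.keras.losses.MeanSquaredError()
optimizer = tf.keras.optimizers.Adam()
train_ds = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal([100, 4]), tf.random.normal([100, 1]))).batch(32)

summary_writer = tf.summary.create_file_writer('logs')
mean_epoch_loss = tf.metrics.Mean()  # created once, reused across epochs

@tf.function
def train_step(x, y):
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    # Weighting by batch size makes the epoch average a true per-example
    # mean even if the final batch is smaller than the rest.
    mean_epoch_loss.update_state(loss, sample_weight=tf.shape(x)[0])

for epoch in range(5):
    mean_epoch_loss.reset_state()  # reset_states() on TF < 2.5
    for x, y in train_ds:
        train_step(x, y)
    with summary_writer.as_default():
        tf.summary.scalar('mean_epoch_loss', mean_epoch_loss.result(), step=epoch)

If all your batches are the same size, the sample_weight argument can be dropped; the plain mean of the per-batch losses is then already the dataset average.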