Is it normal for my test loss to reach millions?

I'm training a model over several iterations (train, save, train again), and on the second iteration my val_loss is reaching into the millions for some reason. Is there something wrong with how I'm loading the model?

This is how I save the initial model after the first run:
model.save('/content/drive/My Drive/Colab Notebooks/path/to/save/location', save_format='tf')

And this is how I load it and overwrite it:

import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
# preprocess_input must match the application the base model was built with, e.g.:
# from tensorflow.keras.applications.resnet50 import preprocess_input

def retrainmodel(model_path, tr_path, v_path):
  image_size = 224
  BATCH_SIZE_TRAINING = 10
  BATCH_SIZE_VALIDATION = 10
  BATCH_SIZE_TESTING = 1
  EARLY_STOP_PATIENCE = 6
  STEPS_PER_EPOCH_TRAINING = 10
  STEPS_PER_EPOCH_VALIDATION = 10
  NUM_EPOCHS = 20 

  model = tf.keras.models.load_model(model_path)

  data_generator = ImageDataGenerator(preprocessing_function=preprocess_input)


  train_generator = data_generator.flow_from_directory(tr_path,
        target_size=(image_size, image_size),
        batch_size=BATCH_SIZE_TRAINING,
        class_mode='categorical')
  
  validation_generator = data_generator.flow_from_directory(v_path,
        target_size=(image_size, image_size),
        batch_size=BATCH_SIZE_VALIDATION,
        class_mode='categorical') 
  
  cb_early_stopper = EarlyStopping(monitor='val_loss', patience=EARLY_STOP_PATIENCE)
  cb_checkpointer = ModelCheckpoint(filepath='path/to/checkpoint/folder', monitor='val_loss', save_best_only=True, mode='auto')

  fit_history = model.fit(
        train_generator,
        steps_per_epoch=STEPS_PER_EPOCH_TRAINING,
        epochs=NUM_EPOCHS,
        validation_data=validation_generator,
        validation_steps=STEPS_PER_EPOCH_VALIDATION,
        callbacks=[cb_checkpointer, cb_early_stopper]
  )

  model.save('/content/drive/My Drive/Colab Notebooks/path/to/save/location', save_format='tf')

This is my output after passing my directories to this function:

Found 1421 images belonging to 5 classes.
Found 305 images belonging to 5 classes.
Epoch 1/20
10/10 [==============================] - 233s 23s/step - loss: 2.3330 - acc: 0.7200 - val_loss: 4.6237 - val_acc: 0.4400
Epoch 2/20
10/10 [==============================] - 171s 17s/step - loss: 2.7988 - acc: 0.5900 - val_loss: 56996.6289 - val_acc: 0.6800
Epoch 3/20
10/10 [==============================] - 159s 16s/step - loss: 1.2776 - acc: 0.6800 - val_loss: 8396707.0000 - val_acc: 0.6500
Epoch 4/20
10/10 [==============================] - 144s 14s/step - loss: 1.4562 - acc: 0.6600 - val_loss: 2099639.7500 - val_acc: 0.7200
Epoch 5/20
10/10 [==============================] - 126s 13s/step - loss: 1.0970 - acc: 0.7033 - val_loss: 50811.5781 - val_acc: 0.7300
Epoch 6/20
10/10 [==============================] - 127s 13s/step - loss: 0.7326 - acc: 0.8000 - val_loss: 84781.5703 - val_acc: 0.7000
Epoch 7/20
10/10 [==============================] - 110s 11s/step - loss: 1.2356 - acc: 0.7100 - val_loss: 1000.2982 - val_acc: 0.7300

This is my optimizer:

from tensorflow.keras import optimizers

sgd = optimizers.SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(optimizer=sgd, loss='categorical_crossentropy', metrics=['acc'])

Where do you think I'm going wrong?

I'm training my model in stages because I'm working in Google Colab and have 22K images in total, so these results are from after feeding the network 2,800 training images. Do you think it will sort itself out if I give it more images, or is something seriously wrong?

I don't think a loss like this is fine. When you load a model and retrain it, a somewhat higher loss in the first few epochs is to be expected. However, the loss should not shoot for the stars the way it does in your case. If the loss was around 0.5 when the model was saved, then after loading that same model for retraining it should not be more than about 10x the previous value, i.e. an expected value of roughly 5 +- 1. [Note: this is purely empirical; there is no universal way to know the loss in advance.]

If your loss blows up like this, the following causes are plausible:

  1. A changed dataset: altering the dynamics of the training dataset can force the model into this behaviour.

  2. Saving the model may have altered the weights; you can sanity-check this as sketched below.
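
    A quick way to test cause 2 is to evaluate on the same data right before saving and right after reloading; a minimal sketch, assuming the model and validation_generator from the question are still in scope and using a hypothetical save path:

     import tensorflow as tf

     # Evaluate just before saving and just after reloading: if saving
     # preserved the weights, the two numbers should match closely.
     loss_before, acc_before = model.evaluate(validation_generator, steps=10)
     model.save('path/to/save/location', save_format='tf')

     reloaded = tf.keras.models.load_model('path/to/save/location')
     loss_after, acc_after = reloaded.evaluate(validation_generator, steps=10)
     print(loss_before, loss_after)  # a large gap points at save/load, not training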

Suggested solutions:

  1. Try using save_weights on the model instead of the save method:

     model.save_weights('path/to/filename.h5')
    

    Likewise, use load_weights instead of load_model. Since save_weights stores only the parameters, rebuild the architecture and compile it before loading:

     model = call_cnn_function_to_build_model()
     model.compile(... your args ...)
     # load_weights returns None / a status object, so don't reassign model
     model.load_weights('path/to/filename.h5')
    
  2. Since you have checkpoints, try using the model saved by the checkpoint: instead of the final model, load the model from the checkpoint closest to your last epoch, as sketched below.
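
    A minimal sketch of restoring from that checkpoint, assuming the placeholder 'path/to/checkpoint/folder' from the ModelCheckpoint callback above (with save_best_only=True it holds the epoch with the lowest val_loss) and the generator names from the question:

     import tensorflow as tf

     # Restore the best model kept by ModelCheckpoint and resume training
     # from there rather than from the final (possibly diverged) weights.
     best_model = tf.keras.models.load_model('path/to/checkpoint/folder')
     best_model.fit(train_generator, epochs=5, validation_data=validation_generator)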

PS: Corrections are welcome.