TF2 Object Detection API: model_main_tf2.py - validation loss?

Question

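For the past 2 months I have been trying to train an object detection model, and I finally succeeded by following this tutorial.

Here is my colab, which contains all of my work.

The problem is that the training loss is shown, and it decreases on average, but the validation loss is not shown at all.

In the pipeline.config file I did provide the evaluation TFRecord file (which I assume is the validation data input), as follows: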

eval_config {
  metrics_set: "coco_detection_metrics"
  use_moving_averages: false
}

eval_input_reader {
  label_map_path: "annotations/label_map.pbtxt"
  shuffle: false
  num_epochs: 1
  tf_record_input_reader {
    input_path: "annotations/test.record"
  }
}
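Before starting an evaluation run, it can be worth confirming that the record file named in eval_input_reader actually exists. The following is a minimal sketch, assuming the config is plain protobuf text like the snippet above and that eval_input_reader comes after train_input_reader in the file. The helper names eval_input_paths and missing_paths are hypothetical, and the regex is a rough text scan, not a real protobuf parser.

```python
import os
import re


def eval_input_paths(config_text):
    """Return the input_path values found after the eval_input_reader keyword.

    Rough text scan (not a protobuf parser): it only looks at the text that
    follows the first 'eval_input_reader' occurrence.
    """
    start = config_text.find("eval_input_reader")
    if start == -1:
        return []
    return re.findall(r'input_path:\s*"([^"]+)"', config_text[start:])


def missing_paths(paths):
    """Return the subset of paths that do not exist on disk."""
    return [p for p in paths if not os.path.exists(p)]
```

Reading pipeline.config into a string and passing it through both helpers gives the list of eval record files that cannot be found. Paths are resolved relative to the current working directory, and sharded or wildcard patterns would be reported as missing by this simple check.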

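I read through model_main_tf2.py, and it does not seem to evaluate during training; it only evaluates when checkpoint_dir is given.

As a result, I can only monitor the loss on the training set, not on the validation set.

So I have no idea whether the model is overfitting or underfitting.

Has any of you managed to see the validation loss using model_main_tf2.py?

It would also be nice to see the mAP score after training.

I know that with Keras training you can see all of this in TensorBoard, but ODAPI seems harder.

Thank you for your time, and let me know if anything is still unclear.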

Answer 1

You have to open another terminal and run this command:

python model_main_tf2.py \
   --model_dir=models/my_ssd_resnet50_v1_fpn \
   --pipeline_config_path=models/my_ssd_resnet50_v1_fpn/pipeline.config \
   --checkpoint_dir=models/my_ssd_resnet50_v1_fpn

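The API tutorial is unclear on this topic. I ran into exactly the same problem.

It turns out that the evaluation process is not part of the training loop; you have to launch it in parallel.

It will wait and print waiting for new checkpoint, which means you then need to start the training with: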

python model_main_tf2.py --model_dir=models/my_ssd_resnet50_v1_fpn --pipeline_config_path=models/my_ssd_resnet50_v1_fpn/pipeline.config # note that the checkpoint_dir argument is not there
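If two terminals are inconvenient (for example in a notebook), the same two commands can be started from one Python script. This is only a sketch under assumptions: it reuses the paths from the commands above, assumes model_main_tf2.py is in the current directory, and the helper names train_command, eval_command and launch_both are hypothetical.

```python
import subprocess
import sys

MODEL_DIR = "models/my_ssd_resnet50_v1_fpn"
PIPELINE = MODEL_DIR + "/pipeline.config"


def train_command(model_dir=MODEL_DIR, pipeline=PIPELINE):
    """Training run: no --checkpoint_dir flag."""
    return [
        sys.executable, "model_main_tf2.py",
        "--model_dir=" + model_dir,
        "--pipeline_config_path=" + pipeline,
    ]


def eval_command(model_dir=MODEL_DIR, pipeline=PIPELINE):
    """Evaluation run: same flags as training plus --checkpoint_dir."""
    return train_command(model_dir, pipeline) + ["--checkpoint_dir=" + model_dir]


def launch_both():
    """Start the evaluator, then the trainer, and wait for training to end."""
    evaluator = subprocess.Popen(eval_command())
    trainer = subprocess.Popen(train_command())
    try:
        return trainer.wait()
    finally:
        # Stops the evaluator as soon as training exits; this may cut off
        # the evaluation of the very last checkpoint.
        evaluator.terminate()
```

Calling launch_both() starts both processes. Nothing is launched when the functions are only defined, so the command lists can be inspected first.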

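It will then run an evaluation every eval_interval_secs, as set in your eval_config.

According to the documentation, the evaluation metrics are stored in an eval_0 directory next to the checkpoints, and you can then plot them in TensorBoard.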
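To check whether the evaluation process has written anything yet, one option is to look for TensorBoard event files under the model directory. The sketch below does not hard-code the eval directory name, since that may differ between versions; it relies only on the events.out.tfevents file name prefix. The helper names find_event_dirs and tensorboard_command are hypothetical.

```python
import os


def find_event_dirs(model_dir):
    """Return sorted directories under model_dir that contain event files."""
    found = set()
    for root, _dirs, files in os.walk(model_dir):
        if any(name.startswith("events.out.tfevents") for name in files):
            found.add(root)
    return sorted(found)


def tensorboard_command(model_dir):
    """Command line that points TensorBoard at the whole model directory."""
    return ["tensorboard", "--logdir=" + model_dir]
```

If find_event_dirs only reports the training directory, the evaluation job has not produced summaries yet. Pointing TensorBoard at the parent model directory shows training and evaluation runs side by side once both exist.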

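I agree this is a bit hard to figure out, since the documentation is not very clear about it. It is also not very convenient: I had to allocate another GPU for evaluation to avoid CUDA out-of-memory errors.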
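One way to keep the evaluation process off the training GPU is the CUDA_VISIBLE_DEVICES environment variable. The sketch below builds an environment for the evaluation process only; the helper name eval_environment is hypothetical. Note that the default shown here hides all GPUs so evaluation runs on CPU, which is a different choice from the second GPU described above and will usually be slower.

```python
import os


def eval_environment(device="-1", base_env=None):
    """Copy an environment and restrict which CUDA devices the process sees.

    device="-1" hides every GPU (CPU-only evaluation); device="1" would
    expose only the second GPU to the process.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = device
    return env
```

The returned dictionary can be passed as the env argument of subprocess.Popen when starting the evaluation command, leaving the training process with its default device visibility.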

Have a nice day.

