Transfer Learning with a Faster RCNN Inception Resnet | Why are new checkpoints saved for every step after the first checkpoint?
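I have about 24000 images in a 1920x384 widescreen format. I want to do transfer learning by training the six object classes available in my image dataset onto a faster_rcnn_inception_resnet_v2_atrous_coco network, pretrained on the COCO dataset, which I downloaded from the tensorflow model zoo.
I use the corresponding config file from here. Apart from the paths to my training and validation *.tfrecords, I changed the file in the following way: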
num_classes: 6 # adjustment to my number of classes
image_resizer {
keep_aspect_ratio_resizer {
min_dimension: 288 # rescaling to 75% of the minimum dimension of the images in my dataset
max_dimension: 1440 # rescaling to 75% of the maximum dimension of the images in my dataset
}
}
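As a sanity check on the two values above, here is a minimal sketch of how an aspect-ratio-preserving resize is commonly computed. It assumes the resizer scales the shorter side to min_dimension unless that would push the longer side past max_dimension; the function name resized_shape is hypothetical and the rounding may differ from what the Object Detection API actually does.

```python
def resized_shape(height, width, min_dimension, max_dimension):
    """Return (new_height, new_width) for an aspect-ratio-preserving resize.

    Assumed rule: scale the shorter side to min_dimension, unless the longer
    side would then exceed max_dimension; in that case scale the longer side
    to max_dimension instead.
    """
    small, large = min(height, width), max(height, width)
    scale = min_dimension / small
    if large * scale > max_dimension:
        scale = max_dimension / large
    return int(round(height * scale)), int(round(width * scale))


# 1920x384 images with the settings from the config above
print(resized_shape(384, 1920, 288, 1440))
```

Under this assumption the 1920x384 images come out as 1440x288, so both limits are hit at exactly 75% and the aspect ratio is kept.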
Starting the training works fine:
Starting Training...
WARNING:tensorflow:Forced number of epochs for all eval validations to be 1.
INFO:tensorflow:Maybe overwriting train_steps: None
INFO:tensorflow:Maybe overwriting sample_1_of_n_eval_examples: 1
INFO:tensorflow:Maybe overwriting eval_num_epochs: 1
INFO:tensorflow:Maybe overwriting load_pretrained: True
INFO:tensorflow:Ignoring config override key: load_pretrained
WARNING:tensorflow:Expected number of evaluation epochs is 1, but instead encountered `eval_on_train_input_config.num_epochs` = 0. Overwriting `num_epochs` to 1.
INFO:tensorflow:create_estimator_and_inputs: use_tpu False, export_to_tpu False
INFO:tensorflow:Using config: {'_model_dir': 'C:\191011_faster_rcnn_inception_resnet_v2_atrous_coco_transfer_learning', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
rewrite_options {
meta_optimizer_iterations: ONE
}
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x0000013C8555B4A8>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
WARNING:tensorflow:Estimator's model_fn (<function create_model_fn.<locals>.model_fn at 0x0000013C85559AE8>) includes params argument, but params are not passed to Estimator.
INFO:tensorflow:Not using Distribute Coordinator.
INFO:tensorflow:Running training and evaluation locally (non-distributed).
INFO:tensorflow:Start train and evaluate loop. The evaluate will happen after every checkpoint.
Checkpoint frequency is determined based on RunConfig arguments: save_checkpoints_steps None or save_checkpoints_secs 600.
WARNING:tensorflow:num_readers has been reduced to 1 to match input file shards.
WARNING:tensorflow:From C:\Users\myuser\Projects\models\research\object_detection\builders\dataset_builder.py:80: parallel_interleave (from tensorflow.contrib.data.python.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.experimental.parallel_interleave(...)`.
WARNING:tensorflow:From C:\Anaconda\envs\tensorflow\lib\site-packages\tensorflow\python\ops\sparse_ops.py:1165: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Create a `tf.sparse.SparseTensor` and use `tf.sparse.to_dense` instead.
WARNING:tensorflow:From C:\Users\myuser\Projects\models\research\object_detection\builders\dataset_builder.py:148: batch_and_drop_remainder (from tensorflow.contrib.data.python.ops.batching) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.batch(..., drop_remainder=True)`.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
WARNING:tensorflow:From C:\Users\myuser\Projects\models\research\object_detection\predictors\heads\box_head.py:93: calling reduce_mean (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating:
keep_dims is deprecated, use keepdims instead
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
WARNING:tensorflow:From C:\Users\myuser\Projects\models\research\object_detection\meta_architectures\faster_rcnn_meta_arch.py:2236: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.get_or_create_global_step
WARNING:root:Variable [SecondStageBoxPredictor/BoxEncodingPredictor/biases] is available in checkpoint, but has an incompatible shape with model variable. Checkpoint shape: [[360]], model variable shape: [[24]]. This variable will not be initialized from the checkpoint.
WARNING:root:Variable [SecondStageBoxPredictor/BoxEncodingPredictor/weights] is available in checkpoint, but has an incompatible shape with model variable. Checkpoint shape: [[1536, 360]], model variable shape: [[1536, 24]]. This variable will not be initialized from the checkpoint.
WARNING:root:Variable [SecondStageBoxPredictor/ClassPredictor/biases] is available in checkpoint, but has an incompatible shape with model variable. Checkpoint shape: [[91]], model variable shape: [[7]]. This variable will not be initialized from the checkpoint.
WARNING:root:Variable [SecondStageBoxPredictor/ClassPredictor/weights] is available in checkpoint, but has an incompatible shape with model variable. Checkpoint shape: [[1536, 91]], model variable shape: [[1536, 7]]. This variable will not be initialized from the checkpoint.
WARNING:root:Variable [global_step] is not available in checkpoint
WARNING:tensorflow:From C:\Users\myuser\Projects\models\research\object_detection\core\losses.py:345: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Future major versions of TensorFlow will allow gradients to flow into the labels input on backprop by default.
See `tf.nn.softmax_cross_entropy_with_logits_v2`.
WARNING:tensorflow:From C:\Users\myuser\Projects\models\research\object_detection\core\losses.py:345: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Future major versions of TensorFlow will allow gradients to flow into the labels input on backprop by default.
See `tf.nn.softmax_cross_entropy_with_logits_v2`.
C:\Anaconda\envs\tensorflow\lib\site-packages\tensorflow\python\ops\gradients_impl.py:112: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
"Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Graph was finalized.
2019-10-11 08:11:42.427791: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
2019-10-11 08:11:43.075302: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: Tesla V100-PCIE-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
pciBusID: 0001:00:00.0
totalMemory: 15.90GiB freeMemory: 15.26GiB
2019-10-11 08:11:43.075684: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible
gpu devices: 0
2019-10-11 08:11:43.524992: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-10-11 08:11:43.525209: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-10-11 08:11:43.525324: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-10-11 08:11:43.525795: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14763 MB memory) -> physical GPU (device: 0, name: Tesla V100-PCIE-16GB, pci bus id: 0001:00:00.0, compute capability: 7.0)
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into C:\191011_faster_rcnn_inception_resnet_v2_atrous_coco_transfer_learning\model.ckpt.
INFO:tensorflow:Saving checkpoints for 0 into C:\191011_faster_rcnn_inception_resnet_v2_atrous_coco_transfer_learning\model.ckpt.
INFO:tensorflow:loss = 2.5189617, step = 0
INFO:tensorflow:loss = 2.5189617, step = 0
INFO:tensorflow:global_step/sec: 1.6828
INFO:tensorflow:global_step/sec: 1.6828
INFO:tensorflow:loss = 1.5950212, step = 100 (59.456 sec)
INFO:tensorflow:loss = 1.5950212, step = 100 (59.456 sec)
INFO:tensorflow:global_step/sec: 2.00219
INFO:tensorflow:global_step/sec: 2.00219
INFO:tensorflow:loss = 0.8909993, step = 200 (49.914 sec)
INFO:tensorflow:loss = 0.8909993, step = 200 (49.914 sec)
....
.... # lines skipped
....
INFO:tensorflow:global_step/sec: 2.04283
INFO:tensorflow:global_step/sec: 2.04283
INFO:tensorflow:loss = 0.2713771, step = 1100 (48.933 sec)
INFO:tensorflow:loss = 0.2713771, step = 1100 (48.933 sec)
INFO:tensorflow:Saving checkpoints for 1162 into C:\191011_faster_rcnn_inception_resnet_v2_atrous_coco_transfer_learning\model.ckpt.
INFO:tensorflow:Saving checkpoints for 1162 into C:\191011_faster_rcnn_inception_resnet_v2_atrous_coco_transfer_learning\model.ckpt.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2019-10-11-06:25:05
INFO:tensorflow:Starting evaluation at 2019-10-11-06:25:05
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Graph was finalized.
2019-10-11 08:25:07.753227: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible
gpu devices: 0
2019-10-11 08:25:07.753427: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device
interconnect StreamExecutor with strength 1 edge matrix:
2019-10-11 08:25:07.753615: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-10-11 08:25:07.753741: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-10-11 08:25:07.754137: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14763 MB memory) -> physical GPU (device: 0, name: Tesla V100-PCIE-16GB, pci bus id: 0001:00:00.0, compute capability: 7.0)
INFO:tensorflow:Restoring parameters from C:\191011_faster_rcnn_inception_resnet_v2_atrous_coco_transfer_learning\model.ckpt-1162
INFO:tensorflow:Restoring parameters from C:\191011_faster_rcnn_inception_resnet_v2_atrous_coco_transfer_learning\model.ckpt-1162
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Loading and preparing annotation results...
INFO:tensorflow:Loading and preparing annotation results...
creating index...
index created!
INFO:tensorflow:DONE (t=0.17s)
INFO:tensorflow:DONE (t=0.17s)
creating index...
index created!
Running per image evaluation...
Evaluate annotation type *bbox*
DONE (t=11.83s).
Accumulating evaluation results...
DONE (t=5.48s).
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.709
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.981
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.904
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.605
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.728
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.794
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.768
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.774
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.775
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.700
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.787
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.835
INFO:tensorflow:Finished evaluation at 2019-10-11-06:40:41
INFO:tensorflow:Finished evaluation at 2019-10-11-06:40:41
INFO:tensorflow:Saving dict for global step 1162: DetectionBoxes_Precision/mAP = 0.70930076,
DetectionBoxes_Precision/mAP (large) = 0.7941316, DetectionBoxes_Precision/mAP (medium) = 0.7282758,
DetectionBoxes_Precision/mAP (small) = 0.6049327, DetectionBoxes_Precision/mAP@.50IOU = 0.98051566,
DetectionBoxes_Precision/mAP@.75IOU = 0.9042774, DetectionBoxes_Recall/AR@1 = 0.7676365,
DetectionBoxes_Recall/AR@10 = 0.77410305, DetectionBoxes_Recall/AR@100 = 0.7745228,
DetectionBoxes_Recall/AR@100 (large) = 0.8347223, DetectionBoxes_Recall/AR@100 (medium) = 0.78670675, DetectionBoxes_Recall/AR@100 (small) = 0.69985116,
Loss/BoxClassifierLoss/classification_loss = 0.0749631, Loss/BoxClassifierLoss/localization_loss = 0.048301302, Loss/RPNLoss/localization_loss = 0.096785806, Loss/RPNLoss/objectness_loss = 0.0898837, Loss/total_loss = 0.30993363, global_step = 1162, learning_rate = 0.0003, loss = 0.30993363
INFO:tensorflow:Saving dict for global step 1162: DetectionBoxes_Precision/mAP = 0.70930076,
DetectionBoxes_Precision/mAP (large) = 0.7941316, DetectionBoxes_Precision/mAP (medium) = 0.7282758,
DetectionBoxes_Precision/mAP (small) = 0.6049327, DetectionBoxes_Precision/mAP@.50IOU = 0.98051566, DetectionBoxes_Precision/mAP@.75IOU = 0.9042774, DetectionBoxes_Recall/AR@1 = 0.7676365, DetectionBoxes_Recall/AR@10 = 0.77410305, DetectionBoxes_Recall/AR@100 = 0.7745228, DetectionBoxes_Recall/AR@100 (large) = 0.8347223, DetectionBoxes_Recall/AR@100 (medium) = 0.78670675, DetectionBoxes_Recall/AR@100 (small) = 0.69985116, Loss/BoxClassifierLoss/classification_loss = 0.0749631, Loss/BoxClassifierLoss/localization_loss = 0.048301302, Loss/RPNLoss/localization_loss = 0.096785806, Loss/RPNLoss/objectness_loss = 0.0898837, Loss/total_loss = 0.30993363, global_step = 1162, learning_rate = 0.0003, loss = 0.30993363
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 1162: C:\191011_faster_rcnn_inception_resnet_v2_atrous_coco_transfer_learning\model.ckpt-1162
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 1162: C:\191011_faster_rcnn_inception_resnet_v2_atrous_coco_transfer_learning\model.ckpt-1162
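So, as you can see, the training ran for 1162 steps and saved a checkpoint as it should (I assume because the 600 seconds set by the key _save_checkpoints_secs had elapsed). Now comes the part I don't understand. Instead of training for roughly another thousand steps before the next checkpoint, it immediately saves a checkpoint for the very next step, 1163: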
INFO:tensorflow:Saving checkpoints for 1163 into C:\191011_faster_rcnn_inception_resnet_v2_atrous_coco_transfer_learning\model.ckpt.
INFO:tensorflow:Saving checkpoints for 1163 into C:\191011_faster_rcnn_inception_resnet_v2_atrous_coco_transfer_learning\model.ckpt.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2019-10-11-06:42:22
INFO:tensorflow:Starting evaluation at 2019-10-11-06:42:22
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Graph was finalized.
2019-10-11 08:42:23.981937: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible
gpu devices: 0
2019-10-11 08:42:23.982106: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device
interconnect StreamExecutor with strength 1 edge matrix:
2019-10-11 08:42:23.982290: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-10-11 08:42:23.982405: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-10-11 08:42:23.982784: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14763 MB memory) -> physical GPU (device: 0, name: Tesla V100-PCIE-16GB, pci bus id: 0001:00:00.0, compute capability: 7.0)
INFO:tensorflow:Restoring parameters from C:\191011_faster_rcnn_inception_resnet_v2_atrous_coco_transfer_learning\model.ckpt-1163
INFO:tensorflow:Restoring parameters from C:\191011_faster_rcnn_inception_resnet_v2_atrous_coco_transfer_learning\model.ckpt-1163
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Done running local_init_op.
...
... # continues with saving checkpoints for all upcoming steps
...
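This continues in the same way for step 1164 and every following step. The question is: why does the training start saving a checkpoint at every step AFTER it has written the first checkpoint, which took clearly more steps?
Additional information: I have already trained with the ssd_resnet_50_fpn_coco network, and that worked fine.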
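After running into the same problem, I found a solution here:
The problem is probably that the validation run takes too long. In the log above, the evaluation starts at 06:25:05 and finishes at 06:40:41, so it takes well over the 600 seconds allowed between checkpoints. By the time training resumes, the checkpoint interval has already expired, so after only one training step a new checkpoint is saved and a new validation is run. That way the training would go on forever, because most of the time is spent on validation.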
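To illustrate the effect, here is a small simulation of a time-based checkpoint trigger. It is a simplified model, not TensorFlow's actual hook code: it assumes the timer is reset when a save starts and that the evaluation following each save blocks training. The function name and the 0.5 seconds per step are illustrative assumptions (the log reports about 2 steps per second); the 936 seconds match the evaluation duration in the log.

```python
def simulate_checkpoints(num_steps, step_secs, save_secs, eval_secs):
    """Return the steps at which a time-based trigger would save a checkpoint.

    Simplified model (an assumption, not the real TensorFlow hook):
    - the timer is reset at the moment a save starts,
    - each save is followed by an evaluation that blocks training
      for eval_secs seconds.
    """
    clock = 0.0       # simulated wall-clock time in seconds
    last_save = 0.0   # pretend a checkpoint was written at step 0
    saved_at = []
    for step in range(1, num_steps + 1):
        clock += step_secs                  # one training step
        if clock - last_save >= save_secs:  # interval elapsed?
            last_save = clock
            saved_at.append(step)
            clock += eval_secs              # evaluation runs after the save
    return saved_at


# Evaluation slower than the interval: a checkpoint on every step after the first.
print(simulate_checkpoints(1205, 0.5, 600, 936))
# Evaluation faster than the interval: checkpoints stay far apart.
print(simulate_checkpoints(2400, 0.5, 600, 60))
```

In this model, once eval_secs is larger than save_secs, every step after the first save triggers another save, which is the pattern seen for steps 1163, 1164 and so on.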
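To make this work, an additional parameter has to be added to the line
config = tf.estimator.RunConfig(model_dir=FLAGS.model_dir)
in the file models/research/object_detection/model_main.py. Add either save_checkpoints_steps or save_checkpoints_secs (one of them, not both; setting both does not work). This sets the number of training steps or the amount of time that passes before a checkpoint is created and the corresponding validation is run.
For example:
config = tf.estimator.RunConfig(model_dir=FLAGS.model_dir, save_checkpoints_steps=2000)
will save a checkpoint and run a validation every 2000 training steps.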
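When picking a value, it can help to estimate how much of the wall-clock time would still go to validation. The sketch below is plain arithmetic under stated assumptions: a constant training speed, a constant evaluation duration, and a hypothetical helper name eval_time_fraction. The inputs of about 2 steps per second and 936 seconds per evaluation are read off the log in the question.

```python
def eval_time_fraction(save_checkpoints_steps, steps_per_sec, eval_secs):
    """Fraction of wall-clock time spent evaluating, per checkpoint cycle.

    Assumes one evaluation of eval_secs seconds after every
    save_checkpoints_steps training steps at a constant steps_per_sec.
    """
    train_secs = save_checkpoints_steps / steps_per_sec
    return eval_secs / (train_secs + eval_secs)


# Numbers taken from the log: ~2 steps/sec, evaluation ~936 s
for steps in (2000, 10000):
    print(steps, round(eval_time_fraction(steps, 2.0, 936), 3))
```

With these figures, 2000 steps between checkpoints would still leave almost half of the time in validation, so a larger interval may be worth considering.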
我有大约 24000 张 1920x384 宽屏格式的图像,我想通过将我的图像数据集中可用的六个 类 对象训练到 faster_rcnn_inception_resnet_v2_atrous_coco[= 上来进行迁移学习31=] 网络,在我从 tensorflow model zoo.
下载的 COCO 数据集上预训练我使用 here 中的相应配置文件,我更改了该文件(尽管我的训练和验证路径 *.tfrecords
以下列方式
num_classes: 6 # adjustment to my number of classes
image_resizer {
keep_aspect_ratio_resizer {
min_dimension: 288 # rescaling to 75% of the minimum dimension of the images in my dataset
max_dimension: 1440 # rescaling to 75% of the maximum dimension of the images in my dataset
}
}
开始训练效果很好
Starting Training...
WARNING:tensorflow:Forced number of epochs for all eval validations to be 1.
INFO:tensorflow:Maybe overwriting train_steps: None
INFO:tensorflow:Maybe overwriting sample_1_of_n_eval_examples: 1
INFO:tensorflow:Maybe overwriting eval_num_epochs: 1
INFO:tensorflow:Maybe overwriting load_pretrained: True
INFO:tensorflow:Ignoring config override key: load_pretrained
WARNING:tensorflow:Expected number of evaluation epochs is 1, but instead encountered `eval_on_train_input_config.num_epochs` = 0. Overwriting `num_epochs` to 1.
INFO:tensorflow:create_estimator_and_inputs: use_tpu False, export_to_tpu False
INFO:tensorflow:Using config: {'_model_dir': 'C:\191011_faster_rcnn_inception_resnet_v2_atrous_coco_transfer_learning', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
rewrite_options {
meta_optimizer_iterations: ONE
}
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x0000013C8555B4A8>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
WARNING:tensorflow:Estimator''s model_fn (<function create_model_fn.<locals>.model_fn at 0x0000013C85559AE8>) includes params argument, but params are not passed to Estimator.
INFO:tensorflow:Not using Distribute Coordinator.
INFO:tensorflow:Running training and evaluation locally (non-distributed).
INFO:tensorflow:Start train and evaluate loop. The evaluate will happen after every checkpoint.
Checkpoint frequency is determined based on RunConfig arguments: save_checkpoints_steps None or save_checkpoints_secs 600.
WARNING:tensorflow:num_readers has been reduced to 1 to match input file shards.
WARNING:tensorflow:From C:\Users\myuser\Projects\models\research\object_detection\builders\dataset_builder.py:80: parallel_interleave (from tensorflow.contrib.data.python.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.experimental.parallel_interleave(...)`.
WARNING:tensorflow:From C:\Anaconda\envs\tensorflow\lib\site-packages\tensorflow\python\ops\sparse_ops.py:1165: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Create a `tf.sparse.SparseTensor` and use `tf.sparse.to_dense` instead.
WARNING:tensorflow:From C:\Users\myuser\Projects\models\research\object_detection\builders\dataset_builder.py:148: batch_and_drop_remainder (from tensorflow.contrib.data.python.ops.batching) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.batch(..., drop_remainder=True)`.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
WARNING:tensorflow:From C:\Users\myuser\Projects\models\research\object_detection\predictors\heads\box_head.py:93: calling reduce_mean (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating:
keep_dims is deprecated, use keepdims instead
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
WARNING:tensorflow:From C:\Users\myuser\Projects\models\research\object_detection\meta_architectures\faster_rcnn_meta_arch.py:2236: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.get_or_create_global_step
WARNING:root:Variable [SecondStageBoxPredictor/BoxEncodingPredictor/biases] is available in checkpoint, but has an incompatible shape with model variable. Checkpoint shape: [[360]], model variable shape: [[24]]. This variable will not be initialized from the checkpoint.
WARNING:root:Variable [SecondStageBoxPredictor/BoxEncodingPredictor/weights] is available in checkpoint, but has an incompatible shape with model variable. Checkpoint shape: [[1536, 360]], model variable shape: [[1536, 24]]. This variable will not be initialized from the checkpoint.
WARNING:root:Variable [SecondStageBoxPredictor/ClassPredictor/biases] is available in checkpoint, but has an incompatible shape with model variable. Checkpoint shape: [[91]], model variable shape: [[7]]. This variable will not be initialized from the checkpoint.
WARNING:root:Variable [SecondStageBoxPredictor/ClassPredictor/weights] is available in checkpoint, but has an incompatible shape with model variable. Checkpoint shape: [[1536, 91]], model variable shape: [[1536, 7]]. This variable will not be initialized from the checkpoint.
WARNING:root:Variable [global_step] is not available in checkpoint
WARNING:tensorflow:From C:\Users\myuser\Projects\models\research\object_detection\core\losses.py:345: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Future major versions of TensorFlow will allow gradients to flow into the labels input on backprop by default.
See `tf.nn.softmax_cross_entropy_with_logits_v2`.
WARNING:tensorflow:From C:\Users\myuser\Projects\models\research\object_detection\core\losses.py:345: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Future major versions of TensorFlow will allow gradients to flow into the labels input on backprop by default.
See `tf.nn.softmax_cross_entropy_with_logits_v2`.
C:\Anaconda\envs\tensorflow\lib\site-packages\tensorflow\python\ops\gradients_impl.py:112: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
"Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Graph was finalized.
2019-10-11 08:11:42.427791: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
2019-10-11 08:11:43.075302: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: Tesla V100-PCIE-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
pciBusID: 0001:00:00.0
totalMemory: 15.90GiB freeMemory: 15.26GiB
2019-10-11 08:11:43.075684: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible
gpu devices: 0
2019-10-11 08:11:43.524992: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-10-11 08:11:43.525209: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-10-11 08:11:43.525324: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-10-11 08:11:43.525795: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14763 MB memory) -> physical GPU (device: 0, name: Tesla V100-PCIE-16GB, pci bus id: 0001:00:00.0, compute capability: 7.0)
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into C:\191011_faster_rcnn_inception_resnet_v2_atrous_coco_transfer_learning\model.ckpt.
INFO:tensorflow:Saving checkpoints for 0 into C:\191011_faster_rcnn_inception_resnet_v2_atrous_coco_transfer_learning\model.ckpt.
INFO:tensorflow:loss = 2.5189617, step = 0
INFO:tensorflow:loss = 2.5189617, step = 0
INFO:tensorflow:global_step/sec: 1.6828
INFO:tensorflow:global_step/sec: 1.6828
INFO:tensorflow:loss = 1.5950212, step = 100 (59.456 sec)
INFO:tensorflow:loss = 1.5950212, step = 100 (59.456 sec)
INFO:tensorflow:global_step/sec: 2.00219
INFO:tensorflow:global_step/sec: 2.00219
INFO:tensorflow:loss = 0.8909993, step = 200 (49.914 sec)
INFO:tensorflow:loss = 0.8909993, step = 200 (49.914 sec)
....
.... # lines skipped
....
INFO:tensorflow:global_step/sec: 2.04283
INFO:tensorflow:global_step/sec: 2.04283
INFO:tensorflow:loss = 0.2713771, step = 1100 (48.933 sec)
INFO:tensorflow:loss = 0.2713771, step = 1100 (48.933 sec)
INFO:tensorflow:Saving checkpoints for 1162 into C:\191011_faster_rcnn_inception_resnet_v2_atrous_coco_transfer_learning\model.ckpt.
INFO:tensorflow:Saving checkpoints for 1162 into C:\191011_faster_rcnn_inception_resnet_v2_atrous_coco_transfer_learning\model.ckpt.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2019-10-11-06:25:05
INFO:tensorflow:Starting evaluation at 2019-10-11-06:25:05
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Graph was finalized.
2019-10-11 08:25:07.753227: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible
gpu devices: 0
2019-10-11 08:25:07.753427: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device
interconnect StreamExecutor with strength 1 edge matrix:
2019-10-11 08:25:07.753615: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-10-11 08:25:07.753741: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-10-11 08:25:07.754137: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14763 MB memory) -> physical GPU (device: 0, name: Tesla V100-PCIE-16GB, pci bus id: 0001:00:00.0, compute capability: 7.0)
INFO:tensorflow:Restoring parameters from C:\191011_faster_rcnn_inception_resnet_v2_atrous_coco_transfer_learning\model.ckpt-1162
INFO:tensorflow:Restoring parameters from C:\191011_faster_rcnn_inception_resnet_v2_atrous_coco_transfer_learning\model.ckpt-1162
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Loading and preparing annotation results...
INFO:tensorflow:Loading and preparing annotation results...
creating index...
index created!
INFO:tensorflow:DONE (t=0.17s)
INFO:tensorflow:DONE (t=0.17s)
creating index...
index created!
Running per image evaluation...
Evaluate annotation type *bbox*
DONE (t=11.83s).
Accumulating evaluation results...
DONE (t=5.48s).
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.709
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.981
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.904
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.605
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.728
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.794
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.768
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.774
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.775
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.700
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.787
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.835
INFO:tensorflow:Finished evaluation at 2019-10-11-06:40:41
INFO:tensorflow:Finished evaluation at 2019-10-11-06:40:41
INFO:tensorflow:Saving dict for global step 1162: DetectionBoxes_Precision/mAP = 0.70930076,
DetectionBoxes_Precision/mAP (large) = 0.7941316, DetectionBoxes_Precision/mAP (medium) = 0.7282758,
DetectionBoxes_Precision/mAP (small) = 0.6049327, DetectionBoxes_Precision/mAP@.50IOU = 0.98051566,
DetectionBoxes_Precision/mAP@.75IOU = 0.9042774, DetectionBoxes_Recall/AR@1 = 0.7676365,
DetectionBoxes_Recall/AR@10 = 0.77410305, DetectionBoxes_Recall/AR@100 = 0.7745228,
DetectionBoxes_Recall/AR@100 (large) = 0.8347223, DetectionBoxes_Recall/AR@100 (medium) = 0.78670675, DetectionBoxes_Recall/AR@100 (small) = 0.69985116,
Loss/BoxClassifierLoss/classification_loss = 0.0749631, Loss/BoxClassifierLoss/localization_loss = 0.048301302, Loss/RPNLoss/localization_loss = 0.096785806, Loss/RPNLoss/objectness_loss = 0.0898837, Loss/total_loss = 0.30993363, global_step = 1162, learning_rate = 0.0003, loss = 0.30993363
INFO:tensorflow:Saving dict for global step 1162: DetectionBoxes_Precision/mAP = 0.70930076,
DetectionBoxes_Precision/mAP (large) = 0.7941316, DetectionBoxes_Precision/mAP (medium) = 0.7282758,
DetectionBoxes_Precision/mAP (small) = 0.6049327, DetectionBoxes_Precision/mAP@.50IOU = 0.98051566, DetectionBoxes_Precision/mAP@.75IOU = 0.9042774, DetectionBoxes_Recall/AR@1 = 0.7676365, DetectionBoxes_Recall/AR@10 = 0.77410305, DetectionBoxes_Recall/AR@100 = 0.7745228, DetectionBoxes_Recall/AR@100 (large) = 0.8347223, DetectionBoxes_Recall/AR@100 (medium) = 0.78670675, DetectionBoxes_Recall/AR@100 (small) = 0.69985116, Loss/BoxClassifierLoss/classification_loss = 0.0749631, Loss/BoxClassifierLoss/localization_loss = 0.048301302, Loss/RPNLoss/localization_loss = 0.096785806, Loss/RPNLoss/objectness_loss = 0.0898837, Loss/total_loss = 0.30993363, global_step = 1162, learning_rate = 0.0003, loss = 0.30993363
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 1162: C:\191011_faster_rcnn_inception_resnet_v2_atrous_coco_transfer_learning\model.ckpt-1162
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 1162: C:\191011_faster_rcnn_inception_resnet_v2_atrous_coco_transfer_learning\model.ckpt-1162
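So, as you can see, the training ran for 1162 steps and saved a checkpoint as it should (I assume because the 600 seconds of the key _save_checkpoints_secs had elapsed). Now comes the part I don't understand. Instead of training for another thousand or so steps until the next checkpoint, it immediately saves a checkpoint for the very next step, 1163: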
INFO:tensorflow:Saving checkpoints for 1163 into C:\191011_faster_rcnn_inception_resnet_v2_atrous_coco_transfer_learning\model.ckpt.
INFO:tensorflow:Saving checkpoints for 1163 into C:\191011_faster_rcnn_inception_resnet_v2_atrous_coco_transfer_learning\model.ckpt.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2019-10-11-06:42:22
INFO:tensorflow:Starting evaluation at 2019-10-11-06:42:22
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Graph was finalized.
2019-10-11 08:42:23.981937: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-10-11 08:42:23.982106: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-10-11 08:42:23.982290: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-10-11 08:42:23.982405: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-10-11 08:42:23.982784: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14763 MB memory) -> physical GPU (device: 0, name: Tesla V100-PCIE-16GB, pci bus id: 0001:00:00.0, compute capability: 7.0)
INFO:tensorflow:Restoring parameters from C:\191011_faster_rcnn_inception_resnet_v2_atrous_coco_transfer_learning\model.ckpt-1163
INFO:tensorflow:Restoring parameters from C:\191011_faster_rcnn_inception_resnet_v2_atrous_coco_transfer_learning\model.ckpt-1163
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Done running local_init_op.
...
... # continues with saving checkpoints for all upcoming steps
...
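The same thing happens for step 1164 and every step after it. The question is: why does the training start saving a checkpoint at every single step AFTER it wrote its first checkpoint only after well over a thousand steps?
Additional information: I have already trained the ssd_resnet_50_fpn_coco network with the same setup, and that worked fine.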
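After running into the same problem, I found the solution here:
The problem is probably that the validation run takes too long. The 600 seconds are simply not enough, so after just one training step a new checkpoint is due and a new validation is executed. That way the training takes forever, because most of the time is spent on validation.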
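To make the mechanism concrete, here is a toy simulation in plain Python. It is my own simplification, not TensorFlow code, and it rests on two assumptions: the checkpoint timer is stamped at the moment a save is triggered, and the validation that follows the save blocks training until it finishes. The function name and all timings are made up for illustration.

```python
def simulate_checkpoints(save_secs, step_secs, eval_secs, total_steps):
    """Return the global steps at which a checkpoint would be written.

    Toy model (assumption): the timer is reset when a save is triggered,
    and the evaluation after each save blocks training for eval_secs.
    """
    clock = 0.0          # simulated wall-clock time in seconds
    last_trigger = 0.0   # time at which the last save was triggered
    saved_at = []
    for step in range(1, total_steps + 1):
        clock += step_secs                    # one training step
        if clock - last_trigger >= save_secs:
            last_trigger = clock              # timer reset before evaluation
            saved_at.append(step)
            clock += eval_secs                # evaluation blocks training
    return saved_at


# Evaluation longer than the save interval: a checkpoint at every step.
print(simulate_checkpoints(600, 0.5, 700, 1203))  # [1200, 1201, 1202, 1203]
# Evaluation much shorter than the save interval: checkpoints stay far apart.
print(simulate_checkpoints(600, 0.5, 60, 2300))   # [1200, 2280]
```

In this model, once the evaluation alone takes longer than the save interval, the timer has already expired by the time training resumes, so a single step is enough to trigger the next checkpoint.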
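To get it working, an additional parameter has to be added to the call
config = tf.estimator.RunConfig(model_dir=FLAGS.model_dir)
in the file models/research/object_detection/model_main.py. Add either the parameter save_checkpoints_steps or the parameter save_checkpoints_secs, but not both (setting both does not work). This lets you choose the number of steps or the amount of time that passes before a checkpoint is created and the corresponding validation is run.
For example: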
config = tf.estimator.RunConfig(model_dir=FLAGS.model_dir, save_checkpoints_steps=2000)
saves a checkpoint and runs one validation every 2000 training steps.
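If you are unsure which value to pick for save_checkpoints_steps, a rough estimate can be derived from how long one training step and one validation run take. The helper below is only a sketch of that arithmetic; its name and the max_eval_fraction parameter are my own and not part of TensorFlow, and the validation duration of 700 seconds is a placeholder that you should replace with your own measurement.

```python
import math


def min_steps_between_checkpoints(step_secs, eval_secs, max_eval_fraction):
    """Smallest number of training steps per checkpoint such that validation
    takes at most max_eval_fraction of the total wall-clock time."""
    if step_secs <= 0 or eval_secs < 0:
        raise ValueError("step_secs must be positive and eval_secs non-negative")
    if not 0.0 < max_eval_fraction < 1.0:
        raise ValueError("max_eval_fraction must be strictly between 0 and 1")
    # eval / (eval + n * step) <= f   =>   n >= eval * (1 - f) / (f * step)
    train_secs_needed = eval_secs * (1.0 - max_eval_fraction) / max_eval_fraction
    return math.ceil(train_secs_needed / step_secs)


# Step time estimated from the question's log, assuming the first checkpoint
# at step 1162 really was triggered by the 600-second timer.
step_secs = 600.0 / 1162
# Placeholder validation duration in seconds (hypothetical, measure your own).
eval_secs = 700.0
print(min_steps_between_checkpoints(step_secs, eval_secs, max_eval_fraction=0.25))
```

The printed number is a lower bound for save_checkpoints_steps if validation should use no more than a quarter of the total run time; larger values shift even more of the time towards training.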