Can you run the training and evaluation process from a single anaconda prompt?
I am unable to evaluate my training process while training a TensorFlow 2 custom object detector. After reading several issues related to this problem, I found that evaluation and training should be treated as two separate processes, and therefore I should start the evaluation job from a new anaconda prompt.
I am training the ssd_mobilenet_v2 640x640 model. My pipeline config:
model {
  ssd {
    num_classes: 6
    image_resizer {
      fixed_shape_resizer {
        height: 640
        width: 640
      }
    }
    feature_extractor {
      type: "ssd_mobilenet_v2_fpn_keras"
      depth_multiplier: 1.0
      min_depth: 16
      conv_hyperparams {
        regularizer {
          l2_regularizer {
            weight: 3.9999998989515007e-05
          }
        }
        initializer {
          random_normal_initializer {
            mean: 0.0
            stddev: 0.009999999776482582
          }
        }
        activation: RELU_6
        batch_norm {
          decay: 0.996999979019165
          scale: true
          epsilon: 0.0010000000474974513
        }
      }
      use_depthwise: true
      override_base_feature_extractor_hyperparams: true
      fpn {
        min_level: 3
        max_level: 7
        additional_layer_depth: 128
      }
    }
    box_coder {
      faster_rcnn_box_coder {
        y_scale: 10.0
        x_scale: 10.0
        height_scale: 5.0
        width_scale: 5.0
      }
    }
    matcher {
      argmax_matcher {
        matched_threshold: 0.5
        unmatched_threshold: 0.5
        ignore_thresholds: false
        negatives_lower_than_unmatched: true
        force_match_for_each_row: true
        use_matmul_gather: true
      }
    }
    similarity_calculator {
      iou_similarity {
      }
    }
    box_predictor {
      weight_shared_convolutional_box_predictor {
        conv_hyperparams {
          regularizer {
            l2_regularizer {
              weight: 3.9999998989515007e-05
            }
          }
          initializer {
            random_normal_initializer {
              mean: 0.0
              stddev: 0.009999999776482582
            }
          }
          activation: RELU_6
          batch_norm {
            decay: 0.996999979019165
            scale: true
            epsilon: 0.0010000000474974513
          }
        }
        depth: 128
        num_layers_before_predictor: 4
        kernel_size: 3
        class_prediction_bias_init: -4.599999904632568
        share_prediction_tower: true
        use_depthwise: true
      }
    }
    anchor_generator {
      multiscale_anchor_generator {
        min_level: 3
        max_level: 7
        anchor_scale: 4.0
        aspect_ratios: 1.0
        aspect_ratios: 2.0
        aspect_ratios: 0.5
        scales_per_octave: 2
      }
    }
    post_processing {
      batch_non_max_suppression {
        score_threshold: 9.99999993922529e-09
        iou_threshold: 0.6000000238418579
        max_detections_per_class: 100
        max_total_detections: 100
        use_static_shapes: false
      }
      score_converter: SIGMOID
    }
    normalize_loss_by_num_matches: true
    loss {
      localization_loss {
        weighted_smooth_l1 {
        }
      }
      classification_loss {
        weighted_sigmoid_focal {
          gamma: 2.0
          alpha: 0.25
        }
      }
      classification_weight: 1.0
      localization_weight: 1.0
    }
    encode_background_as_zeros: true
    normalize_loc_loss_by_codesize: true
    inplace_batchnorm_update: true
    freeze_batchnorm: false
  }
}
train_config {
  batch_size: 4
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
  # data_augmentation_options {
  #   random_crop_image {
  #     min_object_covered: 0.0
  #     min_aspect_ratio: 0.75
  #     max_aspect_ratio: 3.0
  #     min_area: 0.75
  #     max_area: 1.0
  #     overlap_thresh: 0.0
  #   }
  # }
  optimizer {
    momentum_optimizer {
      learning_rate {
        cosine_decay_learning_rate {
          learning_rate_base: 0.04999999821186066
          total_steps: 50000
          warmup_learning_rate: 0.0026666000485420227
          warmup_steps: 600
        }
      }
      momentum_optimizer_value: 0.8999999761581421
    }
    use_moving_average: false
  }
  fine_tune_checkpoint: "pre-trained-models\ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8\checkpoint\ckpt-0"
  num_steps: 50000
  startup_delay_steps: 0.0
  replicas_to_aggregate: 8
  max_number_of_boxes: 100
  unpad_groundtruth_tensors: false
  fine_tune_checkpoint_type: "detection"
  fine_tune_checkpoint_version: V2
  from_detection_checkpoint: true
}
train_input_reader {
  label_map_path: "annotations/label_map.pbtxt"
  tf_record_input_reader {
    input_path: "data/train.record"
  }
}
eval_config {
  metrics_set: "coco_detection_metrics"
  use_moving_averages: false
}
eval_input_reader {
  label_map_path: "annotations/label_map.pbtxt"
  shuffle: false
  num_epochs: 1
  tf_record_input_reader {
    input_path: "data/test.record"
  }
}
I started the training with the command:
python model_main_tf2.py --model_dir=models/my_ssd2_3/ --pipeline_config_path=models/my_ssd2_3/pipeline.config --sample_1_of_n_eval_examples 1 --logtostderr
I hoped that setting the number of evaluation examples would have some effect on starting an evaluation job alongside training. In any case, I tried running the evaluation in a separate terminal window:
python model_main_tf2.py --model_dir=models/my_ssd2_3 --pipeline_config_path=models/my_ssd2_3/pipeline.config --checkpoint_dir=models/my_ssd2_3/ --alsologtostderr
As soon as the evaluation starts, the training job crashes with the following error: error
I suspect the problem is that my hardware is not sufficient:
- 8 GB RAM
- NVIDIA GTX 960M (2 GB VRAM)
Could the problem be that all of my input images are 3000x3000, so the preprocessor has to load too much data? If so, is there any way around it? I would rather not resize all the images before generating the TFRecord files, because then I would have to re-label every image. I clearly lack insight into how memory is allocated at the start of the training process, so some details would be much appreciated.
The second problem is that while monitoring the training on TensorBoard, the images are displayed with varying brightness. I tried changing line 627 in the model_lib_v2.py file to:
data = (features[fields.InputDataFields.image] - np.min(features[fields.InputDataFields.image])) / (np.max(features[fields.InputDataFields.image]) - np.min(features[fields.InputDataFields.image]))
following this solution: https://github.com/tensorflow/models/issues/9115
without any luck. Is there a fix for this? It would also be great if I could monitor the bounding boxes the model proposes there. Thank you.
You can alternate training and evaluation in the same application by making some changes to the train_loop function in model_lib.py. See the examples below.
As I understand it, the TensorFlow Object Detection API is developed with a focus on distributed learning: if you were using multiple GPUs/TPUs, you could have some devices train while others evaluate. So I suspect the way model_lib.py is currently implemented does not fully support training and evaluating on the same device.
I am not sure of the root cause of the error you are seeing; usually I see TensorFlow throw an OOM error when it runs into memory problems. It may be that the way TensorFlow uses CUDA does not support two applications sharing the same device.
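If it is memory related, there are two standard TF2 controls worth trying before giving up on a single machine. This is only a sketch of the idea, not something I have verified on your setup:

import os

# Option 1: hide the GPU from the evaluation process so it runs on CPU and
# cannot compete with the training job for the 2 GB of VRAM. This must be
# set before TensorFlow initializes CUDA (e.g. at the top of the script).
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

# Option 2 (instead of option 1): let each process allocate GPU memory on
# demand rather than claiming nearly all of it at startup (the TF default).
import tensorflow as tf
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)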
Regarding your second question, I followed the advice here on the same thread, and it worked for me. The code is copied in the third code block below. At first it did not seem to work for me, because I had naively updated the files in the object detection repository I had cloned; but your application is probably using the Object Detection API installed in your site-packages, so I would suggest confirming that the file you are changing is the same one loaded by your import statements.
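For example, a quick way to confirm which copy of the file Python actually imports (assuming the standard object_detection package layout):

# Print the path of the module your application really loads; if it points
# into site-packages, that is the copy of model_lib_v2.py you need to edit.
from object_detection import model_lib_v2
print(model_lib_v2.__file__)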
--
This goes outside the training loop:
## Set up evaluation data and writer
eval_config = configs['eval_config']
eval_input_configs = configs['eval_input_configs']
eval_input_config = eval_input_configs[0]
eval_input = strategy.experimental_distribute_dataset(
    inputs.eval_input(
        eval_config=eval_config,
        eval_input_config=eval_input_config,
        model_config=model_config,
        model=detection_model))
summary_writer_eval = tf.compat.v2.summary.create_file_writer(
    os.path.join(model_dir, 'eval', eval_input_config.name))
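Note that the writer points at an eval/ subdirectory of model_dir, so TensorBoard picks the evaluation metrics up as a separate run next to the training summaries, the same layout the two-process workflow produces.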
Here is the modified train/evaluation loop; the evaluation happens near the end.
for _ in range(global_step.value(), train_steps, num_steps_per_iteration):
  tf.logging.info('Performing Training')
  with summary_writer_train.as_default():
    with tf.compat.v2.summary.record_if(lambda: global_step % num_steps_per_iteration == 0):
      losses_dict = _dist_train_step(train_input_iter)

      time_taken = time.time() - last_step_time
      last_step_time = time.time()
      steps_per_sec = num_steps_per_iteration * 1.0 / time_taken

      tf.compat.v2.summary.scalar(
          'steps_per_sec', steps_per_sec, step=global_step)
      steps_per_sec_list.append(steps_per_sec)

      logged_dict = losses_dict.copy()
      logged_dict['learning_rate'] = learning_rate_fn()

      for key, val in logged_dict.items():
        tf.compat.v2.summary.scalar(key, val, step=global_step)

      if global_step.value() - logged_step >= 0:
        logged_dict_np = {name: value.numpy() for name, value in
                          logged_dict.items()}
        tf.logging.info(
            'Step {} per-step time {:.3f}s'.format(
                global_step.value(), time_taken / num_steps_per_iteration))
        tf.logging.info(pprint.pformat(logged_dict_np, width=40))
        print_gpu_memory_usage()
        logged_step = global_step.value()

  if ((int(global_step.value()) - checkpointed_step) >=
      checkpoint_every_n):
    manager.save()
    checkpointed_step = int(global_step.value())

  tf.logging.info('Performing Evaluation')
  with summary_writer_eval.as_default():
    eager_eval_loop(
        detection_model,
        configs,
        eval_input,
        use_tpu=use_tpu,
        global_step=global_step,
    )
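With this structure, the evaluation pass runs once per outer iteration, i.e. every num_steps_per_iteration training steps, right after the checkpoint check; shrinking num_steps_per_iteration therefore evaluates more often at the cost of training throughput.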
Fix for image rendering in TensorBoard:
if record_summaries:
  imgs = features[fields.InputDataFields.image][:3]
  imgs = tf.div(
      tf.subtract(imgs, tf.reduce_min(imgs)),
      tf.subtract(tf.reduce_max(imgs), tf.reduce_min(imgs)))
  tf.compat.v2.summary.image(
      name='train_input_images', step=global_step, data=imgs, max_outputs=3)
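On resizing the 3000x3000 images: shrinking them before generating the TFRecords would not necessarily mean re-labelling, because Pascal VOC-style XML annotations store plain pixel coordinates that can be rescaled together with the image. A rough, untested sketch (the images/*.jpg layout, the 640 target size, and the one-XML-per-image assumption are all hypothetical):

import glob
import os
import xml.etree.ElementTree as ET

from PIL import Image

TARGET = 640  # hypothetical target size; run this on a copy of your dataset

for img_path in glob.glob("images/*.jpg"):
    img = Image.open(img_path)
    sx, sy = TARGET / img.width, TARGET / img.height
    # Overwrites the image in place with the resized version.
    img.resize((TARGET, TARGET), Image.BILINEAR).save(img_path)

    # Scale every bounding-box coordinate in the matching VOC XML,
    # then update the recorded image size.
    xml_path = os.path.splitext(img_path)[0] + ".xml"
    tree = ET.parse(xml_path)
    for tag, scale in (("xmin", sx), ("xmax", sx), ("ymin", sy), ("ymax", sy)):
        for node in tree.iter(tag):
            node.text = str(round(float(node.text) * scale))
    for tag, value in (("width", TARGET), ("height", TARGET)):
        for node in tree.iter(tag):
            node.text = str(value)
    tree.write(xml_path)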