Tensorflow resume training with MirroredStrategy()
I trained my model on a Linux OS so that I could use MirroredStrategy() and train on 2 GPUs. Training stopped at epoch 610. I want to resume training, but when I load my model and evaluate it, the kernel dies. I am using a Jupyter notebook. If I reduce my training dataset, the code runs, but it only runs on 1 GPU. Is my distribution strategy saved in the model I am loading, or do I have to include it again?
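For reference, a minimal check (assuming TF 2.x) to confirm how many GPUs are visible and how many replicas MirroredStrategy actually picks up:

import tensorflow as tf

# Quick sanity check: list visible GPUs and the number of replicas the
# strategy will mirror across (should be 2 on my machine).
print(tf.config.list_physical_devices('GPU'))
mirrored_strategy = tf.distribute.MirroredStrategy()
print('Replicas in sync:', mirrored_strategy.num_replicas_in_sync)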
Update
I tried including MirroredStrategy():
mirrored_strategy = tf.distribute.MirroredStrategy()
with mirrored_strategy.scope():
    new_model = load_model('\models\model_0610.h5',
                           custom_objects={'dice_coef_loss': dice_coef_loss,
                                           'dice_coef': dice_coef},
                           compile=True)
    new_model.evaluate(train_x, train_y, batch_size=2, verbose=1)
New error
Error when including MirroredStrategy():
ValueError: 'handle' is not available outside the replica context or a 'tf.distribute.Stragety.update()' call.
Source code:
from tensorflow.keras import backend as K
from tensorflow.keras.models import load_model
from tensorflow.keras.callbacks import ModelCheckpoint

smooth = 1

def dice_coef(y_true, y_pred):
    y_true_f = K.flatten(y_true)
    y_pred_f = K.flatten(y_pred)
    intersection = K.sum(y_true_f * y_pred_f)
    return (2. * intersection + smooth) / (K.sum(y_true_f) + K.sum(y_pred_f) + smooth)

def dice_coef_loss(y_true, y_pred):
    return (1. - dice_coef(y_true, y_pred))

new_model = load_model('\models\model_0610.h5',
                       custom_objects={'dice_coef_loss': dice_coef_loss,
                                       'dice_coef': dice_coef},
                       compile=True)
new_model.evaluate(train_x, train_y, batch_size=2, verbose=1)

observe_var = 'dice_coef'
strategy = 'max'  # greater dice_coef is better
model_resume_dir = '//models_resume//'

model_checkpoint = ModelCheckpoint(model_resume_dir + 'resume_{epoch:04}.h5',
                                   monitor=observe_var, mode='auto',
                                   save_weights_only=False,
                                   save_best_only=False, period=2)

new_model.fit(train_x, train_y, batch_size=2, epochs=5000, verbose=1, shuffle=True,
              validation_split=.15, callbacks=[model_checkpoint])

new_model.save(model_resume_dir + 'final_resume.h5')
new_model.evaluate() together with compile = True when loading the model was causing the problem. I set compile = False and added a compile line from my original script.
from tensorflow.keras.optimizers import Adam

mirrored_strategy = tf.distribute.MirroredStrategy()
with mirrored_strategy.scope():
    new_model = load_model('\models\model_0610.h5',
                           custom_objects={'dice_coef_loss': dice_coef_loss,
                                           'dice_coef': dice_coef},
                           compile=False)
    new_model.compile(optimizer=Adam(learning_rate=1e-4), loss=dice_coef_loss,
                      metrics=[dice_coef])
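With the model compiled inside the scope, training can be resumed with the same fit() call as before. A sketch (my assumption, not part of the original script) would pass initial_epoch so the epoch counter and checkpoint filenames continue from epoch 610:

# Sketch: resume where the previous run stopped so epoch numbering and the
# ModelCheckpoint filenames continue from epoch 610 (initial_epoch is an
# assumption on my part).
new_model.fit(train_x, train_y, batch_size=2, epochs=5000, verbose=1, shuffle=True,
              validation_split=.15, callbacks=[model_checkpoint],
              initial_epoch=610)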