Is it possible to resume training from a checkpoint model in TensorFlow?
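I am working on automatic segmentation. I was training a model over the weekend when the power went out. I had trained the model for more than 50 hours and was saving it every 5 epochs with the following line: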
model_checkpoint = ModelCheckpoint('test_{epoch:04}.h5', monitor=observe_var, mode='auto', save_weights_only=False, save_best_only=False, period = 5)
I am loading the saved model with the following line:
model = load_model('test_{epoch:04}.h5', custom_objects = {'dice_coef_loss': dice_coef_loss, 'dice_coef': dice_coef})
I have loaded all of my data, with the training data split into train_x for the scans and train_y for the labels. When I run the line:
loss, dice_coef = model.evaluate(train_x, train_y, verbose=1)
I get the error:
ResourceExhaustedError: OOM when allocating tensor with shape[32,8,128,128,128] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node model/conv3d_1/Conv3D (defined at <ipython-input-1-4a66b6c9f26b>:275) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[Op:__inference_distributed_function_3673]
Function call stack:
distributed_function
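For a sense of scale, the tensor named in the error message can be sized with simple arithmetic. This is a minimal sketch, assuming that "type float" means 32-bit floats (4 bytes per element) and that the leading dimension of 32 is the batch size; `tensor_bytes` is a hypothetical helper name, not part of TensorFlow.

```python
from math import prod

def tensor_bytes(shape, bytes_per_element=4):
    """Return the memory needed for one dense tensor of the given shape."""
    return prod(shape) * bytes_per_element

# Shape copied from the OOM message; the first dimension is assumed to be the batch.
oom_shape = (32, 8, 128, 128, 128)
full = tensor_bytes(oom_shape)
single = tensor_bytes((1,) + oom_shape[1:])

print(f"batch of 32: {full / 2**30:.2f} GiB")   # 2.00 GiB
print(f"batch of 1:  {single / 2**20:.0f} MiB")  # 64 MiB
```

This covers only the single tensor that failed to allocate. The model keeps other tensors on the GPU at the same time, so the total footprint is larger than these figures.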
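This error basically means you are running out of memory, so you need to evaluate in smaller batches. The default batch size is 32; try passing a smaller batch size: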
evaluate(train_x, train_y, batch_size=<batch size>)
batch_size: Integer or None. Number of samples per gradient update. If
unspecified, batch_size will default to 32.
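On the title question: because the checkpoints were written with save_weights_only=False, each file should hold the full model, so training can in principle continue from the most recent one. Note that load_model needs a real filename rather than the 'test_{epoch:04}.h5' template. The following is a minimal sketch for picking the latest file, assuming the checkpoints follow that template; `latest_checkpoint` and the file list are hypothetical and only for illustration.

```python
import re

# Matches names produced by the 'test_{epoch:04}.h5' template, e.g. 'test_0045.h5'.
CHECKPOINT_RE = re.compile(r"^test_(\d+)\.h5$")

def latest_checkpoint(filenames):
    """Return (filename, epoch) for the highest-numbered checkpoint, or None."""
    found = []
    for name in filenames:
        match = CHECKPOINT_RE.match(name)
        if match:
            found.append((int(match.group(1)), name))
    if not found:
        return None
    epoch, name = max(found)
    return name, epoch

# Hypothetical directory listing, for illustration only.
files = ["test_0005.h5", "test_0010.h5", "notes.txt", "test_0045.h5"]
result = latest_checkpoint(files)
print(result)
```

The returned filename can then be passed to load_model with the same custom_objects as before. The epoch number can be passed to model.fit as initial_epoch so that epoch counting continues where it stopped, for example model.fit(train_x, train_y, initial_epoch=epoch, epochs=total_epochs, callbacks=[model_checkpoint]), where total_epochs is a placeholder for the intended final epoch count.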