How to create checkpoint filenames with epoch or batch number when using ModelCheckpoint() with save_freq as an integer?
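I have tensorflow 2 v.2.5.0 installed and am using a Jupyter notebook with Python 3.10.
I am practicing using the save_freq argument as an integer, following an online course (they use tensorflow 2.0.0, where the following code runs fine, but it does not work in my more recent version).
Here is the link to the relevant documentation, which has no example of using an integer for save_freq.
https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/ModelCheckpoint
Here is my code: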
import tensorflow as tf
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten, Conv2D, MaxPooling2D
# Use the CIFAR-10 dataset
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
x_train = x_train / 255.0
x_test = x_test / 255.0
# using a smaller subset -- speeds things up
x_train = x_train[:10000]
y_train = y_train[:10000]
x_test = x_test[:1000]
y_test = y_test[:1000]
# define a function that creates a new instance of a simple CNN.
def create_model():
    model = Sequential([
        Conv2D(filters=16, input_shape=(32, 32, 3), kernel_size=(3, 3),
               activation='relu', name='conv_1'),
        Conv2D(filters=8, kernel_size=(3, 3), activation='relu', name='conv_2'),
        MaxPooling2D(pool_size=(4, 4), name='pool_1'),
        Flatten(name='flatten'),
        Dense(units=32, activation='relu', name='dense_1'),
        Dense(units=10, activation='softmax', name='dense_2')
    ])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model

# Create Tensorflow checkpoint object with epoch and batch details
checkpoint_5000_path = 'model_checkpoints_5000/cp_{epoch:02d}-{batch:04d}'
checkpoint_5000 = ModelCheckpoint(filepath = checkpoint_5000_path,
                                  save_weights_only = True,
                                  save_freq = 5000,
                                  verbose = 1)

# Create and fit model with checkpoint
model = create_model()
model.fit(x = x_train,
          y = y_train,
          epochs = 3,
          validation_data = (x_test, y_test),
          batch_size = 10,
          callbacks = [checkpoint_5000])
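I want to create and save checkpoint files whose names include the epoch and batch number.
However, the files are not created and I get 'File not found'. After I created the directory model_checkpoints_5000 manually, no files were added to it.
(We can check this by running '!dir -a model_checkpoints_5000' (on Windows) or 'ls -lh model_checkpoints_5000' (on Linux).)
I also tried changing the path to 'model_checkpoints_5000/cp_{epoch:02d}', but it still did not save a file for each epoch.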
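The same directory check can also be done from Python, which avoids the difference between the Windows and Linux shell commands. This is only a minimal sketch: the helper name list_checkpoint_files is hypothetical, and it simply treats a missing directory as "no checkpoints yet" instead of raising an error.

```python
from pathlib import Path


def list_checkpoint_files(directory):
    """Return the sorted file names in `directory`, or [] if it does not exist."""
    path = Path(directory)
    if not path.is_dir():
        # The directory is missing, so nothing has been saved there.
        return []
    return sorted(p.name for p in path.iterdir() if p.is_file())


# An empty list means no checkpoint file has been written yet.
print(list_checkpoint_files("model_checkpoints_5000"))
```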
I then tried to follow the save_freq example from the "Checkpoint callback options" section of this tutorial:
https://www.tensorflow.org/tutorials/keras/save_and_load
However, it still does not save any files for me.
import os

checkpoint_path = "model_checkpoints_5000/cp-{epoch:02d}.ckpt"
checkpoint_dir = os.path.dirname(checkpoint_path)
batch_size = 10
checkpoint_5000 = ModelCheckpoint(filepath = checkpoint_path,
                                  save_weights_only = True,
                                  save_freq = 500*batch_size,
                                  verbose = 1)
model = create_model()
model.fit(x = x_train,
          y = y_train,
          epochs = 3,
          validation_data = (x_test, y_test),
          batch_size = batch_size,
          callbacks = [checkpoint_5000])
Any suggestions for making it work, other than downgrading my tensorflow?
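The save_freq parameter is too large. In recent TensorFlow versions an integer save_freq is counted in batches, not samples, and this run has only 10000 // 10 = 1000 batches per epoch. It needs to be less than or equal to training_samples // batch_size. Maybe try something like this: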
batch_size = 10
checkpoint_5000_path = 'model_checkpoints_5000/cp_{epoch:02d}-{batch:1d}'
checkpoint_5000 = ModelCheckpoint(filepath = checkpoint_5000_path,
                                  save_weights_only = True,
                                  save_freq = len(x_train) // batch_size // batch_size,
                                  verbose = 1)
model = create_model()
model.fit(x = x_train,
          y = y_train,
          epochs = 3,
          validation_data = (x_test, y_test),
          batch_size = batch_size,
          callbacks = [checkpoint_5000])
Epoch 1/3
97/1000 [=>............................] - ETA: 3s - loss: 2.2801 - accuracy: 0.1536
Epoch 00001: saving model to model_checkpoints_5000/cp_01-100
198/1000 [====>.........................] - ETA: 3s - loss: 2.2347 - accuracy: 0.1500
Epoch 00001: saving model to model_checkpoints_5000/cp_01-200
288/1000 [=======>......................] - ETA: 3s - loss: 2.1979 - accuracy: 0.1736
Epoch 00001: saving model to model_checkpoints_5000/cp_01-300
397/1000 [==========>...................] - ETA: 2s - loss: 2.1337 - accuracy: 0.2020
Epoch 00001: saving model to model_checkpoints_5000/cp_01-400
497/1000 [=============>................] - ETA: 2s - loss: 2.0952 - accuracy: 0.2197
Epoch 00001: saving model to model_checkpoints_5000/cp_01-500
598/1000 [================>.............] - ETA: 1s - loss: 2.0496 - accuracy: 0.2395
Epoch 00001: saving model to model_checkpoints_5000/cp_01-600
698/1000 [===================>..........] - ETA: 1s - loss: 2.0122 - accuracy: 0.2520
Epoch 00001: saving model to model_checkpoints_5000/cp_01-700
703/1000 [====================>.........] - ETA: 1s - loss: 2.0082 - accuracy: 0.2538
...
In this example, a checkpoint is created every x steps within each epoch.
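The arithmetic behind this can be sketched in plain Python, without TensorFlow. This is a simplified model, not the Keras implementation: it assumes that an integer save_freq counts batches rather than samples, that the batch counter carries over from one epoch to the next, and that the {epoch} and {batch} fields are filled with 1-based numbers. The function name planned_checkpoints is hypothetical.

```python
def planned_checkpoints(n_samples, batch_size, epochs, save_freq,
                        template="cp_{epoch:02d}-{batch:04d}"):
    """List the checkpoint names a batch-counting save_freq would produce."""
    steps_per_epoch = -(-n_samples // batch_size)  # ceiling division
    names = []
    batches_since_save = 0  # assumed to carry over between epochs
    for epoch in range(1, epochs + 1):
        for batch in range(1, steps_per_epoch + 1):
            batches_since_save += 1
            if batches_since_save >= save_freq:
                names.append(template.format(epoch=epoch, batch=batch))
                batches_since_save = 0
    return names


# 10000 samples with batch size 10 give 1000 batches per epoch, 3000 in total.
print(planned_checkpoints(10000, 10, 3, 5000))     # [] because 5000 is never reached
print(planned_checkpoints(10000, 10, 3, 1000))     # one name at the end of each epoch
print(planned_checkpoints(10000, 10, 1, 100)[:3])  # first three names of epoch 1
```

Under these assumptions, save_freq = 5000 produces no file in a 3000-batch run, which matches the empty directory in the question, while save_freq = 100 produces ten files per epoch.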