Loaded keras model fails to continue training, dimensions mismatch

Question

I'm using tensorflow with keras to train a char-RNN on Google Colab. I train my model for 10 epochs and save it with 'model.save()', as shown in the documentation for saving models. Immediately after, I load it again just to check, then try to call model.fit() on the loaded model, and I get a "Dimensions must be equal" error using the exact same training set. The training data is in a tensorflow dataset organised in batches, as shown in the documentation for tf datasets. Here is a minimal working example:

import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

X = np.random.randint(0,50,(10000))

seq_len = 150
batch_size = 20
dataset = tf.data.Dataset.from_tensor_slices(X)
dataset = dataset.batch(seq_len+1,drop_remainder=True)
dataset = dataset.map(lambda x: (x[:-1],x[1:]))
dataset = dataset.shuffle(20).batch(batch_size,drop_remainder=True)

def make_model(vocabulary_size,embedding_dimension,rnn_units,batch_size,stateful):
  model = Sequential()
  model.add(Embedding(vocabulary_size,embedding_dimension,
                      batch_input_shape=[batch_size,None]))
  model.add(LSTM(rnn_units,return_sequences=True,stateful=stateful))
  model.add(Dense(vocabulary_size))
  model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                optimizer='adam',metrics=['accuracy'])
  model.summary()
  return model

vocab_size = 51
emb_dim = 20
rnn_units = 10
model = make_model(vocab_size,emb_dim,rnn_units,batch_size,False)

model.fit(dataset,epochs=10)
model.save('/content/test_model')
model2 = tf.keras.models.load_model('/content/test_model')
model2.fit(dataset,epochs=10)

The first training call, 'model.fit()', runs fine, but the last line raises an error:

ValueError: Dimensions must be equal, but are 20 and 150 for '{{node Equal}} = Equal[T=DT_INT64, incompatible_shape_error=true](ArgMax, ArgMax_1)' with input shapes: [20], [20,150].

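I'd like to be able to resume training later, as my real dataset is much larger. Therefore, saving only the weights is not an ideal option.

Any advice? Thanks!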

Answer 1

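If you have saved checkpoints, you can resume from those checkpoints with the reduced dataset. Your neural network/layers and dimensions should stay the same.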

Answer 2

The problem is the 'accuracy' metric. For some reason, the dimensions of the predictions are mishandled when the model is loaded with this metric, as I found in this thread (see last comment). Running model.compile() on the loaded model with the same metric allows training to continue. However, it shouldn't be necessary to compile the model again. Moreover, recompiling means that the optimiser state is lost (as explained in a linked post), so this is not very useful for resuming training.
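To make the shape mismatch concrete, here is a small NumPy-only sketch. It assumes (this is my reading of the error message, not something confirmed from the Keras source) that the loaded model ends up applying the one-hot style of accuracy, which takes an argmax over the labels as well as over the predictions. The helper names are made up for illustration; the shapes are the ones from the question.

```python
import numpy as np

batch_size, seq_len, vocab_size = 20, 150, 51
rng = np.random.default_rng(0)

# Integer class labels, one per character position: shape (20, 150)
y_true = rng.integers(0, vocab_size, size=(batch_size, seq_len))
# Logits over the vocabulary for every position: shape (20, 150, 51)
y_pred = rng.normal(size=(batch_size, seq_len, vocab_size))

def sparse_categorical_accuracy(y_true, y_pred):
    # Labels are already class indices, so compare them directly
    # with the argmax of the predictions. Both sides are (20, 150).
    return (y_true == y_pred.argmax(axis=-1)).mean()

def categorical_accuracy_shapes(y_true, y_pred):
    # The one-hot variant (assumed here) also takes an argmax over
    # the labels, which collapses the sequence axis of integer labels.
    return y_true.argmax(axis=-1).shape, y_pred.argmax(axis=-1).shape

sparse_acc = sparse_categorical_accuracy(y_true, y_pred)
true_shape, pred_shape = categorical_accuracy_shapes(y_true, y_pred)
print(true_shape, pred_shape)  # (20,) (20, 150)
```

The two shapes printed at the end are the same [20] and [20,150] that the Equal node complains about, which is at least consistent with the wrong accuracy variant being picked after loading.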

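On the other hand, using 'sparse_categorical_accuracy' from the start works just fine. I am able to load the model and continue training without recompiling. In hindsight, this choice is more appropriate, given that the output of my last layer is the logits of a distribution over characters. So this is not a binary classification problem but a multiclass one. Nonetheless, I verified that 'accuracy' and 'sparse_categorical_accuracy' return the same values in my specific example. Thus, I believe keras internally converts accuracy to categorical accuracy, but something goes wrong when doing this on a model that has just been loaded, which forces the need to recompile.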
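For reference, this is roughly what the change looks like. It is a trimmed sketch of make_model from the question with only the metric string swapped; I dropped the batch_size and stateful arguments to keep it short, and the tiny random data at the bottom is only there to show the metric being reported. Saving and loading work as in the question and are not repeated here.

```python
import numpy as np
import tensorflow as tf

def make_model(vocabulary_size, embedding_dimension, rnn_units):
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocabulary_size, embedding_dimension),
        tf.keras.layers.LSTM(rnn_units, return_sequences=True),
        tf.keras.layers.Dense(vocabulary_size),  # logits over characters
    ])
    model.compile(
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        optimizer="adam",
        # Name the metric explicitly instead of the generic 'accuracy'
        metrics=["sparse_categorical_accuracy"],
    )
    return model

model = make_model(51, 20, 10)

# Tiny random batch: 4 sequences of 5 integer character ids each
x = np.random.randint(0, 51, size=(4, 5))
y = np.random.randint(0, 51, size=(4, 5))
history = model.fit(x, y, epochs=1, verbose=0)
print(sorted(history.history.keys()))
```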

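I also verified that if the saved model was compiled with 'accuracy', loading the model and recompiling with 'sparse_categorical_accuracy' allows training to resume. However, as mentioned before, this discards the state of the optimiser, and I suspect it would be no better than just making a new model and loading only the weights from the saved model.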

lstm

keras

tensorflow

recurrent-neural-network

resuming-training