Save and load a custom TensorFlow model (autoregressive seq2seq multivariate time series GRU/RNN)

Question

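I am trying to implement an autoregressive seq2seq RNN to forecast time series data, as shown in this TensorFlow tutorial. The model consists of a custom model class that inherits from tf.keras.Model; its code can be found below. I have used this model for time series forecasting with (15, 108) input data (dimensions: (sequence length, input units)) and (10, 108) output data.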

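Although training succeeds, I have not managed to save and reload the model so that I can evaluate the previously trained model on a test set. I have searched the internet for solutions, but none of them has worked so far. This may be because it is a custom model trained with eager execution, since several threads I found could not resolve saving a model under these conditions.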

Could anyone tell me how to resolve this? Any help is greatly appreciated, thanks!

So far I have loaded the model with tf.keras.models.load_model(filepath) and tried the following options for saving. The code for both options can be found below:

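1. Saving with the keras.callbacks.ModelCheckpoint callback. However, this only produced a .ckpt.data-00000-of-00001 file and a .ckpt.index file (so no .meta or .pb file), and I could not open these files.
2. Saving with the tf.saved_model.save function and then loading the model, which results in the following error: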


    WARNING:tensorflow:Looks like there is an object (perhaps variable or layer) that is shared between different layers/models. This may cause issues when restoring the variable values. Object: <tensorflow.python.keras.layers.recurrent_v2.GRUCell object at 0x7fac1c052eb8>
    WARNING:tensorflow:Inconsistent references when loading the checkpoint into this object graph. Either the Trackable object references in the Python program have changed in an incompatible way, or the checkpoint was generated in an incompatible program.
    
    Two checkpoint references resolved to different objects (<tensorflow.python.keras.layers.recurrent_v2.GRUCell object at 0x7fac20648048> and <tensorflow.python.keras.layers.recurrent_v2.GRUCell object at 0x7fac1c052eb8>).
    ---------------------------------------------------------------------------
    AssertionError                            Traceback (most recent call last)
    <ipython-input-7-ac3fac428428> in <module>()
          1 model = '/content/drive/My Drive/Colab Notebooks/Master thesis/NN_data/saved_model/s-20210208-194847'
    ----> 2 new_model = tf.keras.models.load_model(model)
    
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/saving/save.py in load_model(filepath, custom_objects, compile, options)
        210       if isinstance(filepath, six.string_types):
        211         loader_impl.parse_saved_model(filepath)
    --> 212         return saved_model_load.load(filepath, compile, options)
        213 
        214   raise IOError(
    
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/saving/saved_model/load.py in load(path, compile, options)
        142   for node_id, loaded_node in keras_loader.loaded_nodes.items():
        143     nodes_to_load[keras_loader.get_path(node_id)] = loaded_node
    --> 144   loaded = tf_load.load_partial(path, nodes_to_load, options=options)
        145 
        146   # Finalize the loaded layers and remove the extra tracked dependencies.
    
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/saved_model/load.py in load_partial(export_dir, filters, tags, options)
        763     A dictionary mapping node paths from the filter to loaded objects.
        764   """
    --> 765   return load_internal(export_dir, tags, options, filters=filters)
        766 
        767 
    
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/saved_model/load.py in load_internal(export_dir, tags, options, loader_cls, filters)
        888       try:
        889         loader = loader_cls(object_graph_proto, saved_model_proto, export_dir,
    --> 890                             ckpt_options, filters)
        891       except errors.NotFoundError as err:
        892         raise FileNotFoundError(
    
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/saved_model/load.py in __init__(self, object_graph_proto, saved_model_proto, export_dir, ckpt_options, filters)
        159 
        160     self._load_all()
    --> 161     self._restore_checkpoint()
        162 
        163     for node in self._nodes:
    
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/saved_model/load.py in _restore_checkpoint(self)
        486     else:
        487       load_status = saver.restore(variables_path, self._checkpoint_options)
    --> 488     load_status.assert_existing_objects_matched()
        489     checkpoint = load_status._checkpoint
        490 
    
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/training/tracking/util.py in assert_existing_objects_matched(self)
        806           ("Some Python objects were not bound to checkpointed values, likely "
        807            "due to changes in the Python program: %s") %
    --> 808           (list(unused_python_objects),))
        809     return self
        810 
    
    AssertionError: Some Python objects were not bound to checkpointed values, likely due to changes in the Python program: [<tf.Variable 'gru_cell_2/bias:0' shape=(2, 648) dtype=float32, numpy=
    array([[0., 0., 0., ..., 0., 0., 0.],
           [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)>, <tf.Variable 'gru_cell_2/kernel:0' shape=(108, 648) dtype=float32, numpy=
    array([[ 0.01252341, -0.08176371, -0.00800528, ...,  0.00473534,
            -0.05456369,  0.00294461],
           [-0.02453795,  0.018851  ,  0.07198527, ...,  0.05603079,
            -0.01973856,  0.06883802],
           [-0.06897871, -0.05892187,  0.08031332, ...,  0.07844239,
            -0.06783205, -0.04394536],
           ...,
           [ 0.02367028,  0.07758808, -0.04011653, ..., -0.04074041,
            -0.00352754, -0.03324065],
           [ 0.08708382, -0.0113907 , -0.08592559, ..., -0.07780273,
            -0.07923603,  0.0435034 ],
           [-0.04890796,  0.03626117,  0.01753877, ..., -0.06336015,
            -0.07234246, -0.05076948]], dtype=float32)>, <tf.Variable 'gru_cell_2/recurrent_kernel:0' shape=(216, 648) dtype=float32, numpy=
    array([[ 0.03453588,  0.01778516, -0.0326081 , ..., -0.02686813,
             0.05017178,  0.01470701],
           [ 0.05364531, -0.02074206, -0.06292176, ..., -0.04883411,
            -0.03006711,  0.03091787],
           [ 0.03928262,  0.01209829,  0.01992464, ..., -0.01726807,
            -0.04125096,  0.00977487],
           ...,
           [ 0.03076804,  0.00477963, -0.03565286, ..., -0.00938745,
            -0.06442262, -0.0124091 ],
           [ 0.03680094, -0.04894238,  0.01765203, ..., -0.11990541,
            -0.01906408,  0.10198548],
           [ 0.00818893, -0.03801145,  0.10376499, ..., -0.01700275,
            -0.02600842, -0.0169891 ]], dtype=float32)>]

Shortened code used to (successfully) train and save the model:


    model = FeedBack(units=neurons, out_steps=output_len, num_features=108, act_dense=output_activation)
      
    model.compile(loss=loss,optimizer=tf.optimizers.Adam(lr=lr), metrics=['mean_absolute_error', 'mean_absolute_percentage_error', keras.metrics.RootMeanSquaredError()])
    
    cp_callback = keras.callbacks.ModelCheckpoint(filepath=checkpoint_path, save_best_only=True, verbose=0)
    earlyStopping = keras.callbacks.EarlyStopping(monitor='val_loss', patience=6, verbose=0,  min_delta=1e-9, mode='auto')
    
    # OPTION 1: USE ModelCheckpoint
    r = model.fit(x=train_x, y=train_y, batch_size=32, shuffle=False, epochs=1,validation_data = (test_x, test_y), callbacks=[earlyStopping, cp_callback], verbose=0)
        
    # OPTION 2: USE tf.saved_model.save()
    !mkdir -p saved_model
    model.save('/content/drive/My Drive/Colab Notebooks/Master thesis/NN_data/saved_model/s-%s' % timestring)
    tf.saved_model.save(model, '/content/drive/My Drive/Colab Notebooks/Master thesis/NN_data/saved_model/s-%s' % timestring)

This is the code used to build the model:


    class FeedBack(tf.keras.Model):
        def __init__(self, units, out_steps, num_features, act_dense):
            super().__init__()
            self.out_steps = out_steps
            self.units = units
            self.num_features = num_features
            self.act_dense = act_dense
            self.gru_cell = tf.keras.layers.GRUCell(units)
            # Also wrap the GRUCell in an RNN to simplify the `warmup` method.
            self.gru_rnn = tf.keras.layers.RNN(self.gru_cell, return_state=True)
            self.dense = tf.keras.layers.Dense(num_features, activation=act_dense) #self.num_features?
    
        def warmup(self, inputs):
            # inputs.shape => (batch, time, features)
            # x.shape => (batch, gru_units)
            x, state = self.gru_rnn(inputs)
            
            # predictions.shape => (batch, features)
            prediction = self.dense(x)
            return prediction, state
    
        def call(self, inputs, training=None):
            # Collect the unrolled outputs in a Python list.
            predictions = []
            # Initialize the GRU state.
            prediction, state = self.warmup(inputs)
    
            # Insert the first prediction
            predictions.append(prediction)
    
            # Run the rest of the prediction steps
            for _ in range(1, self.out_steps):
                # Use the last prediction as input.
                x = prediction
                # Execute one gru step.
                x, state = self.gru_cell(x, states=state, training=training)
                # Convert the gru output to a prediction.
                prediction = self.dense(x)
                # Add the prediction to the output
                predictions.append(prediction)
    
            # predictions.shape => (time, batch, features)
            predictions = tf.stack(predictions)
            # predictions.shape => (batch, time, features)
            predictions = tf.transpose(predictions, [1, 0, 2])
            return predictions

Answer 1

I would say the problem lies in the file path you pass to the ModelCheckpoint callback; it should be an hdf5 file.

In my case:


    ckpt_name = '/work/.../weights/{}.hdf5'.format(log_name)

    ...
    callbacks = [
        TensorBoardImage(...),
        tf.keras.callbacks.ModelCheckpoint(filepath=ckpt_name)
    ]
    ...
    model.fit(train_generator, validation_data=validation_generator, validation_freq=1, epochs=FLAGS['epochs'],
              callbacks=callbacks)
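One caveat worth checking for a subclassed model such as FeedBack: as far as I know, TensorFlow 2.x Keras can write a whole model to HDF5 only for Sequential and functional models, so a weights-only file is the usual workaround. Below is a minimal sketch of that round trip. `TinyModel`, the layer sizes, and the temporary path are hypothetical stand-ins, and the `.weights.h5` suffix is an assumption chosen because both older and newer Keras releases should accept it.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf


class TinyModel(tf.keras.Model):
    """Hypothetical stand-in for a subclassed model such as FeedBack."""

    def __init__(self, units, num_features):
        super().__init__()
        self.gru = tf.keras.layers.GRU(units)
        self.dense = tf.keras.layers.Dense(num_features)

    def call(self, inputs, training=None):
        return self.dense(self.gru(inputs, training=training))


# Dummy batch with the shapes from the question: (batch, 15 steps, 108 features).
x = np.random.rand(4, 15, 108).astype("float32")

model = TinyModel(units=8, num_features=108)
expected = model(x).numpy()  # calling the model once creates its variables

# Save only the weights; the architecture lives in the Python class.
weights_path = os.path.join(tempfile.mkdtemp(), "demo.weights.h5")
model.save_weights(weights_path)

# Rebuild from the class definition, create the variables, then restore.
restored = TinyModel(units=8, num_features=108)
restored(x)
restored.load_weights(weights_path)

max_diff = float(np.abs(restored(x).numpy() - expected).max())
print("largest output difference after reload:", max_diff)
```

During training the same idea can be expressed by passing `save_weights_only=True` to `ModelCheckpoint`; reloading then means constructing the model from its class and calling `load_weights`, not `load_model`.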

Answer 2

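The root of the problem, I think, is that in __init__ you wrap gru_cell in layers.RNN. As a result the same gru_cell is used twice: once in warmup() (through gru_rnn) and again in call(). This is not a problem for training, but as you noticed, it fails when the model is saved and reloaded.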
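The sharing can be observed directly. This is a minimal sketch, assuming TensorFlow 2.x; the layer size and input shape are arbitrary values picked for illustration.

```python
import tensorflow as tf

cell = tf.keras.layers.GRUCell(4)
rnn = tf.keras.layers.RNN(cell, return_state=True)

# The wrapper keeps a reference to the very object it was given, not a copy.
same_object = rnn.cell is cell

# Run the wrapper once so the cell creates its variables.
rnn(tf.zeros([1, 3, 2]))

# The wrapper and the cell expose the same variable objects.
rnn_ids = {id(v) for v in rnn.trainable_variables}
cell_ids = {id(v) for v in cell.trainable_variables}
shared_variables = rnn_ids == cell_ids

print(same_object, shared_variables)
```

In the question's model, `self.gru_cell` and `self.gru_rnn.cell` are therefore one object reached along two paths, which appears to be what the "shared between different layers/models" warning refers to.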

Replace your custom RNN layer with layers.GRU

Change this:

    def __init__(self, units, out_steps, num_features, act_dense):
        ...
        self.gru_cell = tf.keras.layers.GRUCell(units)
        # Also wrap the GRUCell in an RNN to simplify the `warmup` method.
        self.gru_rnn = tf.keras.layers.RNN(self.gru_cell, return_state=True)
        ...

to this:

    def __init__(self, units, out_steps, num_features, act_dense):
        ...
        self.gru_cell = tf.keras.layers.GRUCell(units)
        self.gru_rnn = tf.keras.layers.GRU(units, return_state=True)
        ...
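For reference, here is a compact, self-contained sketch of the whole model with this change applied. It folds warmup() into call() for brevity, and the default `act_dense=None`, the layer sizes, and the zero-filled dummy batch are assumptions made for illustration, not values from the question.

```python
import tensorflow as tf


class FeedBack(tf.keras.Model):
    def __init__(self, units, out_steps, num_features, act_dense=None):
        super().__init__()
        self.out_steps = out_steps
        # Two separate layers: GRU for the warm-up, GRUCell for the feedback loop.
        self.gru_rnn = tf.keras.layers.GRU(units, return_state=True)
        self.gru_cell = tf.keras.layers.GRUCell(units)
        self.dense = tf.keras.layers.Dense(num_features, activation=act_dense)

    def call(self, inputs, training=None):
        # Warm-up over the input sequence: (batch, time, features) -> (batch, units).
        x, state = self.gru_rnn(inputs, training=training)
        prediction = self.dense(x)
        predictions = [prediction]
        # Feed each prediction back in as the next input.
        for _ in range(1, self.out_steps):
            x, state = self.gru_cell(prediction, states=state, training=training)
            prediction = self.dense(x)
            predictions.append(prediction)
        # (time, batch, features) -> (batch, time, features)
        return tf.transpose(tf.stack(predictions), [1, 0, 2])


model = FeedBack(units=16, out_steps=10, num_features=108)
output = model(tf.zeros([2, 15, 108]))
output_shape = tuple(output.shape)
num_trainable = len(model.trainable_variables)
print(output_shape, num_trainable)
```

The model now holds eight trainable variables: three for the GRU layer, three for the GRUCell, and two for the Dense layer.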

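(Edit)
Note: the gru_cell and gru_rnn layers will no longer share weights as they did in the original code. In that sense the original version is preferable, because the same GRUCell operates on the whole sequence.

In my version, layers.GRU operates on the input sequence, after which the state is passed to layers.GRUCell. The drawback is that the weights of layers.GRUCell have to be optimized (learned) separately and do not benefit from sharing weights with layers.GRU, and vice versa.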
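If weight sharing matters, one possible alternative (not part of this answer, and untested against the load error in the question) is to drop the RNN wrapper entirely and step a single GRUCell by hand during the warm-up. The sketch below assumes a statically known sequence length; `SharedCellFeedBack`, the default `act_dense=None`, and the sizes are hypothetical.

```python
import tensorflow as tf


class SharedCellFeedBack(tf.keras.Model):
    """Hypothetical variant: one GRUCell drives both warm-up and feedback."""

    def __init__(self, units, out_steps, num_features, act_dense=None):
        super().__init__()
        self.units = units
        self.out_steps = out_steps
        self.gru_cell = tf.keras.layers.GRUCell(units)
        self.dense = tf.keras.layers.Dense(num_features, activation=act_dense)

    def call(self, inputs, training=None):
        # Warm-up: start from a zero state and step the cell over the inputs.
        state = tf.zeros([tf.shape(inputs)[0], self.units], dtype=inputs.dtype)
        for t in range(inputs.shape[1]):  # assumes a static sequence length
            x, state = self.gru_cell(inputs[:, t, :], states=state, training=training)
        prediction = self.dense(x)
        predictions = [prediction]
        # Feedback: the same cell consumes its own predictions.
        for _ in range(1, self.out_steps):
            x, state = self.gru_cell(prediction, states=state, training=training)
            prediction = self.dense(x)
            predictions.append(prediction)
        # (time, batch, features) -> (batch, time, features)
        return tf.transpose(tf.stack(predictions), [1, 0, 2])


model = SharedCellFeedBack(units=16, out_steps=10, num_features=108)
output_shape = tuple(model(tf.zeros([2, 15, 108])).shape)
num_trainable = len(model.trainable_variables)
print(output_shape, num_trainable)
```

Here the model has five trainable variables (three for the cell, two for the Dense layer), and the cell is referenced from one attribute only. The trade-off is that the warm-up loop is unrolled in Python, which may be slower to trace for long input sequences.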
